[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from filesyst...

2017-10-23 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19560#discussion_r146449741
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -120,22 +120,41 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
         if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
       val table = relation.tableMeta
       val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
-        try {
-          val hadoopConf = session.sessionState.newHadoopConf()
-          val tablePath = new Path(table.location)
-          val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
-          fs.getContentSummary(tablePath).getLength
-        } catch {
-          case e: IOException =>
-            logWarning("Failed to get table size from hdfs.", e)
-            session.sessionState.conf.defaultSizeInBytes
-        }
+        getSizeFromHdfs(table.location)
       } else {
         session.sessionState.conf.defaultSizeInBytes
       }

       val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
       relation.copy(tableMeta = withStats)
+
+    case relation: HiveTableRelation
+        if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.nonEmpty &&
+          session.sessionState.conf.verifyStatsFromFileSystemWhenBroadcastJoin &&
+          relation.tableMeta.stats.get.sizeInBytes < session.sessionState.conf.autoBroadcastJoinThreshold =>
+      val table = relation.tableMeta
+      val sizeInBytes = getSizeFromHdfs(table.location)
--- End diff --

Yes, I think it's a good idea.
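The refactoring under discussion extracts the try/catch around the filesystem lookup into a `getSizeFromHdfs` helper. A self-contained sketch of that pattern follows; it is not Spark's actual code, and it substitutes `java.io.File` for Hadoop's `FileSystem.getContentSummary`, with `defaultSizeInBytes` standing in for the session's configured fallback:

```scala
import java.io.{File, IOException}

object TableSizeSketch {
  // Hypothetical stand-in for session.sessionState.conf.defaultSizeInBytes.
  val defaultSizeInBytes: Long = Long.MaxValue

  // Sum file sizes under a directory, mirroring what
  // FileSystem.getContentSummary(path).getLength reports on HDFS.
  def directorySize(dir: File): Long = {
    val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
    children.map(f => if (f.isDirectory) directorySize(f) else f.length()).sum
  }

  // The extracted-helper pattern: try the filesystem, fall back on error.
  def getSizeFromFs(location: String): Long =
    try directorySize(new File(location))
    catch { case _: IOException => defaultSizeInBytes }
}
```

The point of the extraction is that both pattern-match arms in `DetermineTableStats` can now share one lookup-with-fallback path instead of duplicating the try/catch.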


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from filesyst...

2017-10-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19560#discussion_r146448976
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -120,22 +120,41 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
         if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
       val table = relation.tableMeta
       val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
-        try {
-          val hadoopConf = session.sessionState.newHadoopConf()
-          val tablePath = new Path(table.location)
-          val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
-          fs.getContentSummary(tablePath).getLength
-        } catch {
-          case e: IOException =>
-            logWarning("Failed to get table size from hdfs.", e)
-            session.sessionState.conf.defaultSizeInBytes
-        }
+        getSizeFromHdfs(table.location)
       } else {
         session.sessionState.conf.defaultSizeInBytes
       }

       val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
       relation.copy(tableMeta = withStats)
+
+    case relation: HiveTableRelation
+        if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.nonEmpty &&
+          session.sessionState.conf.verifyStatsFromFileSystemWhenBroadcastJoin &&
+          relation.tableMeta.stats.get.sizeInBytes < session.sessionState.conf.autoBroadcastJoinThreshold =>
+      val table = relation.tableMeta
+      val sizeInBytes = getSizeFromHdfs(table.location)
--- End diff --

If the metadata statistics are wrong, fetching the size from the filesystem on every query seems like a burden. Can we instead show a message to users suggesting that they update the table statistics?
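The suggestion above is to warn rather than silently re-scan. A minimal sketch of such a check, with an illustrative staleness threshold and message (the function name, threshold, and wording are all hypothetical, not from the PR):

```scala
object StatsCheckSketch {
  // Return a user-facing hint when the catalog estimate is far below the
  // on-disk size; the 2x threshold here is an arbitrary illustration.
  def staleStatsHint(table: String, catalogBytes: BigInt, fsBytes: Long): Option[String] = {
    if (catalogBytes > 0 && BigInt(fsBytes) > catalogBytes * 2)
      Some(s"Statistics for $table look stale (catalog: $catalogBytes bytes, " +
        s"filesystem: $fsBytes bytes); consider running " +
        s"ANALYZE TABLE $table COMPUTE STATISTICS.")
    else None
  }
}
```

Emitting a hint like this once per plan keeps the cost at a single filesystem probe while still steering users toward refreshing their statistics.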


---




[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from filesyst...

2017-10-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19560#discussion_r146448519
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -187,6 +187,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val VERIFY_STATS_FROM_FILESYSTEM_WHEN_BROADCASTJOIN =
--- End diff --

This config name implies that verification happens only for broadcast joins. However, it seems the statistics are verified regardless of whether a broadcast join is performed.
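If the check really applies to all plans, a name that drops the broadcast-join qualifier would match the behavior. A sketch of what such an entry might look like in `SQLConf`'s builder style (the key, field name, and doc string here are hypothetical alternatives, not the PR's actual config):

```scala
  val VERIFY_STATS_FROM_FILESYSTEM =
    buildConf("spark.sql.statistics.verifyFromFileSystem")
      .doc("When true, verify catalog table statistics against the size " +
        "reported by the filesystem before using them for query planning.")
      .booleanConf
      .createWithDefault(false)
```

Keeping the key under the existing `spark.sql.statistics.*` namespace would also group it with the other statistics-related settings.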


---
