[jira] [Updated] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size
[ https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Peña updated HIVE-16014:
---
    Resolution: Fixed
    Fix Version/s: 2.2.0
    Status: Resolved (was: Patch Available)

Thanks [~vihangk1]. I committed to master.

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size
> --
>
>          Key: HIVE-16014
>          URL: https://issues.apache.org/jira/browse/HIVE-16014
>      Project: Hive
>   Issue Type: Improvement
>     Reporter: Vihang Karajgaonkar
>     Assignee: Vihang Karajgaonkar
>      Fix For: 2.2.0
>
>  Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch, HIVE-16014.03.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value to
> determine the pool size, as shown below:
> {noformat}
>   private void checkPartitionDirs(Path basePath, Set<Path> allDirs, int maxDepth) throws IOException, HiveException {
>     ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>     basePaths.add(basePath);
>     Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
>     // Here we just reuse the THREAD_COUNT configuration for
>     // HIVE_MOVE_FILES_THREAD_COUNT
>     int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 15);
>     // Check if too low config is provided for move files. 2x CPU is reasonable max count.
>     poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>         Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add a table's missing partitions from the
> filesystem. In such a case, different pool sizes for HMSHandler and
> HiveMetastoreChecker can affect performance. For example, if
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and
> {{hive.mv.files.thread}} is much higher, like 100, or vice versa, the smaller
> pool will become the bottleneck. It would be good to use
> {{hive.metastore.fshandler.threads}} to size the pool for
> HiveMetastoreChecker, since the number of missing partitions and the number of
> partitions to be added will most likely be the same. In such a case, the
> performance of the query will be best when both pool sizes are the same.
> Since both configs can be tuned individually, they are very likely to
> differ. But since there is a strong correlation between the amount of work
> done by HiveMetastoreChecker and the HiveMetastore.add_partitions call, it
> might be a good idea to use {{hive.metastore.fshandler.threads}} for the pool
> size instead of {{hive.mv.files.thread}}.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
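
The proposal in the issue description can be illustrated with a minimal sketch. This is not the committed patch: `CheckerPoolSize` and `resolvePoolSize` are hypothetical names, `java.util.Properties` stands in for `HiveConf`, and the fallback of 15 is an assumption made for the example. Only the config key and the sizing rule (0 is passed through, otherwise at least twice the CPU count) are taken from the issue text.

```java
import java.util.Properties;

public class CheckerPoolSize {
    // Config key named in the issue.
    static final String FS_HANDLER_THREADS = "hive.metastore.fshandler.threads";
    // Assumed fallback for this sketch only.
    static final int ASSUMED_DEFAULT = 15;

    // Hypothetical helper: Properties stands in for HiveConf.
    static int resolvePoolSize(Properties conf, int availableProcessors) {
        int configured = Integer.parseInt(
            conf.getProperty(FS_HANDLER_THREADS, String.valueOf(ASSUMED_DEFAULT)));
        // Same rule as the quoted snippet: 0 is passed through unchanged,
        // any other value is raised to at least 2x the CPU count.
        return configured == 0 ? 0 : Math.max(configured, availableProcessors * 2);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FS_HANDLER_THREADS, "100");
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("checker pool size = " + resolvePoolSize(conf, cpus));
    }
}
```

Reading the same key that sizes the HMSHandler pool means a single setting governs both the directory scan and the add_partitions work, which is the coupling the description argues for.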
[jira] [Updated] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size
[ https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vihang Karajgaonkar updated HIVE-16014:
---
    Attachment: HIVE-16014.03.patch

Attaching the patch after rebasing to the latest code in master branch and resolving conflicts
[jira] [Updated] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size
[ https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vihang Karajgaonkar updated HIVE-16014:
---
    Attachment: HIVE-16014.02.patch
[jira] [Updated] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size
[ https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vihang Karajgaonkar updated HIVE-16014:
---
    Status: Patch Available (was: Open)

Hi [~spena], can you please review? It's a simple patch to use a different config for sizing the thread pool.
[jira] [Updated] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size
[ https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vihang Karajgaonkar updated HIVE-16014:
---
    Attachment: HIVE-16014.01.patch