Rajesh Balamohan created HIVE-6872:
--------------------------------------

             Summary: Explore options of optimizing FileSinkOperator-->getDynOutPaths()
                 Key: HIVE-6872
                 URL: https://issues.apache.org/jira/browse/HIVE-6872
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Priority: Critical

1. Download hive-testbench from https://github.com/cartershanklin/hive-testbench
2. Generate data using "./tpcds-setup.sh 10 /user/hive/external partitioned"
3. Data population for most tables with "partition + bucket + sorted data" runs a lot slower, even at a scale factor of 10 on a 20-node cluster.

The bottleneck seems to be in FileSinkOperator-->getDynOutPaths(), where it tries to close the FSPath writers. Every call takes roughly 150-200 ms.

set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=4096;

With the above settings, one of the data loads (for the web_sales table) spent roughly 4096 * 150 ms ≈ 614 seconds (about 10 minutes) closing the writers sequentially.

The purpose of this jira is to figure out options for optimizing this code path.
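
One option that could be explored is closing the writers concurrently instead of one after another. The sketch below is illustrative only and is not Hive code: `ParallelCloseSketch`, `closeAll`, and `estimateMillis` are hypothetical names, the writers are modeled as plain `java.io.Closeable` objects, and it assumes each writer can be closed independently of the others. The estimate is idealized and ignores any contention on the file system side.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; not taken from Hive's FileSinkOperator.
class ParallelCloseSketch {

    // Idealized wall-clock estimate: ceil(writers / threads) rounds,
    // each costing one close. threads = 1 models the sequential case.
    static long estimateMillis(int writers, long perCloseMs, int threads) {
        long rounds = (writers + threads - 1) / threads;
        return rounds * perCloseMs;
    }

    // Closes every writer on a fixed-size thread pool and blocks until all
    // closes have finished. The first failure is rethrown as an IOException.
    static void closeAll(List<? extends Closeable> writers, int threads)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Closeable w : writers) {
                // Returning null makes this a Callable, so close() may throw.
                pending.add(pool.submit(() -> {
                    w.close();
                    return null;
                }));
            }
            IOException first = null;
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (ExecutionException e) {
                    if (first == null) {
                        first = new IOException("close failed", e.getCause());
                    }
                }
            }
            if (first != null) {
                throw first;
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Sequential estimate for the numbers reported in this issue.
        System.out.println(estimateMillis(4096, 150, 1) + " ms sequential (estimate)");
        // Same workload with an assumed pool of 8 threads.
        System.out.println(estimateMillis(4096, 150, 8) + " ms with 8 threads (estimate)");

        // Fake writers whose close() just sleeps, standing in for slow closes.
        List<Closeable> writers = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            writers.add(() -> {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        closeAll(writers, 4);
        System.out.println("closed " + writers.size() + " writers");
    }
}
```

Whether this helps in practice depends on where the 150-200 ms per close is actually spent. If the time is dominated by round trips to the file system, a small pool may hide most of the latency. If it is dominated by flushing buffered data, the gain could be smaller.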