Andrew Or resolved SPARK-9926.
------------------------------
          Resolution: Fixed
       Fix Version/s: 2.0.0
    Target Version/s: 2.0.0

> Parallelize file listing for partitioned Hive table
> ---------------------------------------------------
>
>                  Key: SPARK-9926
>                  URL: https://issues.apache.org/jira/browse/SPARK-9926
>              Project: Spark
>           Issue Type: Improvement
>           Components: SQL
>     Affects Versions: 1.4.1, 1.5.0
>             Reporter: Cheolsoo Park
>             Assignee: Ryan Blue
>              Fix For: 2.0.0
>
> In Spark SQL, short queries like {{select * from table limit 10}} run very
> slowly against partitioned Hive tables because of file listing. In
> particular, if a large number of partitions are scanned on storage like S3,
> the queries run extremely slowly. Here are some example benchmarks from my
> environment:
> * Parquet-backed Hive table
> * Partitioned by dateint and hour
> * Stored on S3
>
> ||\# of partitions||\# of files||runtime||query||
> |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 10;|
> |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
> |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;|
>
> The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive
> partition path and groups them into a UnionRDD, so the input files of all
> partitions are listed sequentially. In other tools such as Hive and Pig,
> this can be solved by setting
> [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]
> high. But in Spark, since each HadoopRDD lists only one partition path,
> setting this property does not help.
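For reference, the gist of the approach is to fan the per-partition listStatus calls out across a thread pool instead of looping over partition paths one at a time. The sketch below is a minimal illustration of that idea, not the actual patch that landed for this issue; the helper name {{listPartitionsInParallel}} and the default thread count are made up for this example (the thread count plays the same role that {{mapreduce.input.fileinputformat.list-status.num-threads}} plays in Hive and Pig), while the Hadoop FileSystem calls ({{getFileSystem}}, {{listStatus}}) are the real API.

{code:scala}
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical helper sketching the fix: list the files under many Hive
// partition paths concurrently instead of sequentially. Each listStatus is
// one independent round trip to the storage system, which is exactly the
// kind of high-latency call (e.g. against S3) that benefits from overlap.
def listPartitionsInParallel(
    partitionPaths: Seq[Path],
    conf: Configuration,
    numThreads: Int = 20): Seq[FileStatus] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    // Kick off one listing per partition path; up to numThreads run at once.
    val futures = partitionPaths.map { path =>
      Future {
        val fs = path.getFileSystem(conf)
        fs.listStatus(path).toSeq
      }
    }
    // Wait for all listings and flatten into a single file list.
    Await.result(Future.sequence(futures), Duration.Inf).flatten
  } finally {
    pool.shutdown()
  }
}
{code}

With sequential listing, the 240-partition query above pays 240 round trips back to back; with a pool of 20 threads, the same listings overlap and the wall-clock cost drops roughly in proportion, which is why the property helps Hive and Pig on S3-backed tables.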