In SQLConf.scala, I found this:

    val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
      key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
      defaultValue = Some(32),
      doc = "The degree of parallelism for schema merging and partition discovery of " +
        "Parquet data sources.")
It looks like it may not help your case, though. FYI

On Fri, Jan 22, 2016 at 3:09 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Ted,
>
> I am using Spark 1.5.2 as currently available in AWS EMR 4.x. The data is
> in TSV format.
>
> I do not see any effect of the work already done in this area for data
> stored in Hive: it takes around 50 minutes just to collect the table
> metadata over a 40-node cluster, and the time is much the same for smaller
> clusters of size 20.
>
> Spending 50 minutes just to collect the metadata is fine once, but we
> should then be able to store the object (which is in memory after reading
> the metadata for the first time) so that next time we can simply restore
> the object instead of reading the metadata again. Or we should be able to
> parallelize the collection of metadata so that it does not take such a
> long time.
>
> Please advise.
>
> Regards,
> Gourav
>
> On Fri, Jan 22, 2016 at 10:15 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> There have been optimizations in this area, such as:
>> https://issues.apache.org/jira/browse/SPARK-8125
>>
>> You can also look at the parent issue.
>>
>> Which Spark release are you using?
>>
>>> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have a Spark table (created from hiveContext) with a couple of hundred
>>> partitions and a few thousand files.
>>>
>>> When I run a query on the table, Spark spends a lot of time (as seen in
>>> the pyspark output) collecting these files from the several partitions.
>>> Only after this does the query start running.
>>>
>>> Is there a way to store the object that has collected all these
>>> partitions and files, so that every time I restart the job I can load
>>> this object instead of spending 50 minutes just collecting the files
>>> before the query starts to run?
>>>
>>> Please do let me know in case the question is not quite clear.
>>>
>>> Regards,
>>> Gourav Sengupta
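For reference, the threshold quoted above is a session configuration, so it can be changed before running a query. A minimal sketch of how one might lower it to force parallel file listing for a Parquet-backed data source table; this assumes an existing Spark 1.x `sqlContext` and uses a hypothetical table name `my_table` (and, as noted above, it likely does not help a Hive TSV table):

```scala
// Sketch only: assumes a Spark 1.x SQLContext is in scope and a table
// named "my_table" (hypothetical) is registered. When the number of paths
// to list reaches the threshold, discovery runs as a distributed job on
// the cluster instead of serially on the driver; setting it to 1 forces
// the parallel path.
sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "1")

// Subsequent queries against the data source pick up the new setting.
val df = sqlContext.sql("SELECT COUNT(*) FROM my_table")
```

The same key can also be passed at submit time via `--conf`, which avoids changing job code.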