In SQLConf.scala, I found this:

  val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
    key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
    defaultValue = Some(32),
    doc = "The degree of parallelism for schema merging and partition discovery of " +
      "Parquet data sources.")

But it looks like it may not help your case: the doc string mentions Parquet
data sources, while your data is in TSV format.
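
If you still want to experiment with it, the threshold can be set like any
other SQL conf. A minimal sketch, assuming a Spark 1.5.x shell where
sqlContext is already in scope (the value 64 is only an example):

  // Sketch: raise the parallel partition discovery threshold.
  // Tune the value to the number of paths in your table.
  sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")

The same key can also be passed at submit time via
--conf spark.sql.sources.parallelPartitionDiscovery.threshold=64.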

FYI
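
Regarding your point below about storing the object after the first metadata
read: within a single long-running application you can at least avoid paying
that cost on every query by caching the table after it has been resolved
once. A rough sketch (the table name "my_table" is a placeholder, and this
does not survive a restart of the application):

  // The first action pays the listing / metadata cost and materializes
  // an in-memory cache of the table.
  sqlContext.cacheTable("my_table")
  sqlContext.sql("SELECT count(*) FROM my_table").show()

  // Later queries in the same application are served from the cache
  // instead of re-reading the files.
  sqlContext.sql("SELECT * FROM my_table LIMIT 10").show()

I am not aware of a supported way in 1.5.2 to persist that resolved state
across restarts.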

On Fri, Jan 22, 2016 at 3:09 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi Ted,
>
> I am using Spark 1.5.2, as currently available in AWS EMR 4.x. The data
> is in TSV format.
>
> I do not see any effect of the work already done in this area for data
> stored in Hive: it takes around 50 mins just to collect the table
> metadata over a 40-node cluster, and the time is much the same for
> smaller clusters of 20 nodes.
>
> Spending 50 mins to collect the metadata is fine once, but we should
> then be able to store the object (which is in memory after reading the
> metadata for the first time) so that next time we can simply restore it
> instead of reading the metadata all over again. Alternatively, we should
> be able to parallelize the metadata collection so that it does not take
> such a long time.
>
> Please advise.
>
> Regards,
> Gourav
>
> On Fri, Jan 22, 2016 at 10:15 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> There have been optimizations in this area, such as:
>> https://issues.apache.org/jira/browse/SPARK-8125
>>
>> You can also look at the parent issue.
>>
>> Which Spark release are you using?
>>
>> > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>> >
>> >
>> > Hi,
>> >
>> > I have a Spark table (created from a hiveContext) with a couple of
>> > hundred partitions and a few thousand files.
>> >
>> > When I run a query on the table, Spark spends a lot of time (as seen
>> > in the pyspark output) collecting these files from the several
>> > partitions. Only after this does the query start running.
>> >
>> > Is there a way to store the object which has collected all these
>> > partitions and files, so that every time I restart the job I can load
>> > this object instead of spending 50 mins just collecting the files
>> > before the query starts to run?
>> >
>> >
>> > Please do let me know if the question is not clear.
>> >
>> > Regards,
>> > Gourav Sengupta
>> >
>>
>
>
