Which metastore are you using?

On Thu, Apr 25, 2019 at 09:02, Juho Autio <juho.au...@rovio.com> wrote:

> Would anyone be able to answer this question about the suboptimal
> implementation of insertInto?
>
> On Thu, Apr 18, 2019 at 4:45 PM Juho Autio <juho.au...@rovio.com> wrote:
>
>> Hi,
>>
>> My job writes ~10 partitions with insertInto. With the same input and
>> output data, the total duration of the job varies greatly depending on
>> how many partitions the target table has.
>>
>> Target table with ~10 partitions:
>> 1 min 30 s
>>
>> Target table with ~10000 partitions:
>> 13 min 0 s
>>
>> It seems that Spark always fetches the full list of partitions in the
>> target table. While this happens, the cluster is basically idle: the
>> driver is listing partitions and the executors have nothing to do.
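>>
>> To check that the metastore listing alone accounts for the idle time, a
>> rough approach (just a sketch, not from the actual job; same session and
>> table as below) is to time a bare partition listing:
>>
>> import time
>>
>> t0 = time.time()
>> # SHOW PARTITIONS asks the Hive metastore for the complete partition list
>> n = spark.sql('SHOW PARTITIONS ' + target_table_name).count()
>> print('listed %d partitions in %.1f s' % (n, time.time() - t0))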
>>
>> Here's a thread dump of the driver, taken during such an idle period:
>> https://gist.github.com/juhoautio/bdbc8eb339f163178905322fc393da20
>>
>> Is there any way to optimize this currently? Is this a known issue? Are
>> there any plans to improve it?
>>
>> My code is essentially:
>>
>> from pyspark.sql import SparkSession
>>
>> # Dynamic partition overwrite: only the partitions present in out_df
>> # should be replaced, not the whole table.
>> spark = SparkSession.builder \
>>     .config('spark.sql.hive.caseSensitiveInferenceMode', 'NEVER_INFER') \
>>     .config('hive.exec.dynamic.partition', 'true') \
>>     .config('spark.sql.sources.partitionOverwriteMode', 'dynamic') \
>>     .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
>>     .enableHiveSupport() \
>>     .getOrCreate()
>>
>> # overwrite=True with dynamic partitionOverwriteMode replaces only the
>> # partitions that out_df writes to
>> out_df.write \
>>     .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
>>     .insertInto(target_table_name, overwrite=True)
>>
>> The table was originally created from Spark with saveAsTable.
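>>
>> For reference, roughly how it was created (a sketch; the partition column
>> name 'dt' is made up here):
>>
>> # partitionBy() makes dt a Hive partition column of the new table
>> out_df.write \
>>     .partitionBy('dt') \
>>     .saveAsTable(target_table_name)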
>>
>> Does Spark need to know anything about the existing partitions, though?
>> As a manual workaround I would write the files directly to the partition
>> locations, first deleting any existing files in those partitions, and
>> then call the metastore to ALTER TABLE ... ADD IF NOT EXISTS PARTITION.
>> That requires no prior knowledge of the existing partitions; a sketch
>> follows below.
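>>
>> A rough sketch of that workaround (assuming a single partition column
>> 'dt'; partition_values and table_location are made-up names for the
>> values being written and the table's root path):
>>
>> for dt in partition_values:
>>     # 1) Overwrite the data files of this one partition directly;
>>     #    mode('overwrite') removes whatever was in the directory before.
>>     #    The dt column is dropped since it lives in the path, not the files.
>>     out_df.where(out_df['dt'] == dt).drop('dt').write \
>>         .mode('overwrite') \
>>         .parquet('%s/dt=%s' % (table_location, dt))
>>     # 2) Register the partition if the metastore doesn't know it yet;
>>     #    no listing of the other ~10000 partitions is needed.
>>     spark.sql(
>>         "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (dt='%s') "
>>         "LOCATION '%s/dt=%s'" % (
>>             target_table_name, dt, table_location, dt))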
>>
>> Thanks.
>>
>
