I have a PySpark job that inserts data into a Hive partitioned table using an
`INSERT OVERWRITE` statement.
The Spark job loads the data quickly (in about 15 minutes) into a temp directory
(~/.hive-***) in S3, but it is very slow moving the data from the temp directory
to the target path: that move alone takes more than 40 minutes.
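
For reference, the insert is issued through Spark SQL along the lines of the sketch below (the database, table, view, and column names are placeholders, not my real schema):

from pyspark.sql import SparkSession

# Sketch of the INSERT OVERWRITE path described above; all object names are
# placeholders. Dynamic partitioning picks the target partitions from the
# SELECT output.
spark = SparkSession.builder \
    .appName("insert-overwrite-sketch") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("""
    INSERT OVERWRITE TABLE my_db.my_table
    PARTITION (event_date)
    SELECT id, payload, event_date
    FROM staging_view
""")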
The Spark job writes data to S3 as below:
source_data_df_to_write.select(target_columns_list) \
    .write.partitionBy(target_partition_cols_list) \
    .format("ORC") \
    .save(self.table_location_prefix + self.target_table, mode="append")
My DataFrame can sometimes have null values in some of its columns.
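
As a small, self-contained illustration of what I mean (placeholder schema and a local path, not my real data), Spark routes rows whose partition column is null into a `__HIVE_DEFAULT_PARTITION__` directory:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("null-partition-demo").getOrCreate()

# Two rows, one with a null value in the partition column.
demo_df = spark.createDataFrame([
    Row(id=1, load_date="2023-01-01"),
    Row(id=2, load_date=None),
])

demo_df.write.partitionBy("load_date") \
    .format("ORC") \
    .mode("append") \
    .save("/tmp/null_partition_demo")

# Resulting layout, roughly:
#   /tmp/null_partition_demo/load_date=2023-01-01/part-*.orc
#   /tmp/null_partition_demo/load_date=__HIVE_DEFAULT_PARTITION__/part-*.orc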
Here is the code with the correct DataFrame, starting with the SparkSession setup:
self.session = SparkSession \
    .builder \
    .appName(self.app_name) \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \