[Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.

2018-11-04 Thread ehbhaskar
I have a PySpark job that inserts data into a Hive partitioned table using an `Insert Overwrite` statement. The Spark job loads data quickly (in 15 mins) into a temp directory (~/.hive-***) in S3, but it is very slow at moving data from the temp directory to the target path; it takes more than 40 mins to move
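
For readers who want to see the pattern being described, here is a minimal PySpark sketch of a dynamic-partition `Insert Overwrite` into a Hive table whose location points at S3. All names (target_db.target_table, event_date, the source view) are placeholders, not the original poster's code.

    from pyspark.sql import SparkSession

    # Hypothetical reproduction of the pattern described above; names are placeholders.
    spark = SparkSession.builder \
        .appName("insert-overwrite-sketch") \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .enableHiveSupport() \
        .getOrCreate()

    # Assumed source data registered as a temporary view.
    spark.table("staging_db.source_table").createOrReplaceTempView("source_view")

    # Spark writes the output under a hidden staging directory (the .hive-... path
    # mentioned above) and then moves the files into the final partition paths.
    # On S3 that "move" is a copy plus delete, which is the slow step described here.
    spark.sql("""
        INSERT OVERWRITE TABLE target_db.target_table
        PARTITION (event_date)
        SELECT col_a, col_b, event_date
        FROM source_view
    """)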

[Spark SQL] Couldn't save dataframe with null columns to S3.

2018-11-05 Thread ehbhaskar
I have a Spark job that writes data to S3 as below.

    source_data_df_to_write.select(target_columns_list) \
        .write.partitionBy(target_partition_cols_list) \
        .format("ORC").save(self.table_location_prefix + self.target_table, mode="append")

My dataframe can sometimes have null values for columns.
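
As a standalone illustration of the write pattern above (not the original job; placeholder names and a local path stand in for the real S3 location), the sketch below writes a dataframe containing null values as partitioned ORC. Null values inside a typed column are handled fine; the usual failure is a column whose type was inferred as NullType because every value was None, which the ORC data source cannot write, so an explicit schema or a cast is the common workaround.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("null-columns-orc-sketch").getOrCreate()

    # Explicit schema: every column has a concrete type even when its values are null.
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("country", StringType(), True),  # partition column
    ])
    df = spark.createDataFrame([(1, None, "US"), (2, "bob", "US"), (3, None, "DE")], schema)

    # Same write pattern as the snippet above, with a placeholder path.
    df.write.partitionBy("country") \
        .format("ORC") \
        .save("/tmp/orc_sketch/target_table", mode="append")

    # If a column was inferred as NullType (all values None, no schema supplied),
    # casting it to a concrete type before writing is a common workaround.
    df_fixed = df.withColumn("name", F.col("name").cast("string"))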

Re: [Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.

2018-11-05 Thread ehbhaskar
Here's the code with the correct data frame.

    self.session = SparkSession \
        .builder \
        .appName(self.app_name) \
        .config("spark.dynamicAllocation.enabled", "false") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
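
The archived snippet above is cut off mid-chain. Purely as a guess at how such a builder is typically completed (the `enableHiveSupport()` and `getOrCreate()` calls below are assumptions, not the poster's actual code):

    from pyspark.sql import SparkSession

    class ExampleJob:
        """Hypothetical wrapper mirroring the shape of the snippet above."""

        def __init__(self, app_name):
            self.app_name = app_name
            # One plausible completion of the truncated builder chain: Hive support
            # is required for INSERT OVERWRITE against a Hive table, and
            # getOrCreate() materialises the session.
            self.session = SparkSession \
                .builder \
                .appName(self.app_name) \
                .config("spark.dynamicAllocation.enabled", "false") \
                .config("hive.exec.dynamic.partition.mode", "nonstrict") \
                .enableHiveSupport() \
                .getOrCreate()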