I'm reading the Spark documentation
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), but it
doesn't mention how to read a Parquet file in parallel with SparkSession.
Would --num-executors alone just work, or do additional parameters need to be
set on the SparkSession as well?
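For context, this is roughly how I'm reading the file now (the path and app
name are just placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-read")
  .getOrCreate()

// Submitted with e.g.: spark-submit --num-executors 8 ...
// Does the Parquet read get split across those executors automatically?
val mydf = spark.read.parquet("/path/to/data.parquet")

// Inspect how many partitions (and hence parallel tasks) the read produced
println(mydf.rdd.getNumPartitions)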
Also, if I want to write data to a database in parallel, are the options
'numPartitions' and 'batchsize' enough to improve write performance? For
example:
mydf.format("jdbc").
option("driver", "org.postgresql.Driver").
option("url", url).
option("dbtable", table_name).
option("user", username).
option("password", password).
option("numPartitions", N) .
option("batchsize", M)
save
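My understanding from the docs is that numPartitions caps the number of
parallel JDBC connections (Spark coalesces the DataFrame down to that count
if it has more partitions) and batchsize controls how many rows go into each
INSERT round trip, but I'm not sure that's the whole story. For instance,
would I also need to repartition explicitly before the write, along these
lines (reusing the same placeholder names)?

// Sketch of an explicit repartition before the JDBC write; whether this is
// needed on top of numPartitions is exactly my question.
mydf.repartition(N)
  .write.format("jdbc")
  .options(Map(
    "driver"    -> "org.postgresql.Driver",
    "url"       -> url,
    "dbtable"   -> table_name,
    "user"      -> username,
    "password"  -> password,
    "batchsize" -> M.toString))
  .save()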
From the Spark docs
(https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases),
these are the only two parameters I can find that affect database write
performance.
I appreciate any suggestions.