What happens is that Spark opens each file in order to merge their schemas.
Unfortunately, Spark assumes the files are local and therefore fast to
access, which makes this step extremely slow on S3.
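To see what that merge actually produces, here is a minimal sketch, assuming
a Spark 2.x SparkSession named `spark`; the column names and /tmp paths are
hypothetical:

    import spark.implicits._

    // Two part files with different (overlapping) schemas.
    Seq((1, "a")).toDF("id", "name")
      .write.parquet("/tmp/merged/part1")
    Seq((2, 3.0)).toDF("id", "score")
      .write.parquet("/tmp/merged/part2")

    // With mergeSchema=true, Spark reads the footer of every part file
    // and takes the union of the schemas: id, name, score.
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("/tmp/merged/part1", "/tmp/merged/part2")
    merged.printSchema()

It is that per-file footer read that gets expensive on S3.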

If you know all the files share the same schema (e.g. they are the output of
a previous job), you can tell Spark to skip this check by setting the option
"mergeSchema" to "false", as in
read.format("parquet").option("mergeSchema", "false").load("....path")




