What happens is that Spark opens each file in order to merge their schemas.
Unfortunately, Spark assumes the files are local and therefore fast to
access, which makes this step extremely slow on S3.
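To see what that merge actually produces, here is a minimal sketch, assuming
a Spark 2.x SparkSession named `spark`; the column names and /tmp paths are
hypothetical:

    import spark.implicits._

    // Two part files with different (overlapping) schemas.
    Seq((1, "a")).toDF("id", "name")
      .write.parquet("/tmp/merged/part1")
    Seq((2, 3.0)).toDF("id", "score")
      .write.parquet("/tmp/merged/part2")

    // With mergeSchema=true, Spark reads the footer of every part file
    // and takes the union of the schemas: id, name, score.
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("/tmp/merged/part1", "/tmp/merged/part2")
    merged.printSchema()

It is that per-file footer read that gets expensive on S3.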

If you know all the files share the same schema (e.g. they are the output of
a previous job), you can tell Spark to skip this check by setting the option
"mergeSchema" to "false", as in
read.format("parquet").option("mergeSchema", "false").load("....path")




