Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Walid Lezzar
I’m using Spark 2.3 with schema merging set to false. I don’t think Spark is actually reading any files, but it does try to list them all one by one, and that is super slow on S3! Pointing to a single partition manually is not an option, as it requires me to know the partitioning scheme in advance.

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Yong Zhang
What version of Spark are you using? You can search for "spark.sql.parquet.mergeSchema" on https://spark.apache.org/docs/latest/sql-programming-guide.html. Starting from Spark 1.5 the default is already "false", which means Spark shouldn't scan all the Parquet files to generate the schema.

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread ayan guha
You can specify the first folder directly and read it.

On Fri, 27 Apr 2018 at 9:42 pm, Walid LEZZAR wrote:
> Hi,
>
> I have a parquet on S3 partitioned by day. I have 2 years of data (->
> about 1000 partitions). With spark, when I just want to know the schema of
> this

How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Walid LEZZAR
Hi, I have a Parquet table on S3 partitioned by day, with 2 years of data (about 1000 partitions). With Spark, when I just want to know the schema of this Parquet table, without even asking for a single row of data, Spark tries to list all the partitions and the nested partitions of the table. Which