Re: How to read the schema of a partitioned dataframe without listing all the partitions?

2018-04-27 Thread Walid Lezzar
I’m using Spark 2.3 with schema merging set to false. Indeed, I don’t think
Spark is reading any files, but it tries to list them all one by one, and that
is very slow on S3!

Pointing to a single partition manually is not an option: it requires me to
know the partitioning scheme in order to append it to the path, and in that
case Spark doesn’t include the partitioning column in the resulting dataframe.
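
As an aside, Spark’s basePath read option covers the missing-column part of
this: when the reader is pointed at a subdirectory of the table, the
partitioning column is reconstructed from the path. A minimal sketch (in
spark-shell or Zeppelin, where `spark` is predefined; bucket and partition
names are hypothetical):

// Read a single day's folder, but tell Spark where the table root is so
// that the `day` partition column is kept in the schema.
val oneDay = spark.read
  .option("basePath", "s3://bucket/events")        // hypothetical table root
  .parquet("s3://bucket/events/day=2018-04-26")    // only this folder is listed

oneDay.printSchema()  // includes `day`, unlike a plain read of the subfolder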

> On 27 Apr 2018, at 16:07, Yong Zhang wrote:
> 
> Which version of Spark are you using?
> 
> You can search for "spark.sql.parquet.mergeSchema" on 
> https://spark.apache.org/docs/latest/sql-programming-guide.html
> 
> Starting from Spark 1.5, the default is already "false", which means Spark 
> shouldn't scan all the Parquet files to generate the schema.
> 
> Yong
> 
> From: Walid LEZZAR 
> Sent: Friday, April 27, 2018 7:42 AM
> To: spark users
> Subject: How to read the schema of a partitioned dataframe without listing 
> all the partitions?
>  
> Hi,
> 
> I have a Parquet dataset on S3, partitioned by day, with two years of data 
> (about 1000 partitions). When I just want to know the schema of this dataset, 
> without even asking for a single row of data, Spark tries to list all the 
> partitions and nested partitions, which makes it very slow just to build the 
> dataframe object in Zeppelin.
> 
> Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a 
> single partition, give me the schema of that partition, and consider it the 
> schema of the whole dataframe"? (I don't care about schema merging; it's off, 
> by the way.)
> 
> Thanks.
> Walid.


Re: How to read the schema of a partitioned dataframe without listing all the partitions?

2018-04-27 Thread Yong Zhang
Which version of Spark are you using?


You can search for "spark.sql.parquet.mergeSchema" on 
https://spark.apache.org/docs/latest/sql-programming-guide.html


Starting from Spark 1.5, the default is already "false", which means Spark 
shouldn't scan all the Parquet files to generate the schema.
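
For reference, a minimal sketch of both ways to control it (the path is
hypothetical, and `spark` is the predefined SparkSession):

// Session-wide default for all Parquet reads in this SparkSession
spark.conf.set("spark.sql.parquet.mergeSchema", "false")

// Or per read; the data source option overrides the session setting
val df = spark.read
  .option("mergeSchema", "false")
  .parquet("s3://bucket/events")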


Yong


From: Walid LEZZAR 
Sent: Friday, April 27, 2018 7:42 AM
To: spark users
Subject: How to read the schema of a partitioned dataframe without listing all 
the partitions?

Hi,

I have a Parquet dataset on S3, partitioned by day, with two years of data 
(about 1000 partitions). When I just want to know the schema of this dataset, 
without even asking for a single row of data, Spark tries to list all the 
partitions and nested partitions, which makes it very slow just to build the 
dataframe object in Zeppelin.

Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a 
single partition, give me the schema of that partition, and consider it the 
schema of the whole dataframe"? (I don't care about schema merging; it's off 
by the way.)

Thanks.
Walid.


Re: How to read the schema of a partitioned dataframe without listing all the partitions?

2018-04-27 Thread ayan guha
You can specify the first partition folder directly and read just that.
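
For example, a minimal sketch of that approach (paths are hypothetical): only
the one folder is listed, and its schema is taken as representative of the
whole table. As noted above, the basePath option keeps the `day` partition
column:

val schema = spark.read
  .option("basePath", "s3://bucket/events")
  .parquet("s3://bucket/events/day=2018-04-26")  // lists just this folder
  .schema

// The schema can then be supplied explicitly, which skips schema inference
// (directory listing for partition discovery may still happen at planning).
val df = spark.read.schema(schema).parquet("s3://bucket/events")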

On Fri, 27 Apr 2018 at 9:42 pm, Walid LEZZAR  wrote:

> Hi,
>
> I have a Parquet dataset on S3, partitioned by day, with two years of data
> (about 1000 partitions). When I just want to know the schema of this dataset,
> without even asking for a single row of data, Spark tries to list all the
> partitions and nested partitions, which makes it very slow just to build the
> dataframe object in Zeppelin.
>
> Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a
> single partition, give me the schema of that partition, and consider it the
> schema of the whole dataframe"? (I don't care about schema merging; it's off
> by the way.)
>
> Thanks.
> Walid.
>
-- 
Best Regards,
Ayan Guha


How to read the schema of a partitioned dataframe without listing all the partitions?

2018-04-27 Thread Walid LEZZAR
Hi,

I have a Parquet dataset on S3, partitioned by day, with two years of data
(about 1000 partitions). When I just want to know the schema of this dataset,
without even asking for a single row of data, Spark tries to list all the
partitions and nested partitions, which makes it very slow just to build the
dataframe object in Zeppelin.

Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a
single partition, give me the schema of that partition, and consider it the
schema of the whole dataframe"? (I don't care about schema merging; it's off
by the way.)
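
To make the behavior concrete, a minimal sketch (the path is hypothetical):
the listing happens as soon as the dataframe is built, before any action runs.

// Building the DataFrame already triggers schema inference and partition
// discovery, i.e. the S3 listing of the ~1000 day= folders.
val df = spark.read.parquet("s3://bucket/events")
df.printSchema()  // by this point the full listing has already happened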

Thanks.
Walid.