Hi, could someone please respond to this?
Thanks
Pankaj Bhootra

On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, <pankajbhoo...@gmail.com> wrote:

> Hello Team
>
> I am new to Spark and this question may be a possible duplicate of the
> issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
>
> We have a large dataset partitioned by calendar date, and within each date
> partition we store the data as *parquet* files in 128 parts.
>
> We are trying to run aggregation on this dataset for 366 dates at a time
> with Spark SQL on Spark version 2.3.0, so our Spark job reads
> 366*128=46848 partitions, all of which are parquet files. There are
> currently no *_metadata* or *_common_metadata* files available for this
> dataset.
>
> The problem we are facing is that when we run *spark.read.parquet* on the
> above 46848 partitions, our data reads are extremely slow. Even a simple
> map task (no shuffling, aggregation or group by) takes a long time to run.
>
> I read through the above issue and I think I generally understand the
> ideas around the *_common_metadata* file. But that issue was raised for
> Spark 1.3.1, and for Spark 2.3.0 I have not found any documentation
> related to this metadata file so far.
>
> I would like to clarify:
>
> 1. What is the latest best practice for reading a large number of parquet
>    files efficiently?
> 2. Does this involve using any additional options with spark.read.parquet?
>    How would that work?
> 3. Are there other possible reasons for slow data reads apart from reading
>    metadata for every part? We are basically trying to migrate our existing
>    Spark pipeline from CSV files to parquet, but from my hands-on experience
>    so far, parquet's read time seems slower than CSV's. This contradicts the
>    popular opinion that parquet performs better in terms of both computation
>    and storage.
>
> Thanks
> Pankaj Bhootra
>
>
> ---------- Forwarded message ---------
> From: Takeshi Yamamuro (Jira) <j...@apache.org>
> Date: Sat, 6 Mar 2021, 20:02
> Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark
> Extremely Slow for Large Number of Files?
> To: <pankajbhoo...@gmail.com>
>
> [ https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296528#comment-17296528 ]
>
> Takeshi Yamamuro commented on SPARK-34648:
> ------------------------------------------
>
> Please use the mailing list (user@spark.apache.org) instead. This is not
> the right place to ask questions.
>
> > Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> > -------------------------------------------------------------------------
> >
> > Key: SPARK-34648
> > URL: https://issues.apache.org/jira/browse/SPARK-34648
> > Project: Spark
> > Issue Type: Question
> > Components: SQL
> > Affects Versions: 2.3.0
> > Reporter: Pankaj Bhootra
> > Priority: Major
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
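
For context, here is a minimal sketch of the kind of read we are running. The root path, the partition column name ("date"), the filter values and the final action are illustrative placeholders rather than our exact job; the idea is simply to read the dataset root once, let Spark prune the date partitions from a filter, and leave schema merging off:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parquet-read-sketch").getOrCreate()

    # Read the dataset root (containing date=YYYY-MM-DD/ subdirectories) once,
    # rather than enumerating 366 date paths or 46848 individual part files.
    # mergeSchema defaults to false; it is set explicitly here only to make the
    # intent visible, since merging schemas forces extra footer reads.
    df = (
        spark.read
        .option("mergeSchema", "false")
        .parquet("hdfs:///data/our_dataset")  # illustrative root path
        .where(F.col("date").between("2020-01-01", "2020-12-31"))  # partition pruning
    )

    # An action to force the scan; replace with the real aggregation as needed.
    print(df.count())

If someone can confirm whether this is the expected pattern on 2.3.0, or whether the initial file listing for roughly 47k files is itself the likely bottleneck (for example, whether registering the data as a partitioned table in the metastore, so pruning happens from catalog metadata rather than a full listing, would help), that would already answer most of questions 1 and 2 above.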