[ https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Takeshi Yamamuro resolved SPARK-34648.
--------------------------------------
    Resolution: Invalid

> Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> ------------------------------------------------------------------------
>
>                 Key: SPARK-34648
>                 URL: https://issues.apache.org/jira/browse/SPARK-34648
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Pankaj Bhootra
>            Priority: Major
>
> Hello Team,
> I am new to Spark, and this question may be a duplicate of the issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
> We have a large dataset partitioned by calendar date, and within each date partition we store the data as *parquet* files in 128 parts.
> We are trying to run an aggregation over this dataset for 366 dates at a time with Spark SQL on Spark 2.3.0, so our job reads 366*128=46848 partitions, all of which are parquet files. There is currently no *_metadata* or *_common_metadata* file available for this dataset.
> The problem is that when we run *spark.read.parquet* on these 46848 partitions, the reads are extremely slow. Even a simple map task (no shuffling), with no aggregation or group by, takes a long time to run.
> I read through the issue above and I think I broadly understand the ideas around the *_common_metadata* file. However, that issue was raised against Spark 1.3.1, and for Spark 2.3.0 I have not found any documentation about this metadata file so far.
> I would like to clarify:
> # What is the current best practice for reading a large number of parquet files efficiently?
> # Does this involve any additional options to spark.read.parquet? How would that work?
> # Are there other possible reasons for slow reads apart from reading the metadata of every part file? We are trying to migrate our existing Spark pipeline from CSV files to parquet, but in my hands-on testing so far, parquet read times have been slower than CSV. This contradicts the popular opinion that parquet performs better in both computation and storage.
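For reference, a minimal Scala sketch of the read pattern described above. The table root /data/events, the date=... directory layout, and the value column are hypothetical placeholders, and the configuration shown is a standard Spark option rather than a confirmed fix for this report:

{code:scala}
// Sketch of reading a date-partitioned parquet table (Spark 2.3.x, Scala).
// Paths and column names below are hypothetical, for illustration only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ParquetScanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-scan-sketch")
      // Schema merging reads footers of many part files to reconcile schemas;
      // it is off by default in Spark 2.x, shown explicitly here because the
      // question asks about read options.
      .config("spark.sql.parquet.mergeSchema", "false")
      .getOrCreate()

    // Point the reader at the table root so Spark discovers the date=...
    // directories as a partition column instead of receiving tens of
    // thousands of individual file paths.
    val df = spark.read
      .option("basePath", "/data/events")           // hypothetical table root
      .parquet("/data/events/date=2020-*")          // ~366 date partitions

    // A simple narrow transformation (no shuffle), as described in the report.
    val mapped = df.select(col("date"), col("value") * 2)
    println(mapped.count())

    spark.stop()
  }
}
{code}

The intent of the sketch is that the driver lists the partition directories once and treats the calendar date as a partition column, rather than enumerating all 46848 part files as independent inputs; whether that addresses the slowness reported here is not established by this issue.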