Hi Pankaj, could you share your detailed code and the Job/Stage info? Which stage is slow?
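
In the meantime, here is a minimal sketch of the read pattern that usually behaves well for a date-partitioned parquet layout like yours. The table root path, the partition column name "date", the date range, and the output path are placeholders for illustration only, not values from your job:

    // Minimal sketch, assuming a layout like <root>/date=YYYY-MM-DD/part-*.parquet.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-read-check")
      // Schema merging forces footer reads across all the part files when enabled;
      // it is off by default, but worth confirming it has not been turned on.
      .config("spark.sql.parquet.mergeSchema", "false")
      .getOrCreate()
    import spark.implicits._

    // Read from the table root and filter on the partition column, so Spark can
    // prune the 366 date partitions at planning time instead of enumerating them by hand.
    val df = spark.read
      .parquet("hdfs:///data/my_table")
      .filter($"date" >= "2020-01-01" && $"date" <= "2020-12-31")

    // A map-only job (no shuffle), to isolate scan time from shuffle time.
    df.select("some_column").write.mode("overwrite").parquet("hdfs:///tmp/scan_check")

If the read is still slow with schema merging off and partition pruning in place, the driver-side file listing for ~47k files can also dominate; the Job/Stage timeline you share should show whether the time goes to that listing step or to the executor-side scans.
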
On Wed, Mar 10, 2021 at 12:32 PM Pankaj Bhootra <pankajbhoo...@gmail.com> wrote:

> Hi,
>
> Could someone please respond to this?
>
> Thanks
> Pankaj Bhootra
>
>
> On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, <pankajbhoo...@gmail.com> wrote:
>
>> Hello Team
>>
>> I am new to Spark, and this question may be a duplicate of the issue
>> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
>>
>> We have a large dataset partitioned by calendar date, and within each
>> date partition we store the data as *parquet* files in 128 parts.
>>
>> We are trying to run aggregation on this dataset for 366 dates at a time
>> with Spark SQL on Spark version 2.3.0, so our Spark job is reading
>> 366*128=46848 partitions, all of which are parquet files. There is
>> currently no *_metadata* or *_common_metadata* file available for this
>> dataset.
>>
>> The problem we are facing is that when we run *spark.read.parquet* on
>> the above 46848 partitions, our data reads are extremely slow. Even a
>> simple map task (no shuffling, no aggregation or group by) takes a long
>> time to run.
>>
>> I read through the above issue and I think I generally understand the
>> idea behind the *_common_metadata* file. But that issue was raised
>> against Spark 1.3.1, and for Spark 2.3.0 I have not found any
>> documentation on this metadata file so far.
>>
>> I would like to clarify:
>>
>> 1. What is the current best practice for reading a large number of
>> parquet files efficiently?
>> 2. Does this involve any additional options to spark.read.parquet? How
>> would that work?
>> 3. Are there other possible reasons for slow data reads apart from
>> reading the metadata of every part? We are trying to migrate our
>> existing Spark pipeline from csv files to parquet, but from my hands-on
>> testing so far, parquet's read time seems slower than csv's. This
>> contradicts the popular opinion that parquet performs better in terms
>> of both computation and storage.
>>
>>
>> Thanks
>> Pankaj Bhootra
>>
>>
>> ---------- Forwarded message ---------
>> From: Takeshi Yamamuro (Jira) <j...@apache.org>
>> Date: Sat, 6 Mar 2021, 20:02
>> Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark
>> Extremely Slow for Large Number of Files?
>> To: <pankajbhoo...@gmail.com>
>>
>>
>> [ https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296528#comment-17296528 ]
>>
>> Takeshi Yamamuro commented on SPARK-34648:
>> ------------------------------------------
>>
>> Please use the mailing list (user@spark.apache.org) instead. This is not
>> the right place to ask questions.
>>
>> > Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
>> > -------------------------------------------------------------------------
>> >
>> >                 Key: SPARK-34648
>> >                 URL: https://issues.apache.org/jira/browse/SPARK-34648
>> >             Project: Spark
>> >          Issue Type: Question
>> >          Components: SQL
>> >    Affects Versions: 2.3.0
>> >            Reporter: Pankaj Bhootra
>> >            Priority: Major
>>

--
Best regards,
钟雨