Hi,

Could someone please respond to this?


Thanks
Pankaj Bhootra


On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, <pankajbhoo...@gmail.com> wrote:

> Hello Team
>
> I am new to Spark, and this question may be a duplicate of the issue
> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
>
> We have a large dataset partitioned by calendar date, and within each date
> partition the data is stored as 128 *parquet* part files.
>
> We are trying to run aggregations on this dataset for 366 dates at a time
> with Spark SQL on Spark version 2.3.0, so our Spark job reads
> 366*128=46848 partitions, all of which are parquet files. There are
> currently no *_metadata* or *_common_metadata* files available for this
> dataset.
>
> The problem we are facing is that when we run *spark.read.parquet* on
> the above 46848 partitions, our data reads are extremely slow. Even a
> simple map task with no shuffling, aggregation or group-by takes a long
> time.
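>
> For concreteness, our read looks roughly like the sketch below (the
> paths, dates and column names are placeholders, not our real ones), with
> *spark* being the active SparkSession:
>
>     // Scala, Spark 2.3.0 -- rough sketch of our current read pattern
>     import java.time.LocalDate
>
>     // 366 date partitions, each holding 128 parquet part files
>     val dates = (0 until 366).map(i => LocalDate.of(2020, 1, 1).plusDays(i).toString)
>     val paths = dates.map(d => s"s3://our-bucket/events/date=$d")
>
>     // 366 * 128 = 46848 parquet files in total
>     val df = spark.read.parquet(paths: _*)
>
>     // even a simple select with no shuffle takes a long time for us
>     df.select("event_id").show(10)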
>
> I read through the above issue and I think I generally understand the
> idea behind the *_common_metadata* file. However, that issue was raised
> against Spark 1.3.1, and so far I have not found any documentation on
> this metadata file for Spark 2.3.0.
>
> I would like to clarify:
>
>    1. What is the current best practice for reading a large number of
>    parquet files efficiently?
>    2. Does this involve passing any additional options to
>    spark.read.parquet (for example, something along the lines of the
>    sketch after this list)? How would that work?
>    3. Are there other possible reasons for slow data reads apart from
>    reading the metadata of every part file? We are trying to migrate our
>    existing Spark pipeline from csv files to parquet, but from my hands-on
>    experience so far, parquet's read time seems slower than csv's. This
>    contradicts the popular view that parquet performs better in terms of
>    both computation and storage.
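>
> To make question 2 concrete, the kind of options I have in mind are
> shown below. These are purely illustrative guesses on my part; whether
> any of them are the right levers is exactly what I am asking.
>
>     // Scala, Spark 2.3.0 -- options I have come across, not a known fix
>     // (paths is the same list of 366 date directories as in the sketch
>     // above)
>
>     // make sure schema merging across files is off (I believe this is
>     // already the default)
>     spark.conf.set("spark.sql.parquet.mergeSchema", "false")
>
>     val dfWithOptions = spark.read
>       .option("mergeSchema", "false")               // per-read equivalent
>       .option("basePath", "s3://our-bucket/events") // keep the date partition column
>       .parquet(paths: _*)
>
>     // or supply the schema explicitly, in case schema inference over
>     // many footers is part of the cost (just a guess on my part)
>     import org.apache.spark.sql.types._
>     val schema = StructType(Seq(                    // illustrative columns only
>       StructField("event_id", StringType),
>       StructField("value", DoubleType)))
>     val dfWithSchema = spark.read
>       .schema(schema)
>       .option("basePath", "s3://our-bucket/events")
>       .parquet(paths: _*)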
>
>
> Thanks
> Pankaj Bhootra
>
>
>
> ---------- Forwarded message ---------
> From: Takeshi Yamamuro (Jira) <j...@apache.org>
> Date: Sat, 6 Mar 2021, 20:02
> Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark
> Extremely Slow for Large Number of Files?
> To: <pankajbhoo...@gmail.com>
>
>
>
>     [
> https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296528#comment-17296528
> ]
>
> Takeshi Yamamuro commented on SPARK-34648:
> ------------------------------------------
>
> Please use the mailing list (user@spark.apache.org) instead. This is not
> the right place to ask questions.
>
> > Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> > ------------------------------------------------------------------------
> >
> >                 Key: SPARK-34648
> >                 URL: https://issues.apache.org/jira/browse/SPARK-34648
> >             Project: Spark
> >          Issue Type: Question
> >          Components: SQL
> >    Affects Versions: 2.3.0
> >            Reporter: Pankaj Bhootra
> >            Priority: Major
> >
>
>
>
>
