Hi Pankaj,

Can you share your detailed code and Job/Stage info? Which stage is slow?


On Wed, 10 Mar 2021, 12:32 Pankaj Bhootra, <pankajbhoo...@gmail.com> wrote:

> Hi,
>
> Could someone please respond to this?
>
>
> Thanks
> Pankaj Bhootra
>
>
> On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, <pankajbhoo...@gmail.com> wrote:
>
>> Hello Team
>>
>> I am new to Spark and this question may be a duplicate of the issue
>> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
>>
>> We have a large dataset partitioned by calendar date, and within each
>> date partition, we are storing the data as *parquet* files in 128 parts.
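>>
>> For illustration, the layout looks roughly like this (the root path and
>> file names below are placeholders, not our actual names):
>>
>>     /data/events/
>>         date=2020-01-01/
>>             part-00000-<uuid>.snappy.parquet
>>             ...
>>             part-00127-<uuid>.snappy.parquet
>>         date=2020-01-02/
>>             ...
>>         date=2020-12-31/
>>             ...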
>>
>> We are trying to run aggregations on this dataset for 366 dates at a time
>> with Spark SQL on Spark version 2.3.0, so our Spark job reads
>> 366 * 128 = 46,848 partitions, all of which are parquet files. There are
>> currently no *_metadata* or *_common_metadata* files available for this
>> dataset.
>>
>> The problem we are facing is that when we run *spark.read.parquet* on
>> these 46,848 partitions, our data reads are extremely slow. Even a simple
>> map task (no shuffling) without any aggregation or group by takes a long
>> time.
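>>
>> A minimal Scala sketch of this kind of read, assuming the hypothetical
>> layout sketched above (the root path "/data/events" and the partition
>> column name "date" are placeholders, not our actual names):
>>
>>     import org.apache.spark.sql.SparkSession
>>     import org.apache.spark.sql.functions.col
>>
>>     val spark = SparkSession.builder().appName("parquet-read").getOrCreate()
>>
>>     // Read the table root and let Spark discover the date= partitions,
>>     // then prune to the 366 dates of interest with a partition filter,
>>     // rather than passing all 46,848 leaf directories explicitly.
>>     val df = spark.read.parquet("/data/events")
>>       .where(col("date").between("2020-01-01", "2020-12-31"))
>>
>>     // Even this simple count, with no aggregation or shuffle, is slow.
>>     df.count()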
>>
>> I read through the above issue and I think I generally understand the
>> idea behind the *_common_metadata* file. But that issue was raised for
>> Spark 1.3.1, and I have not found any documentation related to this
>> metadata file for Spark 2.3.0 so far.
>>
>> I would like to clarify:
>>
>>    1. What is the latest best practice for reading a large number of
>>    parquet files efficiently?
>>    2. Does this involve using any additional options with
>>    spark.read.parquet? How would that work? (See the sketch after this
>>    list.)
>>    3. Are there other possible reasons for slow data reads apart from
>>    reading metadata for every part? We are trying to migrate our existing
>>    Spark pipeline from CSV files to parquet, but from my hands-on
>>    experience so far, parquet's read time seems slower than CSV's. This
>>    contradicts the popular opinion that parquet performs better in terms
>>    of both computation and storage.
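>>
>> For context on question 2, a sketch of read options and configs that
>> exist in Spark 2.3.0 and come up for reads over many parquet files (the
>> paths are placeholders, and whether any of these actually help here is
>> exactly what I am asking):
>>
>>     // Reusing the SparkSession `spark` from the sketch above
>>     // (spark-shell provides one automatically).
>>
>>     // Schema merging reads every file footer; it is off by default in
>>     // 2.3.0, but being explicit avoids surprises.
>>     spark.conf.set("spark.sql.parquet.mergeSchema", "false")
>>
>>     // Controls how input files are packed into read tasks (default 128 MB).
>>     spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
>>
>>     val pruned = spark.read
>>       .option("mergeSchema", "false")
>>       .parquet("/data/events")              // hypothetical table root
>>       .where(col("date").between("2020-01-01", "2020-12-31"))
>>
>>     // If individual date directories are passed instead of the root, the
>>     // "basePath" option keeps the date partition column available:
>>     // spark.read.option("basePath", "/data/events")
>>     //   .parquet("/data/events/date=2020-01-01", "/data/events/date=2020-01-02")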
>>
>>
>> Thanks
>> Pankaj Bhootra
>>
>>
>>
>> ---------- Forwarded message ---------
>> From: Takeshi Yamamuro (Jira) <j...@apache.org>
>> Date: Sat, 6 Mar 2021, 20:02
>> Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark
>> Extremely Slow for Large Number of Files?
>> To: <pankajbhoo...@gmail.com>
>>
>>
>>
>>     [
>> https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296528#comment-17296528
>> ]
>>
>> Takeshi Yamamuro commented on SPARK-34648:
>> ------------------------------------------
>>
>> Please use the mailing list (user@spark.apache.org) instead. This is not
>> the right place to ask questions.
>>
>> > Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
>> > ------------------------------------------------------------------------
>> >
>> >                 Key: SPARK-34648
>> >                 URL: https://issues.apache.org/jira/browse/SPARK-34648
>> >             Project: Spark
>> >          Issue Type: Question
>> >          Components: SQL
>> >    Affects Versions: 2.3.0
>> >            Reporter: Pankaj Bhootra
>> >            Priority: Major
>>
>>
>>
>> --
>> This message was sent by Atlassian Jira
>> (v8.3.4#803005)
>>
>

-- 
Regards,

钟雨
