On 10/22/16 6:18 AM, Steve Loughran wrote:

...
On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

    What version of Spark are you using, and how many output files
    does the job write out?

    By default, Spark versions before 1.6 (exclusive) write
    Parquet summary files when committing the job. This process reads
    footers from all Parquet files in the destination directory and
    merges them together. This can be particularly expensive if you are
    appending a small amount of data to a large existing Parquet dataset.

    If that's the case, you may disable Parquet summary files by
    setting the Hadoop configuration option "parquet.enable.summary-metadata"
    to false.



Now I'm a bit mixed up. Should that be spark.sql.parquet.enable.summary-metadata=false?

No, "parquet.enable.summary-metadata" is a Hadoop configuration option introduced by Parquet. In Spark 2.0, you can simply set it using spark.conf.set(), and Spark will propagate it properly.
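
For illustration, here is a minimal Scala sketch of the setting described above. The option name and the use of spark.conf.set() come from this thread. The app name, the local master, the sample DataFrame, and the output path are hypothetical placeholders, and the commented-out Spark 1.x line is an assumption about how one would reach the Hadoop configuration on older versions rather than something stated here.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; it is built here
// only so that the sketch is self-contained.
val spark = SparkSession.builder()
  .appName("parquet-summary-off") // hypothetical app name
  .master("local[*]")             // assumption: local run for illustration
  .getOrCreate()

// Spark 2.0+: set the Parquet (Hadoop) option through the session conf.
spark.conf.set("parquet.enable.summary-metadata", "false")

// Assumed Spark 1.x equivalent, set directly on the Hadoop configuration:
// sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Hypothetical data and output path, standing in for the real job.
val df = spark.range(1000).toDF("id")
df.write.mode("append").parquet("/tmp/example_parquet_out")
```

With the option set to false, the job commit should no longer read and merge the footers of every Parquet file in the destination directory, which is the step suspected of causing the long wait after the stages finish.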

    We've disabled it by default since 1.6.0

    Cheng


    On 10/21/16 1:47 PM, Chetan Khatri wrote:
    Hello Spark Users,

    I am writing around 10 GB of processed data to Parquet on a
    Google Cloud machine with 1 TB of HDD, 102 GB of RAM, and
    16 vCores.

    Every time I write to Parquet, the Spark UI shows that the stages
    succeeded, but the Spark shell stays in a waiting state for
    almost 10 minutes. Then it clears the broadcast and accumulator
    shared variables.

    Can we speed this up?

    Thanks.

-- Yours Aye,
    Chetan Khatri.
    M.+91 76666 80574
    Data Science Researcher
    INDIA

    ​​Statement of Confidentiality
    ————————————————————————————
    The contents of this e-mail message and any attachments are
    confidential and are intended solely for addressee. The
    information may also be legally privileged. This transmission is
    sent in trust, for the sole purpose of delivery to the intended
    recipient. If you have received this transmission in error, any
    use, reproduction or dissemination of this transmission is
    strictly prohibited. If you are not the intended recipient,
    please immediately notify the sender by reply e-mail or phone
    and delete this message and its attachments, if any.​​



