Hi,
    I think you should write to HDFS first, then copy the files (Parquet or ORC)
from HDFS to MinIO.
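
For the copy step, something like the sketch below might work. It assumes the `hadoop` CLI with the S3A connector is on the PATH and that MinIO credentials are already configured. The paths, the endpoint, and the helper names are placeholders for illustration, not values from this thread.

```python
import subprocess


def distcp_command(hdfs_src, minio_dst, endpoint):
    """Build a `hadoop distcp` command that copies files from HDFS to MinIO.

    All arguments are hypothetical placeholders, e.g.
    hdfs_src="hdfs://namenode/out", minio_dst="s3a://bucket/out".
    """
    return [
        "hadoop", "distcp",
        # Point the S3A connector at the MinIO endpoint.
        f"-Dfs.s3a.endpoint={endpoint}",
        # MinIO deployments usually need path-style bucket access.
        "-Dfs.s3a.path.style.access=true",
        hdfs_src,
        minio_dst,
    ]


def copy_to_minio(hdfs_src, minio_dst, endpoint):
    """Run the copy; raises CalledProcessError if distcp exits non-zero."""
    subprocess.run(distcp_command(hdfs_src, minio_dst, endpoint), check=True)
```

This way Spark writes only once, and the copy is a plain file transfer that never touches the transformations.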



eabour
 
From: Prem Sahoo
Date: 2024-05-22 00:38
To: Vibhor Gupta; user
Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way


On Tue, May 21, 2024 at 6:58 AM Prem Sahoo <prem.re...@gmail.com> wrote:
Hello Vibhor,
Thanks for the suggestion.
I am looking for other alternatives where the same dataframe can be written to
two destinations without re-execution and without cache or persist.

Can someone help me with scenario 2?
How can I make Spark write to MinIO faster?

On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:

 
Hi Prem,
 
You can try writing to HDFS first, then reading from HDFS and writing to MinIO.
 
This avoids running the transformations twice.
 
You can also try persisting the dataframe using the DISK_ONLY level.
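
A rough PySpark sketch of both options follows. The function names and path arguments are made up for illustration, and Parquet with overwrite mode is an assumption; adjust to your job.

```python
def write_hdfs_then_minio(spark, df, hdfs_path, minio_path):
    """Option 1: materialize once on HDFS, then copy from HDFS to MinIO."""
    # The first write runs the transformations exactly once.
    df.write.mode("overwrite").parquet(hdfs_path)
    # The second write reads the files just written, so the
    # transformations are not executed again.
    spark.read.parquet(hdfs_path).write.mode("overwrite").parquet(minio_path)


def write_both_with_disk_persist(df, hdfs_path, minio_path):
    """Option 2: persist to local disk, then write to both destinations."""
    # Imported here so the module still loads where PySpark is absent.
    from pyspark import StorageLevel

    cached = df.persist(StorageLevel.DISK_ONLY)
    try:
        # The first write computes the partitions and stores them on disk.
        cached.write.mode("overwrite").parquet(hdfs_path)
        # The second write reuses the persisted partitions.
        cached.write.mode("overwrite").parquet(minio_path)
    finally:
        # Release the disk space held by the persisted partitions.
        cached.unpersist()
```

Option 1 costs an extra read from HDFS, while option 2 uses executor disk space for the duration of both writes.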
 
Regards,
Vibhor
From: Prem Sahoo <prem.re...@gmail.com>
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list <d...@spark.apache.org>
Subject: EXT: Dual Write to HDFS and MinIO in faster way
Hello Team,
I am planning to write to two data sources at the same time.
 
Scenario 1:
 
Writing the same dataframe to HDFS and MinIO without re-executing the
transformations and without cache(). How can we make it faster?
 
The job reads a Parquet file, applies a few transformations, and writes to HDFS and MinIO.
 
Here, Spark needs to execute the transformations again for each of the two
writes. Is there a way to avoid re-executing the transformations without
cache()/persist()?
 
Scenario 2:
I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
Is there any way to make this write faster?
 
I don't want to repartition before writing, as repartitioning adds shuffle
overhead.
 
Please provide some inputs. 
 
 
