Re: is it possible to run spark2 on EMR 7.2.0

2024-09-20 Thread Prem Sahoo
It is not possible, as EMR 7.2.0 comes with Hadoop 3.x and Spark 3.x by default.
If you are looking to migrate from Spark 2 to 3, then use EMR 6.x, probably 6.2.
Sent from my iPhone

> On Sep 20, 2024, at 9:18 AM, joachim rodrigues  
> wrote:
> 
> 
> I'd like to start a migration from Spark 2 to Spark 3. Is it possible to
> upgrade my EMR cluster to emr-7.2.0 and migrate progressively? Meaning that I
> can run Spark 2 on emr-7.2.3, which is dedicated to Spark 3?


Re: Spark Reads from MapR and Write to MinIO fails for few batches

2024-08-24 Thread Prem Sahoo
Issue resolved, thanks for your time folks.
Sent from my iPhone

> On Aug 21, 2024, at 5:38 PM, Prem Sahoo  wrote:
> Hello Team,
> Could you please check on this request?
>
> On Mon, Aug 19, 2024 at 7:00 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>> Hello Spark and User,
>> could you please shed some light?
>>
>> On Thu, Aug 15, 2024 at 7:15 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>> Hello Spark and User,
>>> we have a Spark project with a long-running Spark session that does the
>>> following:
>>> 1. One job reads from MapR FS and writes to MapR FS.
>>> 2. Another parallel job reads from MapR FS and writes to MinIO object
>>> storage.
>>>
>>> For a few batches, the job that writes to MinIO reads an empty
>>> DataFrame/Dataset from MapR, while the job that reads from and writes to
>>> MapR FS never has any issue for the same batches.
>>>
>>> Going through some blogs and Stack Overflow, it appears that a Spark
>>> session holding the configuration for both MapR and MinIO can sometimes
>>> end up with stale state, so we may need to clear or restart the Spark
>>> session for each batch.
>>>
>>> Please let me know if you have any suggestions to get rid of this issue.




Re: Spark Reads from MapR and Write to MinIO fails for few batches

2024-08-21 Thread Prem Sahoo
Hello Team,
Could you please check on this request ?

On Mon, Aug 19, 2024 at 7:00 PM Prem Sahoo  wrote:

> Hello Spark and User,
> could you please shed some light ?
>
> On Thu, Aug 15, 2024 at 7:15 PM Prem Sahoo  wrote:
>
>> Hello Spark and User,
>> we have a Spark project with a long-running Spark session that does the
>> following:
>> 1. One job reads from MapR FS and writes to MapR FS.
>> 2. Another parallel job reads from MapR FS and writes to MinIO object
>> storage.
>>
>> For a few batches, the job that writes to MinIO reads an empty
>> DataFrame/Dataset from MapR, while the job that reads from and writes to
>> MapR FS never has any issue for the same batches.
>>
>> Going through some blogs and Stack Overflow, it appears that a Spark
>> session holding the configuration for both MapR and MinIO can sometimes
>> end up with stale state, so we may need to clear or restart the Spark
>> session for each batch.
>>
>> Please let me know if you have any suggestions to get rid of this issue.
>>
>>


Re: Spark Reads from MapR and Write to MinIO fails for few batches

2024-08-19 Thread Prem Sahoo
Hello Spark and User,
could you please shed some light ?

On Thu, Aug 15, 2024 at 7:15 PM Prem Sahoo  wrote:

> Hello Spark and User,
> we have a Spark project with a long-running Spark session that does the
> following:
> 1. One job reads from MapR FS and writes to MapR FS.
> 2. Another parallel job reads from MapR FS and writes to MinIO object
> storage.
>
> For a few batches, the job that writes to MinIO reads an empty
> DataFrame/Dataset from MapR, while the job that reads from and writes to
> MapR FS never has any issue for the same batches.
>
> Going through some blogs and Stack Overflow, it appears that a Spark
> session holding the configuration for both MapR and MinIO can sometimes
> end up with stale state, so we may need to clear or restart the Spark
> session for each batch.
>
> Please let me know if you have any suggestions to get rid of this issue.
>
>


Spark Reads from MapR and Write to MinIO fails for few batches

2024-08-15 Thread Prem Sahoo
Hello Spark and User,
we have a Spark project with a long-running Spark session that does the
following:
1. One job reads from MapR FS and writes to MapR FS.
2. Another parallel job reads from MapR FS and writes to MinIO object
storage.

For a few batches, the job that writes to MinIO reads an empty
DataFrame/Dataset from MapR, while the job that reads from and writes to
MapR FS never has any issue for the same batches.

Going through some blogs and Stack Overflow, it appears that a Spark
session holding the configuration for both MapR and MinIO can sometimes
end up with stale state, so we may need to clear or restart the Spark
session for each batch.

Please let me know if you have any suggestions to get rid of this issue.


Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me?

On Fri, May 24, 2024 at 12:29 AM Prem Sahoo  wrote:

> Does anyone have a clue ?
>
> On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:
>
>> Hello Team,
>> in the Spark DAG UI we have a Stages tab. Once you click on each stage you
>> can view the tasks.
>>
>> Each task has a "Shuffle Write Size / Records" column, and that column
>> prints wrong data when the stage gets its data from cache/persist: it
>> typically shows the wrong record count even though the data size is
>> correct, e.g. 3.2G / 7400, which is wrong.
>>
>> Please advise.
>>
>


Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue ?

On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:

> Hello Team,
> in the Spark DAG UI we have a Stages tab. Once you click on each stage you
> can view the tasks.
>
> Each task has a "Shuffle Write Size / Records" column, and that column
> prints wrong data when the stage gets its data from cache/persist: it
> typically shows the wrong record count even though the data size is
> correct, e.g. 3.2G / 7400, which is wrong.
>
> Please advise.
>


BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team,
in the Spark DAG UI we have a Stages tab. Once you click on each stage you
can view the tasks.

Each task has a "Shuffle Write Size / Records" column, and that column
prints wrong data when the stage gets its data from cache/persist: it
typically shows the wrong record count even though the data size is
correct, e.g. 3.2G / 7400, which is wrong.

Please advise.


Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
I am looking for a writer/committer optimization that can make the Spark
write faster.
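One committer-side option worth checking (an assumption on my part, not something verified against this cluster) is the S3A committers from the Spark/Hadoop cloud integration, which avoid the slow rename-based commit of the default FileOutputCommitter against object stores such as MinIO. A minimal sketch, assuming the spark-hadoop-cloud module is on the classpath for your Spark build:

```shell
# Sketch: enable the S3A "magic" committer for Parquet writes to an
# S3-compatible store (MinIO). Verify module availability for your distro.
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  --conf spark.hadoop.fs.s3a.committer.magic.enabled=true \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter \
  ...
```

The magic committer uploads task output as S3 multipart uploads and completes them at job commit, so no directory rename is needed on the object store.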

On Tue, May 21, 2024 at 9:15 PM eab...@163.com  wrote:

> Hi,
> I think you should write to HDFS then copy file (parquet or orc) from
> HDFS to MinIO.
>
> --
> eabour
>
>
> *From:* Prem Sahoo 
> *Date:* 2024-05-22 00:38
> *To:* Vibhor Gupta ; user
> 
> *Subject:* Re: EXT: Dual Write to HDFS and MinIO in faster way
>
>
> On Tue, May 21, 2024 at 6:58 AM Prem Sahoo  wrote:
>
>> Hello Vibhor,
>> Thanks for the suggestion.
>> I am looking for other alternatives where the same dataframe can be
>> written to two destinations without re-execution and without cache or
>> persist.
>>
>> Can someone help me with scenario 2?
>> How can we make Spark write to MinIO faster?
>> Sent from my iPhone
>>
>> On May 21, 2024, at 1:18 AM, Vibhor Gupta 
>> wrote:
>>
>> 
>>
>> Hi Prem,
>>
>>
>>
>> You can try to write to HDFS then read from HDFS and write to MinIO.
>>
>>
>>
>> This will prevent duplicate transformation.
>>
>>
>>
>> You can also try persisting the dataframe using the DISK_ONLY level.
>>
>>
>>
>> Regards,
>>
>> Vibhor
>>
>> *From: *Prem Sahoo 
>> *Date: *Tuesday, 21 May 2024 at 8:16 AM
>> *To: *Spark dev list 
>> *Subject: *EXT: Dual Write to HDFS and MinIO in faster way
>>
>> *EXTERNAL: *Report suspicious emails to *Email Abuse.*
>>
>> Hello Team,
>>
>> I am planning to write to two datasource at the same time .
>>
>>
>>
>> Scenario:-
>>
>>
>>
>> Writing the same dataframe to HDFS and MinIO without re-executing the
>> transformations and no cache(). Then how can we make it faster ?
>>
>>
>>
>> Read the parquet file and do a few transformations and write to HDFS and
>> MinIO.
>>
>>
>>
>> Here, for both writes, Spark needs to execute the transformations again.
>> Do we know how we can avoid re-executing the transformations without
>> cache()/persist()?
>>
>>
>>
>> Scenario2 :-
>>
>> I am writing 3.2G data to HDFS and MinIO which takes ~6mins.
>>
>> Do we have any way to make writing this faster ?
>>
>>
>>
>> I don't want to repartition before writing, as repartitioning has the
>> overhead of shuffling.
>>
>>
>>
>> Please provide some inputs.
>>
>>
>>
>>
>>
>>


Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
On Tue, May 21, 2024 at 6:58 AM Prem Sahoo  wrote:

> Hello Vibhor,
> Thanks for the suggestion.
> I am looking for other alternatives where the same dataframe can be
> written to two destinations without re-execution and without cache or
> persist.
>
> Can someone help me with scenario 2?
> How can we make Spark write to MinIO faster?
> Sent from my iPhone
>
> On May 21, 2024, at 1:18 AM, Vibhor Gupta 
> wrote:
>
> 
>
> Hi Prem,
>
>
>
> You can try to write to HDFS then read from HDFS and write to MinIO.
>
>
>
> This will prevent duplicate transformation.
>
>
>
> You can also try persisting the dataframe using the DISK_ONLY level.
>
>
>
> Regards,
>
> Vibhor
>
> *From: *Prem Sahoo 
> *Date: *Tuesday, 21 May 2024 at 8:16 AM
> *To: *Spark dev list 
> *Subject: *EXT: Dual Write to HDFS and MinIO in faster way
>
> *EXTERNAL: *Report suspicious emails to *Email Abuse.*
>
> Hello Team,
>
> I am planning to write to two datasource at the same time .
>
>
>
> Scenario:-
>
>
>
> Writing the same dataframe to HDFS and MinIO without re-executing the
> transformations and no cache(). Then how can we make it faster ?
>
>
>
> Read the parquet file and do a few transformations and write to HDFS and
> MinIO.
>
>
>
> Here, for both writes, Spark needs to execute the transformations again.
> Do we know how we can avoid re-executing the transformations without
> cache()/persist()?
>
>
>
> Scenario2 :-
>
> I am writing 3.2G data to HDFS and MinIO which takes ~6mins.
>
> Do we have any way to make writing this faster ?
>
>
>
> I don't want to repartition before writing, as repartitioning has the
> overhead of shuffling.
>
>
>
> Please provide some inputs.
>
>
>
>
>
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Prem Sahoo
Congratulations 👍
Sent from my iPhone

> On Feb 29, 2024, at 4:54 PM, Xinrong Meng  wrote:
> Congratulations!
> Thanks,
> Xinrong
>
> On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun  wrote:
>> Congratulations!
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>> Congratulations!
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim"  wrote:
>>>> Hi everyone,
>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>
>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>> strongly recommend all 3.5 users to upgrade to this stable release.
>>>>
>>>> To download Spark 3.5.1, head over to the download page:
>>>> https://spark.apache.org/downloads.html
>>>>
>>>> To view the release notes:
>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>
>>>> We would like to acknowledge all community members for contributing to
>>>> this release. This release would not have been possible without you.
>>>>
>>>> Jungtaek Lim
>>>>
>>>> ps. Yikun is helping us through releasing the official docker image for
>>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>>>> available.




Executor tab missing information

2023-02-13 Thread Prem Sahoo
Hello All,
I am executing Spark jobs, but the Executors tab is missing information; I
can't see any data/info coming up. Please let me know what I am missing.