Re: OOM concern

2024-05-27 Thread Meena Rajani
What exactly is the error? Is it erroring out while reading the data from
the DB? How are you partitioning the data?

How much memory do you currently have? What is the network timeout?

Regards,
Meena


On Mon, May 27, 2024 at 4:22 PM Perez  wrote:

> Hi Team,
>
> I want to extract the data from the DB and just dump it into S3. I
> don't have to perform any transformations on the data yet. My data size
> would be ~100 GB (historical load).
>
> Choosing the right DPUs (Glue jobs) should solve this problem, right? Or
> should I move to EMR?
>
> I don't feel the need to move to EMR, but I wanted expert suggestions.
>
> TIA.
>
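The back-of-the-envelope arithmetic behind the partitioning question can be sketched in plain Python. This is only an illustration: the 128 MB target partition size and the partition-column bounds are made-up assumptions for the ~100 GB load described above, not Glue or Spark defaults.

```python
import math

def jdbc_partition_plan(table_size_gb, target_partition_mb=128,
                        lower_bound=1, upper_bound=10_000_000):
    """Rough plan for a partitioned JDBC read: how many partitions keep
    each one near the target size, and the resulting stride of the
    numeric partition column. All numbers are illustrative."""
    num_partitions = math.ceil(table_size_gb * 1024 / target_partition_mb)
    stride = math.ceil((upper_bound - lower_bound) / num_partitions)
    return num_partitions, stride

# ~100 GB historical load, as in the question above
parts, stride = jdbc_partition_plan(100)
print(parts)   # 800 partitions of ~128 MB each
print(stride)  # 12500 ids per partition
```

The same numbers would feed a reader's `numPartitions`/`lowerBound`/`upperBound` options on a JDBC source; too few partitions is a common cause of a single executor holding far too much data and going OOM.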


Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
Mich, you are right: these days attention spans are getting shorter.
Christian could work on a completely new thing for 3 hours and is proud to
explain it. It is amazing.

Thanks for sharing.



On Sat, Apr 27, 2024 at 9:40 PM Farshid Ashouri 
wrote:

> Mich, this is absolutely amazing.
>
> Thanks for sharing.
>
> On Sat, 27 Apr 2024, 22:26 Mich Talebzadeh, 
> wrote:
>
>> Python for the kids. Slightly off-topic but worthwhile sharing.
>>
>> One of the things that may benefit kids is starting to learn something
>> new. Basically anything that can focus their attention away from games for
>> a few hours. Around 2020, my son Christian (now nearly 15) decided to
>> learn a programming language. So somehow he picked Python to start with.
>> The kids are good when they focus. However, they live in a virtual reality
>> world and they cannot focus for long hours. I let him download Python onto
>> his Windows 10 laptop and explore it himself. In the following video Christian
>> explains to his mother what he started to do just before going to bed. BTW,
>> when he says 32M he means 32-bit. I leave it to you to judge :) Now the
>> idea is to start learning PySpark. So I will let him do it himself and
>> learn from his mistakes.  For those who have kids, I would be interested to
>> know their opinion.
>>
>> Cheers
>>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>> view my LinkedIn profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice: "one test result is worth one-thousand expert
>> opinions" (Wernher von Braun).
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark join produce duplicate rows in resultset

2023-10-27 Thread Meena Rajani
Thanks all:

Patrick's suggestion to select rev.* and I.* cleared up the confusion. The
item table actually brought back 4 rows, hence the final result set had 4
rows.

Regards,
Meena

On Sun, Oct 22, 2023 at 10:13 AM Bjørn Jørgensen 
wrote:

> also remove the space in rev. scode
>
> søn. 22. okt. 2023 kl. 19:08 skrev Sadha Chilukoori <
> sage.quoti...@gmail.com>:
>
>> Hi Meena,
>>
>> Just to clarify: are the *on* and *and* keywords optional in the
>> join conditions?
>>
>> Please try this snippet, and see if it helps
>>
>> select rev.* from rev
>> inner join customer c
>> on rev.custumer_id =c.id
>> inner join product p
>> on rev.sys = p.sys
>> and rev.prin = p.prin
>> and rev.scode= p.bcode
>>
>> left join item I
>> on rev.sys = I.sys
>> and rev.custumer_id = I.custumer_id
>> and rev. scode = I.scode;
>>
>> Thanks,
>> Sadha
>>
>> On Sat, Oct 21, 2023 at 3:21 PM Meena Rajani 
>> wrote:
>>
>>> Hello all:
>>>
>>> I am using Spark SQL to join two tables. To my surprise I am
>>> getting redundant rows. What could be the cause?
>>>
>>>
>>> select rev.* from rev
>>> inner join customer c
>>> on rev.custumer_id =c.id
>>> inner join product p
>>> rev.sys = p.sys
>>> rev.prin = p.prin
>>> rev.scode= p.bcode
>>>
>>> left join item I
>>> on rev.sys = i.sys
>>> rev.custumer_id = I.custumer_id
>>> rev. scode = I.scode
>>>
>>> where rev.custumer_id = '123456789'
>>>
>>> The first part of the code brings one row
>>>
>>> select rev.* from rev
>>> inner join customer c
>>> on rev.custumer_id =c.id
>>> inner join product p
>>> rev.sys = p.sys
>>> rev.prin = p.prin
>>> rev.scode= p.bcode
>>>
>>>
The item table has two rows which have common attributes, and the *final
>>> join should result in 2 rows. But I am seeing 4 rows instead.*
>>>
>>> left join item I
>>> on rev.sys = i.sys
>>> rev.custumer_id = I.custumer_id
>>> rev. scode = I.scode
>>>
>>>
>>>
>>> Regards,
>>> Meena
>>>
>>>
>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
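Sadha's question about the *on* and *and* keywords can be checked against any SQL engine. A minimal sketch with SQLite and tiny made-up tables (column names borrowed from the query; Spark SQL's parser rejects the keyword-less form similarly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rev     (sys INTEGER, scode TEXT);
CREATE TABLE product (sys INTEGER, bcode TEXT);
INSERT INTO rev     VALUES (1, 'a');
INSERT INTO product VALUES (1, 'a');
""")

# Omitting ON, as in the original query, is a syntax error --
# the keyword is not optional.
try:
    conn.execute("SELECT rev.* FROM rev INNER JOIN product p rev.sys = p.sys")
    print("parsed")
except sqlite3.OperationalError as e:
    print("syntax error:", e)

# With ON (and AND for the extra conditions) the join parses and runs.
rows = conn.execute(
    "SELECT rev.* FROM rev INNER JOIN product p "
    "ON rev.sys = p.sys AND rev.scode = p.bcode").fetchall()
print(rows)  # [(1, 'a')]
```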


Spark join produce duplicate rows in resultset

2023-10-21 Thread Meena Rajani
Hello all:

I am using Spark SQL to join two tables. To my surprise I am
getting redundant rows. What could be the cause?


select rev.* from rev
inner join customer c
on rev.custumer_id =c.id
inner join product p
rev.sys = p.sys
rev.prin = p.prin
rev.scode= p.bcode

left join item I
on rev.sys = i.sys
rev.custumer_id = I.custumer_id
rev. scode = I.scode

where rev.custumer_id = '123456789'

The first part of the code brings one row

select rev.* from rev
inner join customer c
on rev.custumer_id =c.id
inner join product p
rev.sys = p.sys
rev.prin = p.prin
rev.scode= p.bcode


The item table has two rows which have common attributes, and the *final join
should result in 2 rows. But I am seeing 4 rows instead.*

left join item I
on rev.sys = i.sys
rev.custumer_id = I.custumer_id
rev. scode = I.scode



Regards,
Meena
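The 4-row outcome later confirmed in the thread is standard join fan-out: a left join emits one output row per matching right-side row, so one rev row joined to four matching item rows yields four rows. A minimal sketch with SQLite (made-up data; the column names, including the original "custumer_id" spelling, follow the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rev  (custumer_id TEXT, sys INTEGER, scode TEXT);
CREATE TABLE item (custumer_id TEXT, sys INTEGER, scode TEXT);
INSERT INTO rev  VALUES ('123456789', 1, 'a');
-- four item rows share the same join key
INSERT INTO item VALUES ('123456789', 1, 'a'), ('123456789', 1, 'a'),
                        ('123456789', 1, 'a'), ('123456789', 1, 'a');
""")

rows = conn.execute("""
SELECT rev.* FROM rev
LEFT JOIN item I
  ON rev.sys = I.sys
 AND rev.custumer_id = I.custumer_id
 AND rev.scode = I.scode
WHERE rev.custumer_id = '123456789'
""").fetchall()

print(len(rows))  # 4: one output row per matching item row
```

Deduplicating the item table on the join key (or selecting only rev.* with DISTINCT, if that matches the intent) collapses the result back to the expected row count.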