Re: What is the best way to organize a join within a foreach?

Marco Costantini Tue, 25 Apr 2023 10:50:51 -0700

Thanks Mich,

Great idea. I have done it. Those files are attached. I'm interested to
know your thoughts. Let's imagine this same structure, but with huge
amounts of data as well.


Please and thank you,
Marco.

On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Marco,
>
> Let us start simple,
>
> Provide a csv file of 5 rows for the users table. Each row has a unique
> user_id and one or two other columns like fictitious email etc.
>
> Also for each user_id, provide 10 rows of orders table, meaning that
> orders table has 5 x 10 rows for each user_id.
>
> both as comma separated csv file
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 25 Apr 2023 at 14:07, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> Thanks Mich,
>> I have not but I will certainly read up on this today.
>>
>> To your point that all of the essential data is in the 'orders' table; I
>> agree! That distills the problem nicely. Yet, I still have some questions
>> on which someone may be able to shed some light.
>>
>> 1) If my 'orders' table is very large, and will need to be aggregated by
>> 'user_id', how will Spark intelligently optimize on that constraint (only
>> read data for relevent 'user_id's). Is that something I have to instruct
>> Spark to do?
>>
>> 2) Without #1, even with windowing, am I asking each partition to search
>> too much?
>>
>> Please, if you have any links to documentation I can read on *how* Spark
>> works under the hood for these operations, I would appreciate it if you
>> give them. Spark has become a pillar on my team and knowing it in more
>> detail is warranted.
>>
>> Slightly pivoting the subject here; I have tried something. It was a
>> suggestion by an AI chat bot and it seemed reasonable. In my main Spark
>> script I now have the line:
>>
>> ```
>> grouped_orders_df =
>> orders_df.groupBy('user_id').agg(collect_list(to_json(struct('user_id',
>> 'timestamp', 'total', 'description'))).alias('orders'))
>> ```
>> (json is ultimately needed)
>>
>> This actually achieves my goal by putting all of the 'orders' in a single
>> Array column. Now my worry is, will this column become too large if there
>> are a great many orders. Is there a limit? I have search for documentation
>> on such a limit but could not find any.
>>
>> I truly appreciate your help Mich and team,
>> Marco.
>>
>>
>> On Tue, Apr 25, 2023 at 5:40 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Have you thought of using  windowing function
>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>> achieve this?
>>>
>>> Effectively all your information is in the orders table.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
>>> marco.costant...@rocketfncl.com> wrote:
>>>
>>>> I have two tables: {users, orders}. In this example, let's say that for
>>>> each 1 User in the users table, there are 100000 Orders in the orders 
>>>> table.
>>>>
>>>> I have to use pyspark to generate a statement of Orders for each User.
>>>> So, a single user will need his/her own list of Orders. Additionally, I
>>>> need to send this statement to the real-world user via email (for example).
>>>>
>>>> My first intuition was to apply a DataFrame.foreach() on the users
>>>> DataFrame. This way, I can rely on the spark workers to handle the email
>>>> sending individually. However, I now do not know the best way to get each
>>>> User's Orders.
>>>>
>>>> I will soon try the following (pseudo-code):
>>>>
>>>> ```
>>>> users_df = <my entire users DataFrame>
>>>> orders_df = <my entire orders DataFrame>
>>>>
>>>> #this is poorly named for max understandability in this context
>>>> def foreach_function(row):
>>>>   user_id = row.user_id
>>>>   user_orders_df = orders_df.select(f'user_id = {user_id}')
>>>>
>>>>   #here, I'd get any User info from 'row'
>>>>   #then, I'd convert all 'user_orders' to JSON
>>>>   #then, I'd prepare the email and send it
>>>>
>>>> users_df.foreach(foreach_function)
>>>> ```
>>>>
>>>> It is my understanding that if I do my user-specific work in the
>>>> foreach function, I will capitalize on Spark's scalability when doing that
>>>> work. However, I am worried of two things:
>>>>
>>>> If I take all Orders up front...
>>>>
>>>> Will that work?
>>>> Will I be taking too much? Will I be taking Orders on partitions who
>>>> won't handle them (different User).
>>>>
>>>> If I create the orders_df (filtered) within the foreach function...
>>>>
>>>> Will it work?
>>>> Will that be too much IO to DB?
>>>>
>>>> The question ultimately is: How can I achieve this goal efficiently?
>>>>
>>>> I have not yet tried anything here. I am doing so as we speak, but am
>>>> suffering from choice-paralysis.
>>>>
>>>> Please and thank you.
>>>>
>>>

id,name
10001,Mich
10002,Marco
10003,Micho
10004,March
10005,Steve

id,description,amount,user_id
50001,"Mich's 1st order",101.11,10001
50002,"Mich's 2nd order",102.11,10001
50003,"Mich's 3rd order",103.11,10001
50004,"Mich's 4th order",104.11,10001
50005,"Mich's 5th order",105.11,10001
50006,"Mich's 6th order",106.11,10001
50007,"Mich's 7th order",107.11,10001
50008,"Mich's 8th order",108.11,10001
50009,"Mich's 9th order",109.11,10001
50010,"Mich's 10th order",210.11,10002
50011,"Marco's 1st order",201.11,10002
50012,"Marco's 2nd order",202.11,10002
50013,"Marco's 3rd order",203.11,10002
50014,"Marco's 4th order",204.11,10002
50015,"Marco's 5th order",205.11,10002
50016,"Marco's 6th order",206.11,10002
50017,"Marco's 7th order",207.11,10002
50018,"Marco's 8th order",208.11,10002
50019,"Marco's 9th order",209.11,10002
50020,"Marco's 10th order",210.11,10002
50021,"Micho's 1st order",301.11,10003
50022,"Micho's 2nd order",302.11,10003
50023,"Micho's 3rd order",303.11,10003
50024,"Micho's 4th order",304.11,10003
50025,"Micho's 5th order",305.11,10003
50026,"Micho's 6th order",306.11,10003
50027,"Micho's 7th order",307.11,10003
50028,"Micho's 8th order",308.11,10003
50029,"Micho's 9th order",309.11,10003
50030,"Micho's 10th order",310.11,10003
50031,"Micho's 1st order",401.11,10004
50032,"Micho's 2nd order",402.11,10004
50033,"Micho's 3rd order",403.11,10004
50034,"Micho's 4th order",404.11,10004
50035,"Micho's 5th order",405.11,10004
50036,"Micho's 6th order",406.11,10004
50037,"Micho's 7th order",407.11,10004
50038,"Micho's 8th order",408.11,10004
50039,"Micho's 9th order",409.11,10004
50040,"Micho's 10th order",410.11,10004
50041,"Steve's 1st order",501.11,10005
50042,"Steve's 2nd order",502.11,10005
50043,"Steve's 3rd order",503.11,10005
50044,"Steve's 4th order",504.11,10005
50045,"Steve's 5th order",505.11,10005
50046,"Steve's 6th order",506.11,10005
50047,"Steve's 7th order",507.11,10005
50048,"Steve's 8th order",508.11,10005
50049,"Steve's 9th order",509.11,10005
50050,"Steve's 10th order",510.11,10005

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: What is the best way to organize a join within a foreach?

Reply via email to