I have two tables: {users, orders}. In this example, let's say that for each User in the users table there are 100,000 Orders in the orders table.
I have to use PySpark to generate a statement of Orders for each User, so a single user needs his/her own list of Orders. Additionally, I need to send this statement to the real-world user, via email for example.

My first intuition was to apply a `DataFrame.foreach()` on the users DataFrame. That way, I can rely on the Spark workers to handle the email sending individually. However, I do not know the best way to get each User's Orders. I plan to try the following (pseudo-code):

```
users_df = <my entire users DataFrame>
orders_df = <my entire orders DataFrame>

# this function is poorly named, for max understandability in this context
def foreach_function(row):
    user_id = row.user_id
    user_orders_df = orders_df.filter(f'user_id = {user_id}')
    # here, I'd get any User info from 'row'
    # then, I'd convert all 'user_orders' to JSON
    # then, I'd prepare the email and send it

users_df.foreach(foreach_function)
```

It is my understanding that if I do my user-specific work inside the foreach function, I will capitalize on Spark's scalability for that work. However, I am worried about two things:

- If I take all Orders up front (sketched at the end of this post): will that work? Will I be pulling too much data? Will I end up with Orders on partitions that won't handle them (i.e. that belong to a different User)?
- If I create the filtered orders_df inside the foreach function: will it work? Will that generate too much IO against the database?

The question ultimately is: how can I achieve this goal efficiently? I have not tried anything yet; I am working on it as we speak, but am suffering from choice paralysis. Please and thank you.
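For concreteness, here is roughly what I mean by "taking all Orders up front". This is only a sketch, not something I have run: I am assuming both tables share a `user_id` column and, purely for illustration, that they can be loaded with `spark.table()` (my real sources differ). The email part is just a placeholder.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed sources for illustration; my real tables come from elsewhere
users_df = spark.table("users")
orders_df = spark.table("orders")

# Collect each user's orders into one row, each order serialized as a JSON string
orders_per_user = (
    orders_df
    .groupBy("user_id")
    .agg(F.collect_list(F.to_json(F.struct(*orders_df.columns))).alias("order_jsons"))
)

# One row per user, carrying the user info plus that user's orders
statements_df = users_df.join(orders_per_user, on="user_id", how="left")

def send_statement(row):
    # placeholder: build the email body from row.order_jsons and send it
    pass

statements_df.foreach(send_statement)
```

This is the variant I'm worried might pull too many Orders onto the workers at once.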