Hi all,

I have to join two files. Neither is very big (a few MB each), but the join result can be huge, anywhere from 500 GB to several TB. I tried Spark's join() function, but I noticed that the join executes on only one or two nodes at most. Since my cluster has 5 nodes, I tried "join(otherDataset, [numTasks])" with numTasks=10, but then 9 of the tasks finished almost instantly and a single executor ended up processing all the data.
I searched online and found that I can use a broadcast variable to send the data from one file to all nodes and then do the join inside a map function. That way I should be able to run multiple tasks on different executors.

My question is this: since Spark provides join functionality, I had assumed it would handle the data parallelism automatically. Does Spark provide something I can use directly for this join, rather than implementing a map-side join with a broadcast variable on my own? Any other, better approach is also welcome. I assume this is a common problem, so I am looking for suggestions.

Thanks & Regards,
Stuti Awasthi
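
For reference, the map-side join described above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not a tested Spark job: the sample records and the helper names build_lookup and map_side_join are hypothetical, and the distributed step is simulated with a list comprehension. The comments show where sc.broadcast and flatMap would go, assuming a PySpark SparkContext named sc.

```python
from collections import defaultdict


def build_lookup(small_records):
    """Group the (key, value) pairs of one file by key (hypothetical helper)."""
    lookup = defaultdict(list)
    for key, value in small_records:
        lookup[key].append(value)
    return dict(lookup)


def map_side_join(record, lookup):
    """Inner-join a single (key, value) record against the lookup table."""
    key, left_value = record
    return [(key, (left_value, right_value)) for right_value in lookup.get(key, [])]


# Hypothetical sample data standing in for the two input files.
left = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
right = [("a", "x"), ("a", "y"), ("b", "z")]

lookup = build_lookup(right)

# Plain-Python stand-in for the distributed step. With PySpark (assumed,
# not run here) this would be roughly:
#   bc = sc.broadcast(lookup)
#   joined = sc.parallelize(left, 10).flatMap(lambda r: map_side_join(r, bc.value))
joined = [pair for record in left for pair in map_side_join(record, lookup)]

print(joined)
```

In this approach there is no shuffle by key, so the work should be spread according to how the non-broadcast side is partitioned rather than by how the join keys are distributed.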