Hi,
Does anyone have any optimisation tips, or could anyone propose an alternative way to perform the operation below?

    val groupedUserItems1 = userItems1.groupByKey { _.customer_id }
    val groupedUserItems2 = userItems2.groupByKey { _.customer_id }
    groupedUserItems1.cogroup(groupedUserItems2) {
      case (_, userItems1, userItems2) => processSingleUser(userItems1, userItems2)
    }

The userItems1 and userItems2 datasets are quite large (hundreds of millions of records), so I'm finding that the shuffle stage is shuffling gigabytes of data.

Any help would be greatly appreciated.

Thanks,

Steve Robinson
steve.robin...@aquilainsight.com
0131 290 2300
www.aquilainsight.com
linkedin.com/aquilainsight <https://www.linkedin.com/company/aquila-insight>
twitter.com/aquilainsight <http://twitter.com/aquilainsight>