Hi,

Does anyone have any optimisation tips, or could anyone propose an alternative
way to perform the operation below?


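// Group both datasets by customer so each user's records end up together
val groupedUserItems1 = userItems1.groupByKey(_.customer_id)
val groupedUserItems2 = userItems2.groupByKey(_.customer_id)

// Co-group the two sides by customer_id and process one user at a time
groupedUserItems1.cogroup(groupedUserItems2) {
  case (_, items1, items2) =>
    processSingleUser(items1, items2)
}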

The userItems1 and userItems2 datasets are quite large (hundreds of millions
of records), so I'm finding that the shuffle stage is shuffling gigabytes of data.

Any help would be greatly appreciated.

Thanks,



Steve Robinson

steve.robin...@aquilainsight.com
0131 290 2300

www.aquilainsight.com
linkedin.com/aquilainsight <https://www.linkedin.com/company/aquila-insight>
twitter.com/aquilainsight <http://twitter.com/aquilainsight>
