Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Harel Gliksman
Hi,

We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge nodes (60 GB RAM, 8 vcores, 160 GB SSD each). Driver memory is set to 25 GB.

The job processes ~40 TB of data using aggregateByKey, in which we specify numPartitions = 300,000. Map-side tasks succeed, but reduce-side tasks all fail. We
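For context on why serializeMapStatuses can exhaust 25 GB of driver memory: when the reduce side starts, the driver serializes one map status per map task, and each status tracks per-reduce-partition block information, so the payload grows roughly with num_map_tasks × numPartitions. The sketch below is a back-of-envelope estimator under assumed numbers; the thread only states numPartitions = 300,000, and the map-task count used here is hypothetical, as is the assumption of one bit of bookkeeping per (map task, reduce partition) pair before compression.

```python
# Rough worst-case estimate of the map-status payload the driver must
# serialize in MapOutputTracker. Assumption (not from the thread): the
# bookkeeping costs about one bit per (map task, reduce partition) pair
# before compression.

def map_status_bitmap_bytes(num_map_tasks: int, num_partitions: int) -> int:
    """Worst-case uncompressed bookkeeping size, in bytes."""
    return num_map_tasks * num_partitions // 8

# Hypothetical map-side task count; only numPartitions is given in the thread.
num_map_tasks = 10_000
num_partitions = 300_000

size_bytes = map_status_bitmap_bytes(num_map_tasks, num_partitions)
print(f"{size_bytes / 1024**3:.2f} GiB")  # ~0.35 GiB per serialization, pre-compression
```

Even with compression, multiplying this by broadcast copies and serialization buffers shows why very high partition counts put pressure on the driver heap; reducing numPartitions or coalescing the map side shrinks the product directly.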