Re: GC overhead exceeded
Hi Pat,

I am using dynamic allocation with an executor memory of 8 GB. I will try static allocation instead, setting the number of executors and cores explicitly.

Thanks,
Asmath

Sent from my iPhone
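For reference, a minimal sketch of what switching to static allocation might look like; the configuration keys are standard Spark settings, but the instance, core, and memory numbers below are placeholders, not a recommendation for this cluster:

    import org.apache.spark.sql.SparkSession

    // Sketch only: disable dynamic allocation and pin executor resources
    // explicitly. Size the numbers from your cluster, as Pat describes below.
    val spark = SparkSession.builder()
      .appName("gc-overhead-job")
      .config("spark.dynamicAllocation.enabled", "false")
      .config("spark.executor.instances", "20")  // fixed executor count
      .config("spark.executor.cores", "5")       // ~5 concurrent tasks each
      .config("spark.executor.memory", "8g")
      .enableHiveSupport()
      .getOrCreate()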
Re: GC overhead exceeded
+1. What is the executor memory? You may need to adjust executor memory and cores. For the sake of simplicity, assume each executor handles 5 concurrent tasks and should therefore have 5 cores. So if your cluster has 100 cores, you'd have 20 executors; and if your cluster has 500 GB of memory, each executor would get 25 GB.

What's more, you can use tools like the Spark UI or Ganglia to determine which step is failing and why. What is the overall cluster size? How many executors do you have? Is that an appropriate count for this cluster's cores? I'm assuming you are using YARN?

-Pat

From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
Date: Friday, August 18, 2017 at 5:30 AM
To: Pralabh Kumar <pralabhku...@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: GC overhead exceeded
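A small sketch of the sizing arithmetic Pat describes above. The 100-core / 500 GB figures are his illustrative numbers, and 5 cores per executor is a rule of thumb, not a hard requirement:

    // Rule-of-thumb executor sizing (illustrative numbers only).
    val totalCores       = 100  // cores available across the cluster
    val totalMemGb       = 500  // cluster memory in GB
    val coresPerExecutor = 5    // ~5 concurrent tasks per executor

    val numExecutors   = totalCores / coresPerExecutor  // 20 executors
    val memPerExecutor = totalMemGb / numExecutors      // 25 GB each

    // In practice, leave headroom for the OS and YARN container overhead
    // (spark.yarn.executor.memoryOverhead) before dividing.
    println(s"--num-executors $numExecutors --executor-cores $coresPerExecutor " +
            s"--executor-memory ${memPerExecutor}g")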
Re: GC overhead exceeded
It is just a SQL query on a Hive table, with a transformation that adds 10 more calculated currency columns. The input for this query is two months of data, around 450 GB.

I added persist, but it didn't help. The executor memory is also 8 GB. Any suggestions, please?

Sent from my iPhone
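For what it's worth, a minimal sketch of this shape of job with an explicitly serialized persist. The table and column names are invented, and MEMORY_AND_DISK_SER is one commonly suggested way to reduce heap churn, not necessarily what was tried here:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical reconstruction; table and column names are made up.
    // Assumes a Hive-enabled SparkSession named `spark` (as in spark-shell).
    val df = spark.sql(
      """SELECT t.*,
        |       t.amount * t.fx_rate AS amount_usd  -- one of ~10 derived currency columns
        |  FROM sales_hive_table t
        | WHERE t.txn_date >= '2017-06-18'
      """.stripMargin)

    // A serialized storage level keeps fewer live objects on the heap than
    // the default level, which can ease "GC overhead limit exceeded" pressure.
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)
    df.count()  // materialize the cache before the downstream queries reuse df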
Re: GC overhead exceeded
What is your executor memory? Please share the code as well.
GC overhead exceeded
Hi,

I am getting the error below when running Spark SQL jobs. The error is thrown after about 80% of the tasks have completed. Any solution?

spark.storage.memoryFraction=0.4
spark.sql.shuffle.partitions=2000
spark.default.parallelism=100
#spark.eventLog.enabled=false
#spark.scheduler.revive.interval=1s
spark.driver.memory=8g

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.ArrayList.subList(ArrayList.java:955)
        at java.lang.String.split(String.java:2311)
        at sun.net.util.IPAddressUtil.textToNumericFormatV4(IPAddressUtil.java:47)
        at java.net.InetAddress.getAllByName(InetAddress.java:1129)
        at java.net.InetAddress.getAllByName(InetAddress.java:1098)
        at java.net.InetAddress.getByName(InetAddress.java:1048)
        at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:562)
        at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:579)
        at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
        at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
        at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
        at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
        at org.apache.spark.scheduler.TaskSetManager.dequeueTask(TaskSetManager.scala:380)
        at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:433)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:276)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
        at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:271)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:357)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:355)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:355)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:352)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:352)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:222)
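One observation from the trace: the OutOfMemoryError is raised in the driver-side scheduler (YarnScheduler / TaskSchedulerImpl rack resolution), so the driver heap, not only the executors, is worth watching. As a hedged sketch, the snippet below just assembles spark-submit arguments that raise driver memory and enable Java 8 GC logging; driver JVM options must be passed at launch rather than set in code, and the 8g value simply mirrors spark.driver.memory from the config above:

    // Sketch: build spark-submit flags for GC diagnosis. Driver JVM options
    // have to be supplied before the driver starts, so they belong on the
    // command line (or in spark-defaults.conf), not in SparkConf at runtime.
    val gcLogging = "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"  // Java 8 flags
    val submitFlags = Seq(
      "--driver-memory", "8g",  // mirrors spark.driver.memory above
      "--conf", s"spark.driver.extraJavaOptions=$gcLogging",
      "--conf", s"spark.executor.extraJavaOptions=$gcLogging"
    )
    println(submitFlags.mkString("spark-submit ", " ", " <app-jar>"))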