Re: GC overhead exceeded

2017-08-18 Thread KhajaAsmath Mohammed
Hi Pat,

I am using dynamic allocation with an executor memory of 8 GB. I will look into 
static allocation by specifying the number of executors and cores.
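
For reference, the static setup I plan to try would look roughly like the sketch 
below (the instance count is a placeholder until I size it against our cluster, 
and dynamic allocation has to be disabled for the instance count to take effect):

spark.dynamicAllocation.enabled=false
spark.executor.instances=20
spark.executor.cores=5
spark.executor.memory=8g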

Thanks,
Asmath

Sent from my iPhone



Re: GC overhead exceeded

2017-08-18 Thread Patrick Alwell
+1. What is the executor memory? You may need to adjust executor memory and 
cores. For the sake of simplicity, each executor can handle 5 concurrent tasks 
and should have 5 cores. So if your cluster has 100 cores, you'd have 20 
executors; and if your cluster memory is 500 GB, each executor would have 25 GB 
of memory.
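
In config terms, that rule of thumb works out to something like the sketch below 
(illustrative numbers only; in practice leave some headroom for the OS and the 
YARN memory overhead):

# 5 concurrent tasks per executor -> 5 cores per executor
spark.executor.cores=5
# 100 total cores / 5 cores per executor = 20 executors
spark.executor.instances=20
# 500 GB cluster memory / 20 executors = 25 GB per executor
spark.executor.memory=25g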

What’s more, you can use tools like the Spark UI or Ganglia to determine which 
step is failing and why. What is the overall cluster size? How many executors 
do you have? Is it an appropriate count for this cluster’s cores? I’m assuming 
you are using YARN?

-Pat





Re: GC overhead exceeded

2017-08-18 Thread KhajaAsmath Mohammed
It is just a SQL query on a Hive table, with a transformation that adds 10 more 
columns calculated for currency. The input size for this query is 2 months of 
data, which is around 450 GB.

I added persist but it didn't help. Also, the executor memory is 8 GB. Any 
suggestions, please?
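
For reference, a serialized-cache variant I have not tried yet would look 
roughly like the sketch below (the query and table names are placeholders):

import org.apache.spark.storage.StorageLevel

// Placeholder query standing in for the real Hive SQL with the currency columns.
val df = spark.sql("SELECT /* currency columns */ * FROM some_hive_table")

// A serialized storage level trades some CPU for fewer long-lived heap objects,
// which can mean less GC pressure than deserialized caching.
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()  // materialize the cache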

Sent from my iPhone



Re: GC overhead exceeded

2017-08-17 Thread Pralabh Kumar
What is your executor memory? Please share the code as well.



GC overhead exceeded

2017-08-17 Thread KhajaAsmath Mohammed
Hi,

I am getting the below error when running Spark SQL jobs. The error is thrown
after about 80% of the tasks have completed. Any solution?

spark.storage.memoryFraction=0.4
spark.sql.shuffle.partitions=2000
spark.default.parallelism=100
#spark.eventLog.enabled=false
#spark.scheduler.revive.interval=1s
spark.driver.memory=8g
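
In case it is relevant, GC details could also be captured with standard HotSpot
flags passed through the driver options (shown only as an illustration; this is
not in my current config):

spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps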


java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.ArrayList.subList(ArrayList.java:955)
        at java.lang.String.split(String.java:2311)
        at sun.net.util.IPAddressUtil.textToNumericFormatV4(IPAddressUtil.java:47)
        at java.net.InetAddress.getAllByName(InetAddress.java:1129)
        at java.net.InetAddress.getAllByName(InetAddress.java:1098)
        at java.net.InetAddress.getByName(InetAddress.java:1048)
        at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:562)
        at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:579)
        at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
        at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
        at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
        at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
        at org.apache.spark.scheduler.TaskSetManager.dequeueTask(TaskSetManager.scala:380)
        at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:433)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:276)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
        at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:271)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:357)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:355)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:355)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:352)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:352)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:222)