hey guys
I tried the following settings as well, with no luck:
--total-executor-cores 24 --executor-memory 4G
BTW, on the same cluster Impala absolutely kills it: the same query takes 9
seconds, with no memory issues. No issues at all.
In fact I am pretty disappointed with Spark-SQL. I have worked with Hive during
the 0.9.x stages and taken projects to production successfully, and Hive
actually very rarely craps out.
Whether the Spark folks like what I say or not, yes, my expectations of
Spark-SQL are pretty high if I were to change the way we are doing things at my
workplace. Until that time, we are going to be hugely dependent on Impala and
Hive (with SSDs speeding up the shuffle stage, even MR jobs are not that slow
now).
I want to clarify, for those of you who may be asking, why I am not using Spark
with Scala and am insisting on using spark-sql:
- I have already pipelined data from enterprise tables to Hive
- I am using CDH 5.3.3 (Cloudera starving developers version)
- I have close to 300 tables defined as Hive external tables
- Data is on HDFS
- On average we have 150 columns per table
- On an everyday basis, we do crazy amounts of ad-hoc joining of new and old
  tables in getting datasets ready for supervised ML
- I thought that quite simply I can point Spark to the Hive metastore and do
  queries as I do now - in fact the existing queries would work as is unless I
  am using some esoteric Hive/Impala function
Anyway, if there are some settings I can use to get spark-sql to run, even in
standalone mode, that will be a huge help.
On the pre-production cluster I have Spark on YARN but could never get it to
run fairly complex queries, and I have no answers from this group or the CDH
groups.
So my assumption is that it's possibly not solved, else I have always got very
quick answers and responses :-) to my questions on all CDH groups, Spark, Hive.
best regards
sanjay
From: Josh Rosen <[email protected]>
To: Sanjay Subramanian <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Friday, June 12, 2015 7:15 AM
Subject: Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError:
Java heap space
It sounds like this might be caused by a memory configuration problem. In
addition to looking at the executor memory, I'd also bump up the driver memory,
since it appears that your shell is running out of memory when collecting a
large query result.
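
As a rough sketch of the suggestion above, the command from the original post could be restarted with --driver-memory added next to the executor flags. The 4G driver value is only an assumed starting point, not a tested setting; the path, query and executor values are the ones quoted in this thread. The script below is a dry run: it builds the command line and prints it rather than launching anything.

```shell
#!/bin/sh
# Dry-run sketch: assemble the spark-sql command line and print it.
# DRIVER_MEM is an assumption; the other values are quoted from the thread.
SPARK_SQL=/opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql
DRIVER_MEM=4G      # assumed heap size for the CLI (driver) JVM
EXECUTOR_MEM=4G    # executor setting already tried in the thread
TOTAL_CORES=24     # core count already tried in the thread
QUERY="select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view"

# --driver-memory sizes the JVM that receives the collected query result;
# --executor-memory only sizes the worker JVMs.
CMD="$SPARK_SQL --driver-memory $DRIVER_MEM --executor-memory $EXECUTOR_MEM --total-executor-cores $TOTAL_CORES -e \"$QUERY\""

echo "$CMD"
# To actually launch it on the cluster, replace the echo with: eval "$CMD"
```

If the driver still runs out of heap with a larger value, that would point at the size of the collected result rather than at the flag itself.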
Sent from my phone
On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian
<[email protected]> wrote:
hey guys
Using Hive and Impala daily and intensively. Want to transition to spark-sql in
CLI mode.
Currently in my sandbox I am using the Spark (standalone mode) in the CDH
distribution (starving developer version 5.3.3)
3 datanode hadoop cluster
32GB RAM per node
8 cores per node
| spark | 1.2.0+cdh5.3.3+371 |
I am testing some stuff on one view and getting memory errors. A possible
reason is that the default memory per executor showing on 18080 is 512M.
These options, when used to start the spark-sql CLI, do not seem to have any
effect:
--total-executor-cores 12 --executor-memory 4G
/opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e "select distinct
isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view"
aers.aers_demo_view (7 million+ records)
===================
isr       bigint  case id
event_dt  bigint  Event date
age       double  age of patient
age_cod   string  days,months years
sex       string  M or F
year      int
quarter   int
VIEW DEFINITION
================
CREATE VIEW `aers.aers_demo_view` AS
SELECT `isr` AS `isr`, `event_dt` AS `event_dt`, `age` AS `age`, `age_cod` AS `age_cod`,
       `gndr_cod` AS `sex`, `year` AS `year`, `quarter` AS `quarter`
FROM (
  SELECT `aers_demo_v1`.`isr`, `aers_demo_v1`.`event_dt`, `aers_demo_v1`.`age`, `aers_demo_v1`.`age_cod`,
         `aers_demo_v1`.`gndr_cod`, `aers_demo_v1`.`year`, `aers_demo_v1`.`quarter`
  FROM `aers`.`aers_demo_v1`
  UNION ALL
  SELECT `aers_demo_v2`.`isr`, `aers_demo_v2`.`event_dt`, `aers_demo_v2`.`age`, `aers_demo_v2`.`age_cod`,
         `aers_demo_v2`.`gndr_cod`, `aers_demo_v2`.`year`, `aers_demo_v2`.`quarter`
  FROM `aers`.`aers_demo_v2`
  UNION ALL
  SELECT `aers_demo_v3`.`isr`, `aers_demo_v3`.`event_dt`, `aers_demo_v3`.`age`, `aers_demo_v3`.`age_cod`,
         `aers_demo_v3`.`gndr_cod`, `aers_demo_v3`.`year`, `aers_demo_v3`.`quarter`
  FROM `aers`.`aers_demo_v3`
  UNION ALL
  SELECT `aers_demo_v4`.`isr`, `aers_demo_v4`.`event_dt`, `aers_demo_v4`.`age`, `aers_demo_v4`.`age_cod`,
         `aers_demo_v4`.`gndr_cod`, `aers_demo_v4`.`year`, `aers_demo_v4`.`quarter`
  FROM `aers`.`aers_demo_v4`
  UNION ALL
  SELECT `aers_demo_v5`.`primaryid` AS `ISR`, `aers_demo_v5`.`event_dt`, `aers_demo_v5`.`age`, `aers_demo_v5`.`age_cod`,
         `aers_demo_v5`.`gndr_cod`, `aers_demo_v5`.`year`, `aers_demo_v5`.`quarter`
  FROM `aers`.`aers_demo_v5`
  UNION ALL
  SELECT `aers_demo_v6`.`primaryid` AS `ISR`, `aers_demo_v6`.`event_dt`, `aers_demo_v6`.`age`, `aers_demo_v6`.`age_cod`,
         `aers_demo_v6`.`sex` AS `GNDR_COD`, `aers_demo_v6`.`year`, `aers_demo_v6`.`quarter`
  FROM `aers`.`aers_demo_v6`
) `aers_demo_view`
15/06/11 08:36:36 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0x01b99855, /10.0.0.19:58117 => /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
    at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
    at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
    at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
    at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.newCumulationBuffer(FrameDecoder.java:507)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.updateCumulation(FrameDecoder.java:345)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:312)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/06/11 08:36:40 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Long.valueOf(Long.java:577)
    at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:113)
    at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:103)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:171)
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
    at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:558)
    at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:352)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:80)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/06/11 08:36:38 ERROR ActorSystemImpl: exception on LARS’ timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at akka.dispatch.AbstractNodeQueue.<init>(AbstractNodeQueue.java:19)
    at akka.actor.LightArrayRevolverScheduler$TaskQueue.<init>(Scheduler.scala:431)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:397)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "task-result-getter-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Long.valueOf(Long.java:577)
    at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:113)
    at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:103)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:171)
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
    at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:558)
    at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:352)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:80)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/06/11 08:36:41 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-scheduler-1] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at akka.dispatch.AbstractNodeQueue.<init>(AbstractNodeQueue.java:19)
    at akka.actor.LightArrayRevolverScheduler$TaskQueue.<init>(Scheduler.scala:431)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:397)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
    at java.lang.Thread.run(Thread.java:745)
15/06/11 08:36:46 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
15/06/11 08:36:46 ERROR SparkSQLDriver: Failed in [select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view]
org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
    at akka.actor.ActorCell.terminate(ActorCell.scala:338)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
    at akka.dispatch.Mailbox.run(Mailbox.scala:218)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/06/11 08:36:51 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0x79935a9b, /10.0.0.35:54028 => /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
15/06/11 08:36:52 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
15/06/11 08:36:53 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0xcb8c4b5d, /10.0.0.18:46744 => /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
15/06/11 08:36:56 WARN NioEventLoop: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded
15/06/11 08:36:57 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
15/06/11 08:36:58 ERROR Utils: Uncaught exception in thread task-result-getter-3
java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
15/06/11 08:37:01 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
Time taken: 70.982 seconds
15/06/11 08:37:06 WARN QueuedThreadPool: 4 threads could not be stopped
15/06/11 08:37:11 ERROR MapOutputTrackerMaster: Error communicating with MapOutputTracker
akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/MapOutputTracker#-2109395547]] had already been terminated.
    at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
    at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:111)
    at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:122)
    at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:330)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1210)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)
Exception in thread "Thread-3" org.apache.spark.SparkException: Error communicating with MapOutputTracker
    at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:116)
    at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:122)
    at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:330)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1210)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)
Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/MapOutputTracker#-2109395547]] had already been terminated.
    at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
    at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:111)