How big is the data set? Does it work when you copy it to HDFS?
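
If it's a manageable size, one way to rule S3 out is to pull it onto the cluster's
HDFS first and point parquetFile at that. Roughly (the bucket and paths here are
placeholders, and it assumes your AWS keys are available to Hadoop):

  hadoop distcp s3n://<bucket>/path-to-parquet-dir hdfs:///tmp/parquet-test

and then in the shell:

  val p = sqlContext.parquetFile("hdfs:///tmp/parquet-test")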

-Manu


On Mon, Sep 8, 2014 at 2:58 PM, Jim Carroll <jimfcarr...@gmail.com> wrote:

> Hello all,
>
> I've been wrestling with this problem all day and any suggestions would be
> greatly appreciated.
>
> I'm trying to test reading a Parquet file stored in S3 from a Spark cluster
> deployed on EC2. The following works in the spark shell when run completely
> locally on my own machine (i.e. no --master option passed to the spark-shell
> command):
>
> // create a SQLContext on top of the existing SparkContext (sc)
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
> // load the Parquet directory from S3 and register it as a table
> val p = parquetFile("s3n://[bucket]/path-to-parquet-dir/")
> p.registerAsTable("s")
> // simple sanity-check query
> sql("select count(*) from s").collect
>
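> One thing I haven't ruled out is whether this is a credentials problem on the
> workers. Locally the query works, so the S3 keys are clearly being picked up
> there; on the cluster I could also try setting them explicitly in the shell
> before running the query, something like this (with <access-key> and
> <secret-key> as placeholders):
>
> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")
>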
> I have an EC2 deployment of Spark (I tried both 1.0.2 and 1.1.0-rc4) using
> the standalone cluster manager, deployed with the spark-ec2 script.
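>
> For reference, the launch was along these lines (the key pair, identity file,
> slave count and cluster name below are placeholders, not my exact values):
>
>   ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>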
>
> Running the same code in a spark-shell connected to the cluster, it basically
> hangs on the select statement. The executors simply time out and restart
> every 30 seconds when they hit what appears to be an activity timeout, as if
> there were no activity coming from the spark-shell (based on what I see in
> the stderr logs for the job; I assume this is the expected behavior when the
> connected spark-shell is sitting idle).
>
> I see these messages about every 30 seconds:
>
> 14/09/08 17:43:08 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated:
> app-20140908213842-0002/7 is now EXITED (Command exited with code 1)
> 14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Executor
> app-20140908213842-0002/7 removed: Command exited with code 1
> 14/09/08 17:43:09 INFO AppClient$ClientActor: Executor added:
> app-20140908213842-0002/8 on
> worker-20140908183422-ip-10-60-107-194.ec2.internal-53445
> (ip-10-60-107-194.ec2.internal:53445) with 2 cores
> 14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140908213842-0002/8 on hostPort ip-10-60-107-194.ec2.internal:53445
> with 2 cores, 4.0 GB RAM
> 14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated:
> app-20140908213842-0002/8 is now RUNNING
>
> Eventually it fails with:
>
> 14/09/08 17:44:16 INFO AppClient$ClientActor: Executor updated:
> app-20140908213842-0002/9 is now EXITED (Command exited with code 1)
> 14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Executor
> app-20140908213842-0002/9 removed: Command exited with code 1
> 14/09/08 17:44:16 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: Master removed our application: FAILED
> 14/09/08 17:44:16 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
> have all completed, from pool
> 14/09/08 17:44:16 INFO TaskSchedulerImpl: Cancelling stage 1
> 14/09/08 17:44:16 INFO DAGScheduler: Failed to run collect at
> SparkPlan.scala:85
> 14/09/08 17:44:16 INFO SparkUI: Stopped Spark web UI at
> http://192.168.10.198:4040
> 14/09/08 17:44:16 INFO DAGScheduler: Stopping DAGScheduler
> 14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Shutting down all
> executors
> 14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Asking each executor to
> shut down
> 14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Asking each executor to
> shut down
> org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> As for the "Initial job has not accepted any resources" warning: I'm running
> the spark-shell command with:
>
> SPARK_MEM=2g ./spark-shell --master
> spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077
>
> According to the master web UI each node has 6 GB, so I'm not sure why I'm
> seeing that message either. If I run with less than 2g, I get the following
> in my spark-shell:
>
> 14/09/08 17:47:38 INFO Remoting: Remoting shut down
> 14/09/08 17:47:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
> shut down.
> java.io.IOException: Error reading summaries
>         at parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:128)
>         ....
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>
> I'm not sure whether this exception comes from the spark-shell JVM itself,
> from the master, or from a worker and was relayed back through the master.
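>
> (Side question: should I be passing memory to spark-shell with the newer
> --driver-memory / --executor-memory flags instead of SPARK_MEM, which I
> gather is deprecated in 1.x? E.g. something like:
>
>   ./spark-shell --master spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077 \
>     --driver-memory 2g --executor-memory 2g
>
> I don't know whether that would change anything here.)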
>
> Any help would be greatly appreciated.
>
> Thanks
> Jim
