Spark 1.6.1 packages on S3 corrupt?
Hi all,

I'm trying to launch a cluster with the spark-ec2 script but am seeing the error below. Are the packages on S3 corrupted / not in the correct format?

Initializing spark
--2016-04-13 00:25:39--  http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.11.67
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.11.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277258240 (264M) [application/x-compressed]
Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’

100%[==>] 277,258,240  37.6MB/s  in 9.2s

2016-04-13 00:25:49 (28.8 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved [277258240/277258240]

Unpacking Spark
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
mv: missing destination file operand after `spark'
Try `mv --help' for more information.

--
Augustus Hong
Software Engineer, Branch
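One quick way to tell whether a downloaded ".tgz" like the one in the log above is really a gzip archive is to look at its magic bytes before piping it into tar: a gzip stream always begins with 0x1f 0x8b, while a failed fetch often leaves an error page behind. A minimal sketch (the two demo files here are stand-ins, not the real download):

```shell
#!/bin/sh
# Sketch: check that a downloaded "tarball" really is gzip data before
# untarring it. A gzip stream always starts with the magic bytes 0x1f 0x8b;
# an HTML error page or truncated object will not.
check_gzip() {
    # Read the first two bytes as hex, strip spaces
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' ')
    if [ "$magic" = "1f8b" ]; then
        echo "$1: looks like gzip"
    else
        echo "$1: NOT gzip (magic bytes: $magic)"
    fi
}

# Demo with two local files standing in for the real download:
printf '<html>AccessDenied</html>' > fake.tgz   # what a failed fetch can leave behind
printf 'hello' | gzip > real.tgz                # a genuine gzip stream
check_gzip fake.tgz
check_gzip real.tgz
```

In practice `gzip -t spark-1.6.1-bin-hadoop1.tgz`, or comparing against a published checksum, does the same job; a "not in gzip format" error on a file of the expected size usually means the object served by S3 was not the archive you expected.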
Spark Streaming Running Out Of Memory in 1.5.0.
> at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
> at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
> at scala.xml.Utility$.serialize(Utility.scala:227)
> [the six frames above repeat several more times]

--
Augustus Hong
Data Analytics | Branch Metrics
m 650-391-3369 | e augus...@branch.io
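The quoted trace is the same handful of scala.xml frames cycling over and over, which is the signature of runaway recursion in the XML serializer rather than many distinct callers. When a trace runs to thousands of lines, counting distinct frames makes the cycle obvious; a small sketch (the frames below are copied from the trace above):

```shell
#!/bin/sh
# Sketch: count distinct stack frames in a long trace. A few frames that
# each appear many times point at unbounded recursion (and hence the memory
# blow-up) rather than a wide variety of call sites.
cat > trace.txt <<'EOF'
at scala.xml.Utility$.serialize(Utility.scala:227)
at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
at scala.xml.Utility$.serialize(Utility.scala:227)
at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
EOF

# sort | uniq -c groups identical frames and prefixes each with its count;
# sorting by count surfaces the repeating cycle first.
sort trace.txt | uniq -c | sort -rn
```

Here every distinct frame appears twice in the six-line sample; on the full trace the repeat counts would be in the hundreds, confirming the recursion.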
Re: Upgrading Spark in EC2 clusters
Thanks for the info and the tip! I'll look into writing our own script based on the spark-ec2 scripts.

Best,
Augustus

On Thu, Nov 12, 2015 at 10:01 AM, Jason Rubenstein <jasondrubenst...@gmail.com> wrote:

> Hi,
>
> With some minor changes to spark-ec2/spark/init.sh and writing your own
> "upgrade-spark.sh" script, you can upgrade Spark in place.
>
> (Make sure to call not only spark/init.sh but also spark/setup.sh, because
> the latter uses copy-dir to get your new version of Spark to the slaves.)
>
> I wrote one so we could upgrade to a specific version of Spark (via
> commit hash) and used it to upgrade from 1.4.1 to 1.5.0.
>
> Best,
> Jason
>
> On Thu, Nov 12, 2015 at 9:49 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> spark-ec2 does not offer a way to upgrade an existing cluster, and from
>> what I gather, it wasn't intended to be used to manage long-lasting
>> infrastructure. The recommended approach really is to just destroy your
>> existing cluster and launch a new one with the desired configuration.
>>
>> If you want to upgrade the cluster in place, you'll probably have to do
>> that manually. Otherwise, perhaps spark-ec2 is not the right tool, and
>> instead you want one of those "grown-up" management tools like Ansible,
>> which can be set up to allow in-place upgrades. That'll take a bit of
>> work, though.
>>
>> Nick
>>
>> On Wed, Nov 11, 2015 at 6:01 PM Augustus Hong <augus...@branchmetrics.io> wrote:
>>
>>> Hey All,
>>>
>>> I have a Spark cluster (running version 1.5.0) on EC2 launched with the
>>> provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the
>>> same cluster, what's the safest / recommended way to do that?
>>>
>>> I know I can spin up a new cluster running 1.5.2, but it doesn't seem
>>> efficient to spin up a new cluster every time we need to upgrade.
>>>
>>> Thanks,
>>> Augustus
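Jason's recipe above can be sketched roughly as follows. This is a hypothetical upgrade-spark.sh, not his actual script: the paths follow the spark-ec2 layout mentioned in the thread (init.sh fetches the new build on the master, setup.sh's copy-dir pushes it to the slaves), and the DRY_RUN guard is an illustrative addition so the steps can be read without a live cluster:

```shell
#!/bin/sh
# Hypothetical upgrade-spark.sh sketch for a spark-ec2 cluster, following
# the recipe in the thread: re-run spark/init.sh to fetch the new Spark
# build on the master, then spark/setup.sh so copy-dir distributes it.
VERSION="${1:-1.5.2}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 on a real cluster

run() {
    # Echo the command in dry-run mode; execute it otherwise
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run /root/spark-ec2/spark/init.sh    # downloads/unpacks the configured Spark build
run /root/spark-ec2/spark/setup.sh   # copy-dir pushes /root/spark out to the slaves
echo "upgrade to Spark $VERSION staged"
```

As Jason notes, running init.sh alone only upgrades the master; it is the setup.sh step that gets the new version onto the slaves.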
Upgrading Spark in EC2 clusters
Hey All,

I have a Spark cluster (running version 1.5.0) on EC2 launched with the provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same cluster, what's the safest / recommended way to do that?

I know I can spin up a new cluster running 1.5.2, but it doesn't seem efficient to spin up a new cluster every time we need to upgrade.

Thanks,
Augustus
Re: Multiple Spark Streaming Jobs on Single Master
How did you specify the number of cores each executor can use? Be sure to use this when submitting jobs with spark-submit: --total-executor-cores 100. Other options didn't work in my experience.

On Fri, Oct 23, 2015 at 8:36 AM, gaurav sharma <sharmagaura...@gmail.com> wrote:

> Hi,
>
> I created 2 workers on the same machine, each with 4 cores and 6GB RAM.
>
> I submitted the first job, and it allocated 2 cores on each of the worker
> processes, and utilized the full 4 GB RAM for each executor process.
>
> When I submit my second job, it always stays in WAITING state.
>
> Cheers!!
>
> On Tue, Oct 20, 2015 at 10:46 AM, Tathagata Das <t...@databricks.com> wrote:
>
>> You can set the max cores for the first submitted job such that it does
>> not take all the resources from the master. See
>> http://spark.apache.org/docs/latest/submitting-applications.html
>>
>> # Run on a Spark standalone cluster in client deploy mode
>> ./bin/spark-submit \
>>   --class org.apache.spark.examples.SparkPi \
>>   --master spark://207.184.161.138:7077 \
>>   --executor-memory 20G \
>>   --total-executor-cores 100 \
>>   /path/to/examples.jar \
>>   1000
>>
>> On Mon, Oct 19, 2015 at 4:26 PM, Augustus Hong <augus...@branchmetrics.io> wrote:
>>
>>> Hi All,
>>>
>>> Would it be possible to run multiple Spark Streaming jobs on a single
>>> master at the same time?
>>>
>>> I currently have one master node and several worker nodes in standalone
>>> mode, and I used spark-submit to submit multiple Spark Streaming jobs.
>>>
>>> From what I observed, it seems like only the first submitted job would
>>> get resources and run. Jobs submitted afterwards will have the status
>>> "Waiting", and will only run after the first one is finished or killed.
>>>
>>> I tried limiting each executor to only 1 core (each worker machine has
>>> 8 cores), but the same thing happens: only one job runs, even though
>>> there are a lot of idle cores.
>>>
>>> Best,
>>> Augustus
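The second job sits in WAITING because, by default, a standalone-mode application grabs every available core. A quick worked example of the core budget for the cluster described in this thread (2 workers with 4 cores each), assuming each job is capped with --total-executor-cores:

```shell
#!/bin/sh
# Worked example: dividing standalone-cluster cores between concurrent jobs.
# Numbers mirror the cluster in the thread: 2 workers with 4 cores each.
WORKERS=2
CORES_PER_WORKER=4
NUM_JOBS=2

TOTAL=$((WORKERS * CORES_PER_WORKER))
PER_JOB=$((TOTAL / NUM_JOBS))

echo "total cores: $TOTAL"
echo "cap each job with: --total-executor-cores $PER_JOB"
```

With each streaming job capped at 4 of the 8 cores, the second submission can be scheduled immediately instead of waiting for the first job to release everything.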
Multiple Spark Streaming Jobs on Single Master
Hi All,

Would it be possible to run multiple Spark Streaming jobs on a single master at the same time?

I currently have one master node and several worker nodes in standalone mode, and I used spark-submit to submit multiple Spark Streaming jobs.

From what I observed, it seems like only the first submitted job would get resources and run. Jobs submitted afterwards will have the status "Waiting", and will only run after the first one is finished or killed.

I tried limiting each executor to only 1 core (each worker machine has 8 cores), but the same thing happens: only one job runs, even though there are a lot of idle cores.

Best,
Augustus
Adding / Removing worker nodes for Spark Streaming
Hey all,

I'm evaluating Spark Streaming with Kafka direct streaming, and I have a couple of questions:

1. Would it be possible to add / remove worker nodes without stopping and restarting the Spark Streaming driver?

2. I understand that we can enable checkpointing to recover from node failures, but that it doesn't work across code changes. What about the case where worker nodes fail due to load -> we add more worker nodes -> restart Spark Streaming? Would this incur data loss as well?

Best,
Augustus