Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Augustus Hong
Hi all,

I'm trying to launch a cluster with the spark-ec2 script but seeing the
error below.  Are the packages on S3 corrupted / not in the correct format?

Initializing spark

--2016-04-13 00:25:39--
http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz

Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.11.67

Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.11.67|:80...
connected.

HTTP request sent, awaiting response... 200 OK

Length: 277258240 (264M) [application/x-compressed]

Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’

100%[==>] 277,258,240 37.6MB/s   in 9.2s

2016-04-13 00:25:49 (28.8 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved
[277258240/277258240]

Unpacking Spark


gzip: stdin: not in gzip format

tar: Child returned status 1

tar: Error is not recoverable: exiting now

mv: missing destination file operand after `spark'

Try `mv --help' for more information.
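
In case it helps with debugging, this is roughly what I ran by hand on the
master to check what actually got downloaded (just a manual sanity check, not
part of spark-ec2; the filename is whatever ended up in the working directory):

# Should report "gzip compressed data"; anything else means the download
# is not a valid .tgz.
file spark-1.6.1-bin-hadoop1.tgz

# Exits non-zero with an error message if the archive is corrupt.
gzip -t spark-1.6.1-bin-hadoop1.tgz && echo "archive looks OK"

# If S3 served an error document instead of the tarball, it shows up as
# readable XML in the first few hundred bytes.
head -c 300 spark-1.6.1-bin-hadoop1.tgz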






-- 
Augustus Hong
Software Engineer


Spark Streaming Running Out Of Memory in 1.5.0.

2015-12-03 Thread Augustus Hong
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
>   at scala.xml.Utility$.serialize(Utility.scala:227)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
>   at scala.xml.Utility$.serialize(Utility.scala:227)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
>   at scala.xml.Utility$.serialize(Utility.scala:227)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at scala.xml.Utility$$anonfun$sequenceToXML$2.apply(Utility.scala:256)
>   at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.xml.Utility$.sequenceToXML(Utility.scala:256)
>   at scala.xml.Utility$.serialize(Utility.scala:227)





-- 
Augustus Hong
 Data Analytics | Branch Metrics
 m 650-391-3369 | e augus...@branch.io


Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Augustus Hong
Thanks for the info and the tip! I'll look into writing our own script
based on the spark-ec2 scripts.
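
For reference, the rough shape of the script I have in mind is below. It's
completely untested, and the paths and variable names are just my guess at
the spark-ec2 layout Jason describes (spark/init.sh, spark/setup.sh,
copy-dir), so please treat it as a sketch rather than something that runs
as-is:

#!/bin/bash
# Hypothetical upgrade-spark.sh, run as root on the EC2 master.
# Assumes the usual spark-ec2 install layout under /root.
set -e

NEW_VERSION="1.5.2"

# Point the spark-ec2 variables at the new version (the exact file and
# variable name are assumptions based on the generated ec2-variables.sh).
cd /root/spark-ec2
sed -i "s|SPARK_VERSION=.*|SPARK_VERSION=\"$NEW_VERSION\"|" ec2-variables.sh

# Re-run the init module to fetch the new Spark build on the master, then
# setup.sh, which uses copy-dir to push it out to the slaves.
./spark/init.sh
./spark/setup.sh

# Restart the standalone cluster on the new version.
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh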


Best,
Augustus

On Thu, Nov 12, 2015 at 10:01 AM, Jason Rubenstein <
jasondrubenst...@gmail.com> wrote:

> Hi,
>
> With some minor changes to spark-ec2/spark/init.sh and writing your own
>  "upgrade-spark.sh" script, you can upgrade spark in place.
>
> (Make sure to call not only spark/init.sh but also spark/setup.sh, because
> the latter uses copy-dir to get your new version of Spark to the slaves.)
>
> I wrote one so we could upgrade to a specific version of Spark (via
> commit-hash) and used it to upgrade from 1.4.1 to 1.5.0.
>
> best,
> Jason
>
>
> On Thu, Nov 12, 2015 at 9:49 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> spark-ec2 does not offer a way to upgrade an existing cluster, and from
>> what I gather, it wasn't intended to be used to manage long-lasting
>> infrastructure. The recommended approach really is to just destroy your
>> existing cluster and launch a new one with the desired configuration.
>>
>> If you want to upgrade the cluster in place, you'll probably have to do
>> that manually. Otherwise, perhaps spark-ec2 is not the right tool, and
>> instead you want one of those "grown-up" management tools like Ansible,
>> which can be set up to allow in-place upgrades. That'll take a bit of work,
>> though.
>>
>> Nick
>>
>> On Wed, Nov 11, 2015 at 6:01 PM Augustus Hong <augus...@branchmetrics.io>
>> wrote:
>>
>>> Hey All,
>>>
>>> I have a Spark cluster (running version 1.5.0) on EC2 launched with the
>>> provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same
>>> cluster, what's the safest / recommended way to do that?
>>>
>>>
>>> I know I can spin up a new cluster running 1.5.2, but it doesn't seem
>>> efficient to spin up a new cluster every time we need to upgrade.
>>>
>>>
>>> Thanks,
>>> Augustus
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Augustus Hong
>>>  Data Analytics | Branch Metrics
>>>  m 650-391-3369 | e augus...@branch.io
>>>
>>
>


-- 
Augustus Hong
 Data Analytics | Branch Metrics
 m 650-391-3369 | e augus...@branch.io


Upgrading Spark in EC2 clusters

2015-11-11 Thread Augustus Hong
Hey All,

I have a Spark cluster (running version 1.5.0) on EC2 launched with the
provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same
cluster, what's the safest / recommended way to do that?


I know I can spin up a new cluster running 1.5.2, but it doesn't seem
efficient to spin up a new cluster every time we need to upgrade.


Thanks,
Augustus





-- 
Augustus Hong
 Data Analytics | Branch Metrics
 m 650-391-3369 | e augus...@branch.io


Re: Multiple Spark Streaming Jobs on Single Master

2015-10-23 Thread Augustus Hong
How did you specify the number of cores each executor can use?

Be sure to use this when submitting jobs with spark-submit:
--total-executor-cores 100

In my experience, other options won't work.
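
For example, each of our streaming jobs gets submitted along these lines (the
class, jar, and host names are placeholders, and the numbers depend on your
cluster):

# Cap the total cores this application may take from the cluster so that
# other streaming jobs can still be scheduled alongside it.
./bin/spark-submit \
  --master spark://<master-host>:7077 \
  --class com.example.MyStreamingJob \
  --executor-memory 4G \
  --total-executor-cores 8 \
  /path/to/my-streaming-job.jar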

On Fri, Oct 23, 2015 at 8:36 AM, gaurav sharma <sharmagaura...@gmail.com>
wrote:

> Hi,
>
> I created 2 workers on the same machine, each with 4 cores and 6 GB of RAM.
>
> I submitted the first job, and it allocated 2 cores on each of the worker
> processes and used the full 4 GB of RAM for each executor process.
>
> When I submit my second job, it always stays in the WAITING state.
>
>
> Cheers!!
>
>
>
> On Tue, Oct 20, 2015 at 10:46 AM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> You can set the max cores for the first submitted job such that it does
>> not take all the resources from the master. See
>> http://spark.apache.org/docs/latest/submitting-applications.html
>>
>> # Run on a Spark standalone cluster in client deploy mode
>> ./bin/spark-submit \
>>   --class org.apache.spark.examples.SparkPi \
>>   --master spark://207.184.161.138:7077 \
>>   --executor-memory 20G \
>>   --total-executor-cores 100 \
>>   /path/to/examples.jar \
>>   1000
>>
>>
>> On Mon, Oct 19, 2015 at 4:26 PM, Augustus Hong <augus...@branchmetrics.io
>> > wrote:
>>
>>> Hi All,
>>>
>>> Would it be possible to run multiple spark streaming jobs on a single
>>> master at the same time?
>>>
>>> I currently have one master node and several worker nodes in the
>>> standalone mode, and I used spark-submit to submit multiple spark streaming
>>> jobs.
>>>
>>> From what I observed, it seems like only the first submitted job would
>>> get resources and run.  Jobs submitted afterwards will have the status
>>> "Waiting", and will only run after the first one is finished or killed.
>>>
>>> I tried limiting each executor to only 1 core (each worker machine has 8
>>> cores), but the same thing happens: only one job runs, even though there
>>> are a lot of idle cores.
>>>
>>> Best,
>>> Augustus
>>>
>>>
>>>
>>> --
>>> Augustus Hong
>>>  Data Analytics | Branch Metrics
>>>  m 650-391-3369 | e augus...@branch.io
>>>
>>
>>
>


-- 
Augustus Hong
 Data Analytics | Branch Metrics
 m 650-391-3369 | e augus...@branch.io


Multiple Spark Streaming Jobs on Single Master

2015-10-19 Thread Augustus Hong
Hi All,

Would it be possible to run multiple spark streaming jobs on a single
master at the same time?

I currently have one master node and several worker nodes in the standalone
mode, and I used spark-submit to submit multiple spark streaming jobs.

From what I observed, it seems like only the first submitted job would get
resources and run.  Jobs submitted afterwards will have the status
"Waiting", and will only run after the first one is finished or killed.

I tried limiting each executor to only 1 core (each worker machine has 8
cores), but the same thing happens: only one job runs, even though there are
a lot of idle cores.

Best,
Augustus



-- 
Augustus Hong
 Data Analytics | Branch Metrics
 m 650-391-3369 | e augus...@branch.io


Adding / Removing worker nodes for Spark Streaming

2015-09-28 Thread Augustus Hong
Hey all,

I'm evaluating using Spark Streaming with Kafka direct streaming, and I
have a couple of questions:

1.  Would it be possible to add / remove worker nodes without stopping and
restarting the Spark Streaming driver?  (A rough sketch of what I have in
mind is below.)

2.  I understand that we can enable checkpointing to recover from node
failures, and that it doesn't work across code changes.  What about the case
where worker nodes fail under load, we add more worker nodes, and then
restart Spark Streaming?  Would this incur data loss as well?
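
For question 1, the manual approach I have in mind looks roughly like this
(assuming the spark-ec2 install location of /root/spark on the new node; I'm
not sure whether this is safe while a streaming job is running, hence the
question):

# On a freshly provisioned worker node: register it with the running
# standalone master without touching the driver.
/root/spark/sbin/start-slave.sh spark://<master-host>:7077

# To take a node out of the cluster again (or just stop the Worker process
# on that node if this script isn't present in your Spark version).
/root/spark/sbin/stop-slave.sh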


Best,
Augustus

-- 
Augustus Hong
 Data Analytics | Branch Metrics
 m 650-391-3369 | e augus...@branch.io