If you are running on AWS, I would recommend using S3 instead of HDFS as a general practice whenever you are maintaining state or data. That way you can treat your Spark clusters as ephemeral compute resources that you can swap out easily -- e.g., if something breaks, just spin up a fresh cluster and redirect your workload rather than fighting a fire while trying to debug and fix the broken cluster. It simplifies operations once you are in production.
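For example, here is a minimal sketch in Java of what that looks like from the app's side (the bucket, paths, and keys are placeholders, and it assumes the hadoop-aws S3A connector jars are on the classpath):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3WriteExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("s3-write-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Credentials can also come from an IAM instance role, which is
        // preferable to hard-coding keys; these two values are placeholders.
        sc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        sc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Because the output lands in S3 rather than ephemeral HDFS, the
        // cluster can be torn down and replaced without losing any data.
        lines.saveAsTextFile("s3a://my-bucket/output/run-001"); // hypothetical path

        sc.stop();
    }
}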
M

Sent from my iPhone

> On Dec 4, 2015, at 6:42 AM, Sean Owen <so...@cloudera.com> wrote:
>
> There is no way to upgrade a running cluster here. You can stop a
> cluster and simply start a new cluster in the same way you started the
> original cluster. That ought to be simple; the only issue, I suppose,
> is that you have downtime, since you have to shut the whole thing
> down, but maybe that's acceptable.
>
> If you have data, including HDFS, set up on ephemeral disks, then yes,
> that is lost. Really, that's an 'ephemeral' HDFS cluster. It has
> nothing to do with partitions.
>
> You would want to get the data out to S3 first, and then copy it back
> in later. Yes, it's manual, but it works fine.
>
> For more production use cases on Amazon, you probably want to look
> into a distribution or product around Spark rather than manage it
> yourself. That could be AWS's own EMR, Databricks cloud, or even CDH
> running on AWS. Those would give you a much better chance of
> automatically getting updates and so on, but they are fairly different
> products.
>
>> On Fri, Dec 4, 2015 at 3:21 AM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>> Hello,
>> I have the same questions in mind: what do we gain by using EC2,
>> compared with ordinary servers, for Spark and other big data product
>> development? I hope to get inputs from the community.
>>
>> Thanks,
>> Divya
>>
>> On Dec 4, 2015 6:05 AM, "Andy Davidson" <a...@santacruzintegration.com> wrote:
>>>
>>> About two months ago I used spark-ec2 to set up a small cluster. The
>>> cluster runs a Spark Streaming app 24x7 and stores the data to HDFS.
>>> I also need to run some batch analytics on the data.
>>>
>>> Now that I have a little more experience, I wonder whether this was
>>> a good way to set up the cluster, given the following issues:
>>>
>>> I have not been able to find explicit directions for upgrading the
>>> Spark version:
>>>
>>> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2&subj=Re+Upgrading+Spark+in+EC2+clusters
>>>
>>> I am not sure where the data is physically stored; I am afraid I may
>>> accidentally lose all my data.
>>>
>>> spark-ec2 makes it easy to launch a cluster with as many machines as
>>> you like; however, it is not clear how I would add slaves to an
>>> existing installation.
>>>
>>> In our Java streaming app we call rdd.saveAsTextFile("hdfs://path");
>>>
>>> ephemeral-hdfs/conf/hdfs-site.xml:
>>>
>>> <property>
>>>   <name>dfs.data.dir</name>
>>>   <value>/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value>
>>> </property>
>>>
>>> persistent-hdfs/conf/hdfs-site.xml
>>>
>>> $ mount
>>> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>>> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>>>
>>> http://spark.apache.org/docs/latest/ec2-scripts.html
>>>
>>> "The spark-ec2 script also supports pausing a cluster. In this case,
>>> the VMs are stopped but not terminated, so they lose all data on
>>> ephemeral disks but keep the data in their root partitions and their
>>> persistent-hdfs."
>>>
>>> Initially I thought using HDFS was a good idea; spark-ec2 makes HDFS
>>> easy to use. I incorrectly thought Spark somehow knew how HDFS
>>> partitioned my data.
>>>
>>> I think many people are using Amazon S3. I do not have any direct
>>> experience with S3. My concern would be that the data is not
>>> physically stored close to my slaves, i.e., high communication
>>> costs.
>>>
>>> Any suggestions would be greatly appreciated.
>>>
>>> Andy
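To make the copy-out/copy-in step Sean describes above concrete, here is a minimal sketch of the export written as a small Spark job in Java (all paths and the bucket name are hypothetical; running hadoop distcp between hdfs:// and s3a:// URIs accomplishes the same thing without writing any code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsToS3Backup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hdfs-to-s3-backup");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read everything the streaming app has written to ephemeral HDFS...
        JavaRDD<String> data = sc.textFile("hdfs:///streaming/output/*"); // hypothetical path

        // ...and copy it to S3 before stopping or terminating the cluster.
        // After launching a replacement cluster, running the reverse
        // (textFile on s3a://, saveAsTextFile to hdfs://) restores the data.
        data.saveAsTextFile("s3a://my-bucket/backup/streaming-output"); // hypothetical bucket

        sc.stop();
    }
}

Note that saveAsTextFile fails if the destination directory already exists, so each backup run needs a fresh output path.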