There is no way to upgrade a running cluster in place here. You can stop
the cluster and simply launch a new one with the newer Spark version, in
the same way you started the original cluster. That ought to be simple;
the only issue, I suppose, is the downtime, since you have to shut the
whole thing down, but maybe that's acceptable.
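
Roughly, the cycle would look something like this (cluster name, key pair,
and slave count here are just placeholders; reuse whatever options you
launched with originally, plus the new --spark-version):

  # stop or destroy the old cluster once its data is safe elsewhere
  ./spark-ec2 --key-pair=my-key --identity-file=my-key.pem stop my-cluster

  # launch a fresh cluster on the newer Spark release
  ./spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
      --slaves=3 --spark-version=1.5.2 launch my-cluster-new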

If you have data, including HDFS, set up on the ephemeral disks, then yes,
that data is lost when the instances are stopped. Really, that's an
'ephemeral' HDFS cluster. It has nothing to do with how Spark partitions
your data.
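
If you want to double-check where your data actually lives, you can look
at both filesystems from the master node. The paths below assume the
standard spark-ec2 layout under /root; persistent-hdfs only has data if
you started and used it:

  # the HDFS that sits on the ephemeral (instance-store) disks
  /root/ephemeral-hdfs/bin/hadoop dfsadmin -report
  /root/ephemeral-hdfs/bin/hadoop fs -ls /user/root

  # the persistent HDFS, which survives a stop/start of the instances
  /root/persistent-hdfs/bin/hadoop dfsadmin -report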

You would want to get the data out to S3 first, and then copy it back in
later. Yes, it's manual, but it works fine.
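
One way to do that copy is distcp, run from the cluster itself. A rough
sketch, where the bucket and paths are made up and the install layout is
assumed to be the usual spark-ec2 one; Hadoop also needs your AWS
credentials (e.g. fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey in
core-site.xml):

  # back up the HDFS data to S3 before tearing the old cluster down
  /root/ephemeral-hdfs/bin/hadoop distcp \
      hdfs:///user/root/streaming-output s3n://my-bucket/backup/streaming-output

  # on the new cluster, copy it back into HDFS
  /root/ephemeral-hdfs/bin/hadoop distcp \
      s3n://my-bucket/backup/streaming-output hdfs:///user/root/streaming-output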

For more production use cases on Amazon, you probably want to look into a
distribution or product built around Spark rather than managing it
yourself. That could be AWS's own EMR, Databricks Cloud, or even CDH
running on AWS. Those would give you a much better chance of getting
updates automatically and so on, but they're fairly different products.

On Fri, Dec 4, 2015 at 3:21 AM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> Hello,
> I, too, have the same questions in mind.
> What advantages does EC2 offer, compared to normal servers, for Spark and
> other big data product development?
> Hope to get input from the community.
>
> Thanks,
> Divya
>
> On Dec 4, 2015 6:05 AM, "Andy Davidson" <a...@santacruzintegration.com>
> wrote:
>>
>> About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
>> runs a Spark Streaming app 7x24 and stores the data to HDFS. I also need to
>> run some batch analytics on the data.
>>
>> Now that I have a little more experience, I wonder whether this was a good
>> way to set up the cluster, given the following issues:
>>
>> I have not been able to find explicit directions for upgrading the Spark
>> version:
>>
>>
>> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2&subj=Re+Upgrading+Spark+in+EC2+clusters
>>
>> I am not sure where the data is physically stored. I think I may
>> accidentally lose all my data.
>>
>> spark-ec2 makes it easy to launch a cluster with as many machines as you
>> like; however, it's not clear how I would add slaves to an existing
>> installation.
>>
>>
>> In our Java streaming app we call rdd.saveAsTextFile("hdfs://path");
>>
>> ephemeral-hdfs/conf/hdfs-site.xml:
>>
>>   <property>
>>
>>     <name>dfs.data.dir</name>
>>
>>     <value>/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value>
>>
>>   </property>
>>
>>
>> persistent-hdfs/conf/hdfs-site.xml
>>
>>
>> $ mount
>>
>> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>>
>> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>>
>>
>> http://spark.apache.org/docs/latest/ec2-scripts.html
>>
>>
>> "The spark-ec2 script also supports pausing a cluster. In this case, the
>> VMs are stopped but not terminated, so they lose all data on ephemeral disks
>> but keep the data in their root partitions and their persistent-pdfs.”
>>
>>
>> Initially I thought using HDFS was a good idea. spark-ec2 makes HDFS easy
>> to use. I incorrectly thought Spark somehow knew how HDFS partitioned my
>> data.
>>
>> I think many people are using Amazon S3. I do not have any direct
>> experience with S3. My concern would be that the data is not physically
>> stored close to my slaves, i.e. high communication costs.
>>
>> Any suggestions would be greatly appreciated
>>
>> Andy
