If you are running on AWS, I would recommend using S3 instead of HDFS as a general practice whenever you are maintaining state or data. That way you can treat your Spark clusters as ephemeral compute resources that you can swap out easily -- e.g., if something breaks, just spin up a fresh cluster and redirect your workload rather than fighting a fire while trying to debug and fix the broken cluster. It simplifies operations once you are in production.
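For example, here is a minimal sketch in Java of what that looks like from the app's side (the bucket, paths, and keys are placeholders, and it assumes the hadoop-aws S3A connector jars are on the classpath):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3WriteExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("s3-write-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Credentials can also come from an IAM instance role, which is
        // preferable to hard-coding keys; these two values are placeholders.
        sc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        sc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Because the output lands in S3 rather than ephemeral HDFS, the
        // cluster can be torn down and replaced without losing any data.
        lines.saveAsTextFile("s3a://my-bucket/output/run-001"); // hypothetical path

        sc.stop();
    }
}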
M

Sent from my iPhone

> On Dec 4, 2015, at 6:42 AM, Sean Owen <so...@cloudera.com> wrote:
>
> There is no way to upgrade a running cluster here. You can stop a
> cluster and simply start a new cluster in the same way you started the
> original cluster. That ought to be simple; the only issue, I suppose,
> is that you have downtime, since you have to shut the whole thing
> down, but maybe that's acceptable.
>
> If you have data, including HDFS, set up on ephemeral disks, then yes,
> that is lost. Really, that's an 'ephemeral' HDFS cluster. It has
> nothing to do with partitions.
>
> You would want to get the data out to S3 first, and then copy it back
> in later. Yes, it's manual, but it works fine.
>
> For more production use cases on Amazon, you probably want to look
> into a distribution or product around Spark rather than manage it
> yourself. That could be AWS's own EMR, Databricks cloud, or even CDH
> running on AWS. Those would give you a much better chance of
> automatically getting updates and so on, but they are fairly different
> products.
>
>> On Fri, Dec 4, 2015 at 3:21 AM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>> Hello,
>> I have the same questions in mind: what do we gain by using EC2,
>> compared with ordinary servers, for Spark and other big data product
>> development? I hope to get inputs from the community.
>>
>> Thanks,
>> Divya
>>
>> On Dec 4, 2015 6:05 AM, "Andy Davidson" <a...@santacruzintegration.com> wrote:
>>>
>>> About two months ago I used spark-ec2 to set up a small cluster. The
>>> cluster runs a Spark Streaming app 24x7 and stores the data to HDFS.
>>> I also need to run some batch analytics on the data.
>>>
>>> Now that I have a little more experience, I wonder whether this was
>>> a good way to set up the cluster, given the following issues:
>>>
>>> I have not been able to find explicit directions for upgrading the
>>> Spark version:
>>>
>>> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2&subj=Re+Upgrading+Spark+in+EC2+clusters
>>>
>>> I am not sure where the data is physically stored; I am afraid I may
>>> accidentally lose all my data.
>>>
>>> spark-ec2 makes it easy to launch a cluster with as many machines as
>>> you like; however, it is not clear how I would add slaves to an
>>> existing installation.
>>>
>>> In our Java streaming app we call rdd.saveAsTextFile("hdfs://path");
>>>
>>> ephemeral-hdfs/conf/hdfs-site.xml:
>>>
>>> <property>
>>>   <name>dfs.data.dir</name>
>>>   <value>/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value>
>>> </property>
>>>
>>> persistent-hdfs/conf/hdfs-site.xml
>>>
>>> $ mount
>>> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>>> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>>>
>>> http://spark.apache.org/docs/latest/ec2-scripts.html
>>>
>>> "The spark-ec2 script also supports pausing a cluster. In this case,
>>> the VMs are stopped but not terminated, so they lose all data on
>>> ephemeral disks but keep the data in their root partitions and their
>>> persistent-hdfs."
>>>
>>> Initially I thought using HDFS was a good idea; spark-ec2 makes HDFS
>>> easy to use. I incorrectly thought Spark somehow knew how HDFS
>>> partitioned my data.
>>>
>>> I think many people are using Amazon S3. I do not have any direct
>>> experience with S3. My concern would be that the data is not
>>> physically stored close to my slaves, i.e., high communication
>>> costs.
>>>
>>> Any suggestions would be greatly appreciated.
>>>
>>> Andy
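To make the copy-out/copy-in step Sean describes above concrete, here is a minimal sketch of the export written as a small Spark job in Java (all paths and the bucket name are hypothetical; running hadoop distcp between hdfs:// and s3a:// URIs accomplishes the same thing without writing any code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsToS3Backup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hdfs-to-s3-backup");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read everything the streaming app has written to ephemeral HDFS...
        JavaRDD<String> data = sc.textFile("hdfs:///streaming/output/*"); // hypothetical path

        // ...and copy it to S3 before stopping or terminating the cluster.
        // After launching a replacement cluster, running the reverse
        // (textFile on s3a://, saveAsTextFile to hdfs://) restores the data.
        data.saveAsTextFile("s3a://my-bucket/backup/streaming-output"); // hypothetical bucket

        sc.stop();
    }
}

Note that saveAsTextFile fails if the destination directory already exists, so each backup run needs a fresh output path.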