Howard,

> - node failure?
> - not able to handle if intermediate data > memory size of a node
> - cost
Spark uses recomputation, based on RDD lineage, to provide resiliency in case of node failure, giving node-level recovery behavior much like Hadoop MapReduce's, except with much faster recovery in most cases. Spark is also designed to spill to disk if a given node doesn't have enough RAM to hold its data partitions, so it degrades gracefully to disk-based data handling rather than failing (see the sketch in the P.S. below). As for cost, street RAM prices are around $5/GB, meaning you can have 1TB of RAM for about $5K, so memory is becoming a smaller fraction of total node cost.

If your larger question is "Hadoop MR or Spark?", or more generally, "disk-based or RAM-based distributed computing?", the correct answer is "it depends," and the variables "it" depends on are themselves changing over time. One way to think about this is that there is a cost-benefit crossover point for every unique organization/business-use-case combination, before which disk is preferred and beyond which RAM is preferred. For many mission-critical Wall Street apps, where milliseconds can mean millions of dollars, these crossover points were passed in the mid-2000s. At Google, a large organization with large datasets and high productivity ($1.2M/employee-year), you can see similar crossovers in the late 2000s/early 2010s (cf. PowerDrill). The rest of the industry is undergoing similar evaluations.

The next question to ask is: how are the underlying variables changing? Consider, for example, how latencies are evolving across the technologies in your compute path, even as each gets cheaper per Moore's Law. For RAM outside the L1/L2 caches, we're in the 60ns regime, headed down to 30-40ns. Network latencies are around 100us, headed down to the 10us range. In contrast, disk latencies have bottomed out at 4-5ms, and SSD read latencies are actually going up, from 20us to 30-40us, in exchange for higher densities (a back-of-envelope comparison is in the P.P.S. below). You could make similar projections for bandwidths. These storage technologies certainly have their place, but the point is that whatever your cost-benefit equation for in-memory vs. disk-based use cases is this year, next year it will shift further in favor of memory, and inexorably so the year after that.

So the trends clearly favor in-memory techniques like Spark, and they carry reinforcing positive feedback: as more organizations adopt in-memory technologies, it becomes uncompetitive for laggards to sit on the sidelines for the same use cases. A final thing to keep in mind is that affordable high performance enables use cases that were not possible at all before, such as interactive data science on huge datasets.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Sat, Oct 12, 2013 at 8:24 PM, howard chen <[email protected]> wrote:

> Hello,
>
> I am new to Spark and have only used Hadoop in the past.
>
> I understand Spark is in-memory, as compared to Hadoop, which uses disk
> for intermediate storage. In practical terms the benefit must be
> performance, but what would be the drawbacks?
>
> e.g.
> - node failure?
> - not able to handle if intermediate data > memory size of a node
> - cost
>
> I would like to hear your experience when using Spark to handle big data,
> and what is the work around in the above cases?
>
> Thanks.
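P.S. To make the lineage and spill-to-disk points concrete, here is a
minimal Scala sketch against the standard Spark API. The input path and
the word-count job are placeholders, purely for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    object SpillDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "SpillDemo")

        // Spark records lineage (textFile -> flatMap -> map -> reduceByKey),
        // not the intermediate data itself, so a partition lost to node
        // failure is rebuilt by re-running just these steps on the
        // surviving input blocks.
        val counts = sc.textFile("/tmp/input.txt") // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        // MEMORY_AND_DISK keeps partitions in RAM while they fit and spills
        // the remainder to local disk -- the graceful degradation described
        // above, rather than an out-of-memory failure.
        counts.persist(StorageLevel.MEMORY_AND_DISK)

        println("distinct words: " + counts.count())
        sc.stop()
      }
    }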
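P.P.S. And a back-of-envelope on the latency numbers above -- ballpark
figures from this email, not measurements -- showing how much slower each
tier is than a single ~60ns RAM access:

    object LatencyEnvelope {
      def main(args: Array[String]): Unit = {
        val ramNs = 60.0 // one RAM access today, per the figures above
        val tiers = Seq(
          ("RAM, trend", 35.0),        // 30-40ns midpoint
          ("network, trend", 10.0e3),  // 10us
          ("SSD read, trend", 35.0e3), // 30-40us midpoint
          ("network, today", 100.0e3), // 100us
          ("spinning disk", 4.5e6)     // 4-5ms midpoint
        )
        tiers.foreach { case (name, ns) =>
          printf("%-16s %12.0f ns  (~%,.0fx one RAM access)%n",
            name, ns, ns / ramNs)
        }
      }
    }

Run it and a spinning-disk access comes out around 75,000x a RAM access,
which is the gap driving the crossover argument above.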
