Yes, local directories will be sufficient

On Sat, Sep 5, 2015 at 10:44 AM, N B <nb.nos...@gmail.com> wrote:

> Hi TD,
>
> Thanks!
>
> So our application does turn on checkpoints, but we do not recover upon
> application restart (we just blow the checkpoint directory away first and
> re-create the StreamingContext) as we don't have a real need for that type
> of recovery. However, because the application does reduceByKeyAndWindow
> operations, checkpointing has to be turned on. Do you think this scenario
> will also work only with HDFS, or will local directories suffice?
>
> Thanks
> Nikunj
>
> On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> Shuffle spills will use local disk; HDFS is not needed.
>> Spark and Spark Streaming checkpoint info WILL NEED HDFS for
>> fault tolerance, so that state can be recovered even if the Spark
>> cluster nodes go down.
>>
>> TD
>>
>> On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We have a Spark Streaming program that is currently running on a
>>> single node in "local[n]" master mode. We currently give it local
>>> directories for Spark's own state management etc. The input streams in
>>> from the network/Flume and the output also goes out to the
>>> network/Kafka etc., so the process as such does not need any
>>> distributed file system.
>>>
>>> Now we want to start distributing this processing across a few
>>> machines and make a real cluster out of it. However, I am not sure
>>> whether HDFS is a hard requirement for that to happen. I am thinking
>>> about the shuffle spills, DStream/RDD persistence, and checkpoint
>>> info. Do any of these require the state to be shared via HDFS? Are
>>> there alternatives that can be used if state sharing is accomplished
>>> via the file system only?
>>>
>>> Thanks
>>> Nikunj
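For reference, a minimal sketch of the kind of job under discussion, in
Scala with the DStream API. The socket source, batch/window durations, and
directory paths here are illustrative assumptions, not details from the
thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[4]")                        // single-node mode, as in the original setup
          .setAppName("WindowedCounts")
          .set("spark.local.dir", "/tmp/spark-scratch") // shuffle spills go to local disk

        val ssc = new StreamingContext(conf, Seconds(10))

        // reduceByKeyAndWindow with an inverse function requires checkpointing.
        // A local path works on one node; a cluster needs a shared path
        // (e.g. HDFS) for the checkpoints to survive node failures.
        ssc.checkpoint("/tmp/spark-checkpoints")

        val words = ssc.socketTextStream("localhost", 9999)
        val counts = words
          .map(w => (w, 1L))
          .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }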
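And a sketch of the restart pattern Nikunj describes: deleting the
checkpoint directory up front and building a fresh StreamingContext rather
than recovering from the checkpoints. The Hadoop FileSystem calls and the
checkpointDir path are assumptions for illustration:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FreshStart {
      def main(args: Array[String]): Unit = {
        // Hypothetical path; once clustered, this should live on shared
        // storage (e.g. HDFS) so the driver and all executors see the
        // same directory.
        val checkpointDir = "hdfs:///app/checkpoints"

        // Blow away any previous checkpoint state instead of recovering.
        val fs = FileSystem.get(new URI(checkpointDir), new Configuration())
        fs.delete(new Path(checkpointDir), true) // recursive; returns false if absent

        // Build a brand-new StreamingContext rather than using
        // StreamingContext.getOrCreate for recovery.
        val conf = new SparkConf().setAppName("FreshStart")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir) // still required by reduceByKeyAndWindow

        // ... define the DStream pipeline here ...

        ssc.start()
        ssc.awaitTermination()
      }
    }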