Yes, local directories will be sufficient

On Sat, Sep 5, 2015 at 10:44 AM, N B <nb.nos...@gmail.com> wrote:

> Hi TD,
>
> Thanks!
>
> So our application does turn on checkpoints, but we do not recover upon
> application restart (we just blow the checkpoint directory away first and
> re-create the StreamingContext) as we don't have a real need for that type
> of recovery. However, because the application does reduceByKeyAndWindow
> operations, checkpointing has to be turned on. Do you think this scenario
> will also work only with HDFS, or will local directories suffice?
>
> Thanks
> Nikunj
>
> On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> Shuffle spills will use local disk; HDFS is not needed.
>> Spark and Spark Streaming checkpoint info WILL NEED HDFS for
>> fault tolerance, so that state can be recovered even if the Spark
>> cluster nodes go down.
>>
>> TD
>>
>> On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We have a Spark Streaming program that is currently running on a
>>> single node in "local[n]" master mode. We currently give it local
>>> directories for Spark's own state management etc. The input streams in
>>> from the network/Flume and the output also goes out to the
>>> network/Kafka etc., so the process as such does not need any
>>> distributed file system.
>>>
>>> Now we want to start distributing this processing across a few
>>> machines and make a real cluster out of it. However, I am not sure
>>> whether HDFS is a hard requirement for that to happen. I am thinking
>>> about the shuffle spills, DStream/RDD persistence, and checkpoint
>>> info. Do any of these require the state to be shared via HDFS? Are
>>> there alternatives that can be used if state sharing is accomplished
>>> via the file system only?
>>>
>>> Thanks
>>> Nikunj
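For reference, a minimal sketch of the kind of job under discussion, in
Scala with the DStream API. The socket source, batch/window durations, and
directory paths here are illustrative assumptions, not details from the
thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[4]")                        // single-node mode, as in the original setup
          .setAppName("WindowedCounts")
          .set("spark.local.dir", "/tmp/spark-scratch") // shuffle spills go to local disk

        val ssc = new StreamingContext(conf, Seconds(10))

        // reduceByKeyAndWindow with an inverse function requires checkpointing.
        // A local path works on one node; a cluster needs a shared path
        // (e.g. HDFS) for the checkpoints to survive node failures.
        ssc.checkpoint("/tmp/spark-checkpoints")

        val words = ssc.socketTextStream("localhost", 9999)
        val counts = words
          .map(w => (w, 1L))
          .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }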
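And a sketch of the restart pattern Nikunj describes: deleting the
checkpoint directory up front and building a fresh StreamingContext rather
than recovering from the checkpoints. The Hadoop FileSystem calls and the
checkpointDir path are assumptions for illustration:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FreshStart {
      def main(args: Array[String]): Unit = {
        // Hypothetical path; once clustered, this should live on shared
        // storage (e.g. HDFS) so the driver and all executors see the
        // same directory.
        val checkpointDir = "hdfs:///app/checkpoints"

        // Blow away any previous checkpoint state instead of recovering.
        val fs = FileSystem.get(new URI(checkpointDir), new Configuration())
        fs.delete(new Path(checkpointDir), true) // recursive; returns false if absent

        // Build a brand-new StreamingContext rather than using
        // StreamingContext.getOrCreate for recovery.
        val conf = new SparkConf().setAppName("FreshStart")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir) // still required by reduceByKeyAndWindow

        // ... define the DStream pipeline here ...

        ssc.start()
        ssc.awaitTermination()
      }
    }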