Writing to S3 goes over the network, so it will obviously be slower than
local disk. That said, within AWS the network is quite fast. Even so, you
might want to buffer and write to S3 only after a certain threshold of data
is reached, so the writes are efficient. You might also want to use the
DirectOutputCommitter, as it avoids one extra set of writes (the rename
from the temporary output directory) and can be roughly twice as fast.
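To make the threshold idea concrete, here is a minimal sketch of buffering writes and flushing only once a size threshold is reached. The `ThresholdBuffer` class and `flush_fn` parameter are hypothetical names for illustration; in a real job `flush_fn` would be an S3 upload call (e.g. via boto3), injected here so the buffering logic stands on its own:

```python
# Hedged sketch, not a production writer: accumulate bytes locally and
# flush downstream only when a size threshold is crossed. flush_fn is a
# placeholder for the actual S3 upload (e.g. a boto3 put_object call).

class ThresholdBuffer:
    def __init__(self, threshold_bytes, flush_fn):
        self.threshold = threshold_bytes
        self.flush_fn = flush_fn      # called with the joined bytes on flush
        self.chunks = []
        self.size = 0

    def write(self, data: bytes):
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        # Flush whatever has accumulated, even below threshold (e.g. at close).
        if self.chunks:
            self.flush_fn(b"".join(self.chunks))
            self.chunks = []
            self.size = 0
```

A threshold of a few megabytes is a reasonable starting point; S3's multipart upload, for instance, requires parts of at least 5 MB.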

Note that when using S3 your data moves over the public Internet, though it
is still HTTPS. If you don't like that, you should look at using VPC
endpoints.

Regards
Sab
On 24-Feb-2016 6:57 am, "Andy Davidson" <a...@santacruzintegration.com>
wrote:

> Currently our stream apps write results to hdfs. We are running into
> problems with HDFS becoming corrupted and running out of space. It seems
> like a better solution might be to write directly to S3. Is this a good
> idea?
>
> We plan to continue to write our checkpoints to hdfs
>
> Are there any issues to be aware of? Maybe performance or something else
> to watch out for?
>
> This is our first S3 project. Does storage just grow on demand?
>
> Kind regards
>
> Andy
>
>
> P.s. Turns out we are using an old version of hadoop (v 1.0.4)
>
>
>
>
