Re: streaming spark is writing results to S3 a good idea?
Writing to S3 goes over the network, so it will obviously be slower than writing to local disk. That said, within AWS the network is quite fast. Still, you might want to buffer and write to S3 only after a certain data threshold is reached, so that each write is efficient. You might also want to use the DirectOutputCommitter, as it avoids an extra set of writes and is roughly twice as fast. Note that when using S3 your data moves over the public Internet, though the connection is still HTTPS. If you are not comfortable with that, you should look at using VPC endpoints.

Regards,
Sab

On 24-Feb-2016 6:57 am, "Andy Davidson" wrote:
> Currently our stream apps write results to HDFS. We are running into
> problems with HDFS becoming corrupted and running out of space. It seems
> like a better solution might be to write directly to S3. Is this a good
> idea?
>
> We plan to continue to write our checkpoints to HDFS.
>
> Are there any issues to be aware of? Maybe performance or something else
> to watch out for?
>
> This is our first S3 project. Does storage just grow on demand?
>
> Kind regards
>
> Andy
>
> P.S. Turns out we are using an old version of Hadoop (v1.0.4)
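The "write only after a certain threshold" advice above can be sketched as a small buffering helper. This is only an illustration, not anything from Spark itself: the `ThresholdS3Writer` class name, the injected `flush_fn` callback, and the boto3 bucket/key names in the usage comment are all assumptions made up for this example.

```python
# Illustrative sketch: buffer records in memory and flush to S3 only once a
# size threshold is reached, so each S3 PUT carries a reasonably large object
# instead of many tiny ones.
import io


class ThresholdS3Writer:
    """Accumulates records and performs one write per threshold reached."""

    def __init__(self, flush_fn, threshold_bytes=5 * 1024 * 1024):
        self._flush_fn = flush_fn          # e.g. lambda data: s3.put_object(...)
        self._threshold = threshold_bytes  # flush once this many bytes are buffered
        self._buf = io.BytesIO()

    def write(self, record: bytes) -> None:
        self._buf.write(record)
        if self._buf.tell() >= self._threshold:
            self.flush()

    def flush(self) -> None:
        data = self._buf.getvalue()
        if data:
            self._flush_fn(data)           # one network write per batch
            self._buf = io.BytesIO()       # reset the buffer


# Usage with boto3 (not executed here; requires AWS credentials and a real
# bucket -- the bucket and key below are placeholders):
# import boto3
# s3 = boto3.client("s3")
# writer = ThresholdS3Writer(
#     lambda data: s3.put_object(Bucket="my-bucket", Key="results/part-0", Body=data)
# )
```

The flush callback is injected rather than hard-coded so the batching logic can be tested locally without touching S3 at all.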
Re: streaming spark is writing results to S3 a good idea?
And yes, storage grows on demand. No issues with that.

Regards,
Sab
streaming spark is writing results to S3 a good idea?
Currently our stream apps write results to HDFS. We are running into problems with HDFS becoming corrupted and running out of space. It seems like a better solution might be to write directly to S3. Is this a good idea?

We plan to continue to write our checkpoints to HDFS.

Are there any issues to be aware of? Maybe performance or something else to watch out for?

This is our first S3 project. Does storage just grow on demand?

Kind regards

Andy

P.S. Turns out we are using an old version of Hadoop (v1.0.4)