hang correlated to number of shards Re: Checkpointing with Kinesis hangs with socket timeouts when driver is relaunched while transforming on a 0 event batch

2015-11-13 Thread Hster Geguri
Just an update: Kinesis checkpointing works well with both orderly and kill -9 driver shutdowns when there are fewer than 4 shards. We use 20+. I created a case with Amazon support, since it is the AWS Kinesis getRecords API that is hanging. Regards, Heji On Thu, Nov 12, 2015 at 10:37 AM

Re: Checkpointing with Kinesis

2015-09-18 Thread Nick Pentreath
Are you doing actual transformations / aggregation in Spark Streaming? Or just using it to bulk write to S3? If the latter, then you could just use your AWS Lambda function to read directly from the Kinesis stream. If the former, then perhaps either look into the WAL option that Aniket mentioned,

Re: Checkpointing with Kinesis

2015-09-18 Thread Michal Čizmazia
FYI re WAL on S3 http://search-hadoop.com/m/q3RTtFMpd41A7TnH/WAL+S3=WAL+on+S3 On 18 September 2015 at 13:32, Alan Dipert wrote: > Hello, > > Thanks all for considering our problem. We are doing transformations in > Spark Streaming. We have also since learned that WAL to S3

Re: Checkpointing with Kinesis

2015-09-18 Thread Alan Dipert
Hello, Thanks all for considering our problem. We are doing transformations in Spark Streaming. We have also since learned that WAL to S3 on 1.4 is "not reliable" [1]. We are just going to wait for EMR to support 1.5, and hopefully this won't be a problem anymore [2]. Alan 1.

Checkpointing with Kinesis

2015-09-17 Thread Alan Dipert
Hello, We are using Spark Streaming 1.4.1 in AWS EMR to process records from Kinesis. Our Spark program saves RDDs to S3, after which the records are picked up by a Lambda function that loads them into Redshift. That no data is lost during processing is important to us. We have set our Kinesis

Re: Checkpointing with Kinesis

2015-09-17 Thread Aniket Bhatnagar
You can perhaps set up a WAL that logs to S3? The new cluster should pick up the records that weren't processed due to the previous cluster's termination. Thanks, Aniket On Thu, Sep 17, 2015, 9:19 PM Alan Dipert wrote: > Hello, > We are using Spark Streaming 1.4.1 in AWS EMR to process records
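
For reference, the receiver write-ahead log suggested here is switched on through a Spark configuration property. What follows is a minimal sketch, assuming the property is placed in spark-defaults.conf (it can equally be set on the SparkConf in the application). The log files are written under the streaming checkpoint directory, which is set separately in the application with StreamingContext.checkpoint(...). Another reply in this thread reports WAL to S3 on 1.4 as "not reliable", so treat this as an illustration of the option rather than a recommendation.

```
# spark-defaults.conf (sketch; placing it in this file is an assumption)
# Enables the write-ahead log for receiver-based streams such as Kinesis.
spark.streaming.receiver.writeAheadLog.enable  true
```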