Re: spark streaming with kinesis

2016-11-20 Thread Takeshi Yamamuro
"1 userid data" is ambiguous though (user-input data? stream? shard?). Since a Kinesis worker fetches data from the shards it has ownership of, IIUC user-input data in a shard are transferred to the assigned worker as long as you get no failure. // maropu On Mon, Nov 21, 2016 at 1:59 P

Re: spark streaming with kinesis

2016-11-20 Thread Shushant Arora
Hi, thanks. I have a doubt on the Spark Streaming Kinesis consumer. Say I have a batch time of 500 ms and the kinesis stream is partitioned on userid (uniformly distributed). But since IdleTimeBetweenReadsInMillis is set to 1000 ms, Spark receiver nodes will fetch the data at an interval of 1 second and store in

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
It seems it is not a good design to frequently restart workers within a minute, because their initialization and shutdown take much time as you said (e.g., interconnection overheads with DynamoDB and graceful shutdown). Anyway, since this is a kind of question about the AWS Kinesis library, you'd bett

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
1. No, I want to implement a low-level consumer on the Kinesis stream, so I need to stop the worker once it has read the latest sequence number sent by the driver. 2. What is the cost of frequent register and deregister of a worker node? Is it that when the worker's shutdown is called it will terminate the run method but leasec

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not enough for your use case? On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora wrote: > Thanks! > Is there a way to get the latest sequence number of all shards of a > kinesis stream? > > > > On Mon, Nov 14, 2016 at 5:43 PM, Takeshi
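A hedged sketch of how one might pull per-shard sequence numbers with the AWS CLI (the stream and shard names are illustrative, and the commands need valid AWS credentials): `describe-stream` reports each shard's SequenceNumberRange, and a LATEST iterator lets you read whatever arrives next.

```shell
# List shards and their sequence-number ranges (stream name is illustrative).
aws kinesis describe-stream --stream-name my-stream \
  --query 'StreamDescription.Shards[].{Id:ShardId,Range:SequenceNumberRange}'

# Get a LATEST iterator for one shard, then read records from it;
# each returned record carries its SequenceNumber.
ITER=$(aws kinesis get-shard-iterator \
  --stream-name my-stream \
  --shard-id shardId-000000000000 \
  --shard-iterator-type LATEST \
  --query ShardIterator --output text)
aws kinesis get-records --shard-iterator "$ITER"
```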

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
Thanks! Is there a way to get the latest sequence number of all shards of a kinesis stream? On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro wrote: > Hi, > > The time interval can be controlled by `IdleTimeBetweenReadsInMillis` in > KinesisClientLibConfiguration, > though, > it is not configu

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
Hi, the time interval can be controlled by `IdleTimeBetweenReadsInMillis` in KinesisClientLibConfiguration, though it is not configurable in the current implementation. The details can be found at: https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/str
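For reference, a minimal sketch of where that knob lives when building a KCL configuration directly (the application, stream, and worker names are illustrative; the constructor and setter are from the KCL 1.x API):

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

// Application, stream, and worker names are illustrative.
val kclConf = new KinesisClientLibConfiguration(
    "my-app",     // KCL application name (also names the DynamoDB lease table)
    "my-stream",  // Kinesis stream name
    new DefaultAWSCredentialsProviderChain(),
    "worker-1")   // unique worker id
  .withIdleTimeBetweenReadsInMillis(250) // default is 1000 ms; Spark's receiver does not expose this
```

Spark's KinesisReceiver builds this configuration internally, which is why the idle time is not tunable from the Spark side.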

spark streaming with kinesis

2016-11-12 Thread Shushant Arora
Hi, is spark.streaming.blockInterval for a Kinesis input stream hardcoded to 1 sec, or is it configurable? That is the time interval at which the receiver fetches data from Kinesis. It means the stream batch interval cannot be less than spark.streaming.blockInterval, and this should be configurable. Also is the

Re: spark streaming with kinesis

2016-11-07 Thread Takeshi Yamamuro
eaming? > > Is there any limitation on interval checkpoint - minimum of 1second in > spark streaming with kinesis. But as such there is no limit on checkpoint > interval in KCL side ? > > Thanks > > On Tue, Oct 25, 2016 at 8:36 AM, Takeshi Yamamuro > wrote: > >> I&

Re: spark streaming with kinesis

2016-11-06 Thread Shushant Arora
Hi, by receiver I meant the Spark Streaming receiver architecture - meaning worker nodes are different from receiver nodes. Is there no direct consumer/low-level consumer like Kafka's in Kinesis Spark Streaming? Is there any limitation on checkpoint interval - minimum of 1 second in spark streaming

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
I'm not exactly sure about the receiver you pointed to, though; if you mean the "KinesisReceiver" implementation, yes. Also, we currently cannot disable the interval checkpoints. On Tue, Oct 25, 2016 at 11:53 AM, Shushant Arora wrote: > Thanks! > > Is kinesis streams are receiver based only? Is th

Re: spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Thanks! Are Kinesis streams receiver-based only? Is there a non-receiver-based consumer for Kinesis? And instead of having a fixed checkpoint interval, can I disable auto checkpoint and, say, when my worker has processed the data after the last record of mapPartition, then checkpoint the sequence no usin

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
Hi, the only thing you can do for Kinesis checkpoints is tune their interval. https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisUtils.scala#L68 Whether data loss occurs or not depends on the storage level you set;
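The two knobs mentioned here, the checkpoint interval and the storage level, both appear in the createStream call; a minimal sketch against the Spark 1.x Kinesis ASL API (app name, stream name, endpoint, and region are illustrative):

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val conf = new SparkConf().setAppName("kinesis-checkpoint-demo")
val ssc = new StreamingContext(conf, Seconds(1))

val stream = KinesisUtils.createStream(
  ssc,
  "my-app",                          // KCL application name (DynamoDB lease table), illustrative
  "my-stream",                       // Kinesis stream name, illustrative
  "kinesis.us-east-1.amazonaws.com", // endpoint, illustrative
  "us-east-1",                       // region, illustrative
  InitialPositionInStream.LATEST,
  Seconds(10),                       // Kinesis checkpoint interval: tunable, but cannot be disabled
  StorageLevel.MEMORY_AND_DISK_2)    // replicated storage level guards against losing a receiver
```

With a replicated storage level such as MEMORY_AND_DISK_2, blocks received but not yet processed survive the loss of a single receiver node.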

spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Does the Spark Streaming consumer for Kinesis use the Kinesis Client Library and mandate checkpointing the sequence numbers of shards in DynamoDB? Will it lead to data loss if consumed data records are not yet processed, and Kinesis has checkpointed the consumed sequence numbers in DynamoDB, and the Spark worker

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
Yes, it's against master: https://github.com/apache/spark/pull/10256 I'll push the KCL version bump after my local tests finish. On Fri, Dec 11, 2015 at 10:42 AM Nick Pentreath wrote: > Is that PR against master branch? > > S3 read comes from Hadoop / jet3t afaik > > — > Sent from Mailbox

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against the master branch? S3 read comes from Hadoop / JetS3t afaik — Sent from Mailbox On Fri, Dec 11, 2015 at 5:38 PM, Brian London wrote: > That's good news I've got a PR in to up the SDK version to 1.10.40 and the > KCL to 1.6.1 which I'm running tests on locally now. > Is the

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
That's good news. I've got a PR in to bump the SDK version to 1.10.40 and the KCL to 1.6.1, which I'm running tests on locally now. Is the AWS SDK not used for reading/writing from S3, or do we get that for free from the Hadoop dependencies? On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath wrote: > c

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list. Ok, looks like when the KCL version was updated in https://github.com/apache/spark/pull/8957, the AWS SDK version was not, probably leading to a dependency conflict, though as Burak mentions it's hard to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally and on my

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yeah also the integration tests need to be specifically run - I would have thought the contributor would have run those tests and also tested the change themselves using live Kinesis :( — Sent from Mailbox On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz wrote: > I don't think the Kinesis tests

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I don't think the Kinesis tests specifically ran when that was merged into 1.5.2 :( https://github.com/apache/spark/pull/8957 https://github.com/apache/spark/commit/883bd8fccf83aae7a2a847c9a6ca129fac86e6a3 AFAIK pom changes don't trigger the Kinesis tests. Burak On Thu, Dec 10, 2015 at 8:09 PM,

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yup, also works for me on the master branch, as I've been testing DynamoDB Streams integration. In fact it works with the latest KCL 1.6.1 also, which I was using. So the KCL version does seem like it could be the issue - somewhere along the line an exception must be getting swallowed. Though the tests shou

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Yes, it worked in the 1.6 branch as of commit db5165246f2888537dd0f3d4c5a515875c7358ed. That makes it much less serious of an issue, although it would be nice to know what the root cause is to avoid a regression. On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz wrote: > I've noticed this happening w

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I've noticed this happening when there was some dependency conflicts, and it is super hard to debug. It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0, but it is 1.2.1 in Spark 1.5.1. I feel like that seems to be the problem... Brian, did you verify that it works with the 1.6.

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Nick's symptoms sound identical to mine. I should mention that I just pulled the latest version from GitHub and it seems to be working there. To reproduce: 1. Download Spark 1.5.2 from http://spark.apache.org/downloads.html 2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Jean-Baptiste Onofré
Hi Nick, Just to be sure: don't you see some ClassCastException in the log ? Thanks, Regards JB On 12/10/2015 07:56 PM, Nick Pentreath wrote: Could you provide an example / test case and more detail on what issue you're facing? I've just tested a simple program reading from a dev Kinesis stre

Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Has anyone managed to run the Kinesis demo in Spark 1.5.2? The Kinesis ASL that ships with 1.5.2 appears to not work for me although 1.5.1 is fine. I spent some time with Amazon earlier in the week and the only thing we could do to make it work is to change the version to 1.5.1. Can someone pleas

Re: Having problem with Spark streaming with Kinesis

2014-12-19 Thread Ashrafuzzaman
Thanks Aniket , clears a lot of confusion. 😄 On Dec 14, 2014 7:11 PM, "Aniket Bhatnagar" wrote: > The reason is because of the following code: > > val numStreams = numShards > val kinesisStreams = (0 until numStreams).map { i => > KinesisUtils.createStream(ssc, streamName, endpointUrl, > kinesi

Re: Having problem with Spark streaming with Kinesis

2014-12-14 Thread Aniket Bhatnagar
The reason is because of the following code: val numStreams = numShards val kinesisStreams = (0 until numStreams).map { i => KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval, InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2) } In the above co
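The pattern described in the snippet above, one receiver per shard followed by a union, can be completed into a self-contained sketch like this (the app name, stream name, endpoint, checkpoint interval, and shard count are illustrative; the signature is the 2014-era Kinesis ASL `createStream`):

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("kinesis-union-demo"), Seconds(2))

val streamName = "my-stream"                                // illustrative
val endpointUrl = "https://kinesis.us-east-1.amazonaws.com" // illustrative
val kinesisCheckpointInterval = Seconds(10)                 // illustrative
val numShards = 3                                           // illustrative

// One receiver per shard, as in the snippet above.
val kinesisStreams = (0 until numShards).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl,
    kinesisCheckpointInterval, InitialPositionInStream.LATEST,
    StorageLevel.MEMORY_AND_DISK_2)
}

// Union the per-shard DStreams into one before processing.
val unified = ssc.union(kinesisStreams)
```

Each createStream call occupies one executor core for its receiver, which is why the cluster needs at least shards + 1 cores overall.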

Re: Having problem with Spark streaming with Kinesis

2014-12-13 Thread A.K.M. Ashrafuzzaman
Thanks Aniket. The trick is to have #workers >= #shards + 1, but I don't know why that is. http://spark.apache.org/docs/latest/streaming-kinesis-integration.html Here in the figure [spark streaming kinesis architecture], it seems like one node should be able to take on more than one shard.

Re: Having problem with Spark streaming with Kinesis

2014-12-03 Thread A.K.M. Ashrafuzzaman
Guys, on my local machine it consumes a Kinesis stream with 3 shards, but on EC2 it does not consume from the stream. Later we found that the EC2 machine had 2 cores while my local machine had 4 cores. I am using a single machine in Spark standalone mode. And we got a larger machine f

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
Did you set the Spark master as local[*]? If so, then it means that the number of executors is equal to the number of cores of the machine. Perhaps your Mac has more cores (certainly more than the number of Kinesis shards + 1). Try explicitly setting the master as local[N] where N is number of kinesis shards +
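In other words, each Kinesis receiver pins one core, so N must leave at least one core free for processing the received batches; a minimal sketch (the shard count is illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val numShards = 3 // illustrative
// One core per Kinesis receiver (one receiver per shard) plus at least
// one core left over to process the received batches.
val conf = new SparkConf()
  .setAppName("kinesis-local-demo")
  .setMaster(s"local[${numShards + 1}]")
val ssc = new StreamingContext(conf, Seconds(1))
```

On a 2-core EC2 box with 3 shards, local[*] gives only 2 slots: the receivers starve and no batch ever gets processing capacity, which matches the symptoms described above.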

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Ashrafuzzaman
I was trying on one machine with just sbt run, and it is working in my Mac environment with the same configuration. I used the sample code from https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala val kines

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
What's your cluster size? For streaming to work, it needs shards + 1 executors. On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman < ashrafuzzaman...@gmail.com> wrote: > Hi guys, > When we are using Kinesis with 1 shard then it works fine. But when we use > more than 1 then it falls into an infini

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Akhil Das
I have it working without any issues (tried with 5 shards), except my Java version was 1.7. Here's the piece of code that I used: System.setProperty("AWS_ACCESS_KEY_ID", this.kConf.getOrElse("access_key", "")) System.setProperty("AWS_SECRET_KEY", this.kConf.getOrElse("secret", "")) val

Having problem with Spark streaming with Kinesis

2014-11-26 Thread A.K.M. Ashrafuzzaman
Hi guys, when we are using Kinesis with 1 shard then it works fine, but when we use more than 1, it falls into an infinite loop and no data is processed by Spark Streaming. In the Kinesis DynamoDB table, I can see that it keeps increasing the leaseCounter, but it does not start processing. I am us

Re: Spark Streaming with Kinesis

2014-10-29 Thread Matt Chu
I haven't tried this myself yet, but this sounds relevant: https://github.com/apache/spark/pull/2535 Will be giving this a try today or so, will report back. On Wednesday, October 29, 2014, Harold Nguyen wrote: > Hi again, > > After getting through several dependencies, I finally got to this >

Re: Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi again, after getting through several dependencies, I finally got to this non-dependency type error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.&lt;init&gt;(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)

Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi all, I followed the guide here: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html But got this error: Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider Would you happen to know what dependency or jar is needed? Harold