Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-13 Thread Cody Koeninger
titions) are being handled by the executors at any given time? > > Thanks, > > Ivan > > On Sat, Nov 12, 2016 at 1:25 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> You should not be getting consumer churn on executors at all, that's >> the whole

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
s and stages are all successful and even the speculation > is turned off . > > On Sat, Nov 12, 2016 at 9:55 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Are you certain you aren't getting any failed tasks or other errors? >> Output actions like foreach aren't ex

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-12 Thread Cody Koeninger
c and a consumer >> group that is made from the topic (each RDD uses a different topic and >> group), process the data and write to Parquet files. >> >> Per my Nov 10th post, we still get polling timeouts unless the poll.ms is >> set to something like 10 seconds. We also

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
Are you certain you aren't getting any failed tasks or other errors? Output actions like foreach aren't exactly once and will be retried on failures. On Nov 12, 2016 06:36, "dev loper" wrote: > Dear fellow Spark Users, > > My Spark Streaming application (Spark 2.0 , on AWS

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-11 Thread Cody Koeninger
e() >> } >> >> I am not sure why the concurrent issue is there as I have tried to debug >> and also looked at the KafkaConsumer code as well, but everything looks >> like it should not occur. The things to figure out is why when running in >> parallel do

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread Cody Koeninger
The basic structured streaming source for Kafka is already committed to master, build it and try it out. If you're already using Kafka I don't really see much point in trying to put Akka in between it and Spark. On Nov 10, 2016 02:25, "vincent gromakowski" wrote:

Re: Kafka stream offset management question

2016-11-08 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html specifically http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets Have you set enable.auto.commit to false? The new consumer stores offsets in kafka, so the idea of specifically
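
For reference, a minimal sketch of the offset-storage pattern described in that doc section: enable.auto.commit is set to false and offsets are committed back to Kafka only after the batch's output succeeds. Broker, topic and group names are placeholders, and ssc is an existing StreamingContext.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    // Placeholder settings; the important one is enable.auto.commit = false,
    // so the consumer never commits on its own schedule.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // offset ranges for this batch, captured before any other work
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreach(record => ())  // your output action here

      // commit to Kafka only after the batch's output has succeeded
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }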

Re: Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-07 Thread Cody Koeninger
I may be misunderstanding, but you need to take each kafka message, and turn it into multiple items in the transformed rdd? so something like (pseudocode): stream.flatMap { message => val items = new ArrayBuffer var parser = null message.split("\n").foreach { line => if // it's a
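
A rough, self-contained version of that pseudocode. The header detection and the two formats are invented placeholders; swap in whatever actually marks a format change in the real stream.

    import scala.collection.mutable.ArrayBuffer

    case class Item(fields: Array[String])   // placeholder record type

    // Pick a parser based on the header line; both formats here are made up.
    def parserFor(header: String): String => Item =
      if (header.startsWith("FORMAT_A")) (line: String) => Item(line.split(","))
      else                               (line: String) => Item(line.split("\t"))

    // One Kafka message contains a header line followed by data lines in that format.
    def parseMessage(message: String): Seq[Item] = {
      val items = new ArrayBuffer[Item]
      var parse: String => Item = null
      message.split("\n").foreach { line =>
        if (line.startsWith("FORMAT_")) parse = parserFor(line)  // header line: switch parsers
        else if (parse != null)         items += parse(line)     // data line: use current parser
      }
      items
    }

    // stream is a DStream[String] of raw message values
    // val transformed = stream.flatMap(parseMessage)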

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Cody Koeninger
ed to set poll.ms to 30 seconds and still get the issue. > Something is not right here and just not seem right. As I mentioned with the > streaming application, with Spark 1.6 and Kafka 0.8.x we never saw this > issue. We have been running the same basic logic for over a year now without >

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Cody Koeninger
ce > spark.task.maxFailures is reached? Has anyone else seen this behavior of a > streaming application skipping over failed microbatches? > > Thanks, > Sean > > >> On Nov 4, 2016, at 2:48 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> So basicall

Re: expected behavior of Kafka dynamic topic subscription

2016-11-07 Thread Cody Koeninger
;hw...@qilinsoft.com> wrote: > Cody, thanks for the response. Do you think it's a Spark issue or Kafka > issue? Can you please let me know the jira ticket number? > > -Original Message- > From: Cody Koeninger [mailto:c...@koeninger.org] > Sent: 2016年11月4日 22:35 >

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
e since Spark now requires one to assign or subscribe to a topic in > order to even update the offsets. In 0.8.2.x you did not have to worry about > that. This approach limits your exposure to duplicate data since idempotent > records are not entirely possible in our scenario. At least

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
So just to be clear, the answers to my questions are:
- you are not using different group ids, you're using the same group id everywhere
- you are committing offsets manually
Right? If you want to eliminate network or kafka misbehavior as a source, tune poll.ms upwards even higher. You must

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
- are you using different group ids for the different streams?
- are you manually committing offsets?
- what are the values of your kafka-related settings?
On Fri, Nov 4, 2016 at 12:20 PM, vonnagy wrote: > I am getting the issues using Spark 2.0.1 and Kafka 0.10. I have two jobs,

Re: expected behavior of Kafka dynamic topic subscription

2016-11-04 Thread Cody Koeninger
That's not what I would expect from the underlying kafka consumer, no. But this particular case (no matching topics, then add a topic after SubscribePattern stream starts) actually isn't part of unit tests for either the DStream or the structured stream. I'll make a jira ticket. On Thu, Nov 3,
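
For anyone hitting the same thing, this is the pattern-subscription API in question (0-10 integration). kafkaParams and the regex are placeholders, and ssc is an existing StreamingContext.

    import java.util.regex.Pattern
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "pattern-group",
      "auto.offset.reset" -> "earliest"
    )

    // Subscribes to every topic matching the regex; topics created later that
    // match should be picked up as the consumer's metadata refreshes.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](
        Pattern.compile("events\\..*"), kafkaParams))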

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
t time windowed > aggregation for several weeks now. > > On Tue, Nov 1, 2016 at 6:18 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Look at the resolved subtasks attached to that ticket you linked. >> Some of them are unresolved, but basic functionality is there. >>

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
Look at the resolved subtasks attached to that ticket you linked. Some of them are unresolved, but basic functionality is there. On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande wrote: > Hi Michael, > > Thanks for the reply. > > The following link says there is a open

Re: MapWithState partitioning

2016-10-31 Thread Cody Koeninger
gt; > > The reason I ask is that it simply looks strange to me that Spark will have > to shuffle each time my input stream and "state" stream during the > mapWithState operation when I now for sure that those two streams will > always share same keys and will not need ac

Re: MapWithState partitioning

2016-10-31 Thread Cody Koeninger
If you call a transformation on an rdd using the same partitioner as that rdd, no shuffle will occur. KafkaRDD doesn't have a partitioner, there's no consistent partitioning scheme that works for all kafka uses. You can wrap each kafkardd with an rdd that has a custom partitioner that you write
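
A sketch of the kind of wrapper being described: it asserts a partitioner over an existing pair RDD without moving any data. This is only safe if the keys really are laid out the way that partitioner claims (for example, because the producer partitioned by key with the same scheme); Spark will skip shuffles based on it, so a wrong assertion means wrong results.

    import org.apache.spark.{Partition, Partitioner, TaskContext}
    import org.apache.spark.rdd.RDD

    class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)
      extends RDD[(K, V)](prev) {

      require(part.numPartitions == prev.getNumPartitions,
        "partitioner must match the existing number of partitions")

      // Claiming this partitioner is what lets a later reduceByKey / join with the
      // same partitioner avoid a shuffle.
      override val partitioner: Option[Partitioner] = Some(part)

      override protected def getPartitions: Array[Partition] = prev.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
        prev.iterator(split, context)
    }

Used inside stream.transform, a later reduceByKey with that same partitioner on the wrapped RDD has the chance to run without a shuffle.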

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Cody Koeninger
to operationalize > topic creation. Not a big deal but now we'll have to make sure we execute > the 'create-topics' type of task or shell script at install time. > > This seems like a Kafka doc issue potentially, to explain what exactly one > can expect from the auto.create.topics.en

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Cody Koeninger
application is running and the files are created right. But as soon as I >>> restart the application, it goes back to fromOffset as 0. Any thoughts? >>> >>> regards >>> Sunita >>> >>> On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind <sunitarv...@gmail.c

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Cody Koeninger
"Done reading offsets from ZooKeeper. Took " + >> (System.currentTimeMillis() - start)) >> >> Some(offsets) >> case None => >> LogHandler.log.info("No offsets found in ZooKeeper. Took " + >> (System.currentTimeMillis() - start

Re: Spark streaming communication with different versions of kafka

2016-10-25 Thread Cody Koeninger
Kafka consumers should be backwards compatible with kafka brokers, so at the very least you should be able to use the streaming-spark-kafka-0-10 to do what you're talking about. On Tue, Oct 25, 2016 at 4:30 AM, Prabhu GS wrote: > Hi, > > I would like to know if the same

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-21 Thread Cody Koeninger
al.ms can be set differently. > I'll leave it to you on how to add this to docs! > > > On Thu, Oct 20, 2016 at 1:41 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Right on, I put in a PR to make a note of that in the docs. >> >> On Thu, Oct 20, 2016 at 1

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Cody Koeninger
gain, I know what I have to do :) > > On Fri, Oct 21, 2016 at 5:05 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> 0. If your processing time is regularly greater than your batch >> interval you're going to have problems anyway. Investigate this more, >> set m

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Cody Koeninger
0. If your processing time is regularly greater than your batch interval you're going to have problems anyway. Investigate this more, set maxRatePerPartition, something.
1. That's personally what I tend to do.
2. Why are you relying on checkpoints if you're storing offset state in the database?
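
On point 2, the usual shape of "offsets in the database" (and the reason checkpoints become unnecessary) is to write results and the batch's offset ranges in the same transaction. A rough sketch, assuming an existing direct stream of ConsumerRecord[String, String] named stream, a Postgres 9.5+ database, and placeholder table and column names.

    import java.sql.DriverManager
    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { iter =>
        // for a KafkaRDD, the spark partition index lines up with the offset range index
        val osr = offsetRanges(TaskContext.get.partitionId)
        val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/app")  // placeholder
        conn.setAutoCommit(false)
        try {
          val insert = conn.prepareStatement("insert into results(value) values (?)")
          iter.foreach { record =>
            insert.setString(1, record.value())
            insert.executeUpdate()
          }

          val offsets = conn.prepareStatement(
            "insert into kafka_offsets(topic, kafka_partition, until_offset) values (?, ?, ?) " +
            "on conflict (topic, kafka_partition) do update set until_offset = excluded.until_offset")
          offsets.setString(1, osr.topic)
          offsets.setInt(2, osr.partition)
          offsets.setLong(3, osr.untilOffset)
          offsets.executeUpdate()

          conn.commit()  // results and offsets succeed or fail together
        } finally {
          conn.close()
        }
      }
    }

On restart, read the stored untilOffset per topic/partition back out of the table and use it as the starting offsets (e.g. via ConsumerStrategies.Assign) instead of relying on a checkpoint.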

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-20 Thread Cody Koeninger
Right on, I put in a PR to make a note of that in the docs. On Thu, Oct 20, 2016 at 12:13 PM, Srikanth <srikanth...@gmail.com> wrote: > Yeah, setting those params helped. > > On Wed, Oct 19, 2016 at 1:32 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >>

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-19 Thread Cody Koeninger
Iterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$21.next(Iterator.scala:838) > > > On Wed, Sep 7, 2016 at 3:55 PM, Cod

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-14 Thread Cody Koeninger
nge[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges(); > CanCommitOffsets cco = (CanCommitOffsets) directKafkaStream.dstream(); > cco.commitAsync(offsets); > > I tried setting "max.poll.records" to 1000 but this did not help. > > Any idea what could be wrong? >

Re: Kafka integration: get existing Kafka messages?

2016-10-14 Thread Cody Koeninger
utor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-null mytopic2 0 2 after polling for 512 > == > > -Original Message- > From: Cody Koeninger [mailto:c...@koeninger.org] > Sent: 201

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-14 Thread Cody Koeninger
I can't be sure, no. On Fri, Oct 14, 2016 at 3:06 AM, Julian Keppel <juliankeppel1...@gmail.com> wrote: > Okay, thank you! Can you say, when this feature will be released? > > 2016-10-13 16:29 GMT+02:00 Cody Koeninger <c...@koeninger.org>: >> >> As Sean said, it'

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Cody Koeninger
fka.maxRatePerPartition", "10") > conf.set("spark.streaming.backpressure.enabled", "true") > > That's not normal, is it? Do you notice anything odd in my logs? > > Thanks a lot. > > > > On 10/12/2016 07:31 PM, Cody Koeninger w

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Cody Koeninger
As Sean said, it's unreleased. If you want to try it out, build spark http://spark.apache.org/docs/latest/building-spark.html The easiest way to include the jar is probably to use mvn install to put it in your local repository, then link it in your application's mvn or sbt build file as
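
For the sbt case, something along these lines after mvn install has put the locally built artifacts in ~/.m2. The SNAPSHOT version below is whatever your build of master produced (e.g. 2.1.0-SNAPSHOT at the time); adjust to match.

    // build.sbt
    resolvers += Resolver.mavenLocal  // so sbt sees artifacts that `mvn install` put in ~/.m2

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"            % "2.1.0-SNAPSHOT" % "provided",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0-SNAPSHOT"
    )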

Re: Kafka integration: get existing Kafka messages?

2016-10-12 Thread Cody Koeninger
response. > > > > So Kafka direct stream actually has consumer on both the driver and > executor? Can you please provide more details? Thank you very much! > > > > ________ > > From: Cody Koeninger [mailto:c...@koeninger.org] > Sent:

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
rg/e730492453.png notice the cutover point On Wed, Oct 12, 2016 at 11:00 AM, Samy Dindane <s...@dindane.com> wrote: > I am 100% sure. > > println(conf.get("spark.streaming.backpressure.enabled")) prints true. > > > On 10/12/2016 05:48 PM, Cody Koeninger w

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
> Do you have any idea about why backpressure isn't working? How to debug > this? > > > On 10/11/2016 06:08 PM, Cody Koeninger wrote: >> >> http://spark.apache.org/docs/latest/configuration.html >> >> "This rate is upper bounded by the values >>

Re: What happens when an executor crashes?

2016-10-12 Thread Cody Koeninger
, are you talking about > repartition using partitionBy()? > > > On 10/11/2016 01:23 AM, Cody Koeninger wrote: >> >> Repartition almost always involves a shuffle. >> >> Let me see if I can explain the recovery stuff... >> >> Say you start with two kafka part

Re: Kafka integration: get existing Kafka messages?

2016-10-12 Thread Cody Koeninger
it's set to none for the executors, because otherwise they won't do exactly what the driver told them to do. You should be able to set up the driver consumer to determine batches however you want, though. On Wednesday, October 12, 2016, Haopu Wang wrote: > Hi, > > > > I want

Re: Limit Kafka batches size with Spark Streaming

2016-10-11 Thread Cody Koeninger
http://spark.apache.org/docs/latest/configuration.html "This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below)." On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane wrote: > Hi, > > Is it
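
In other words, backpressure only scales the rate down from a ceiling you set yourself; without maxRatePerPartition the first batch after a restart can still try to read everything available. Example values only:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // ceiling that backpressure adjusts under, in records/sec/partition
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // only relevant to receiver-based streams
      .set("spark.streaming.receiver.maxRate", "10000")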

Re: Can mapWithState state func be called every batchInterval?

2016-10-11 Thread Cody Koeninger
involving mapWithState > of course :) I'm just wondering why it doesn't support this use case yet. > > On Tue, Oct 11, 2016 at 3:41 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> They're telling you not to use the old function because it's linear on the >> total

Re: Can mapWithState state func be called every batchInterval?

2016-10-11 Thread Cody Koeninger
They're telling you not to use the old function because it's linear on the total number of keys, not keys in the batch, so it's slow. But if that's what you really want, go ahead and do it, and see if it performs well enough. On Oct 11, 2016 6:28 AM, "DandyDev" wrote: Hi
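
For comparison, the newer API being discussed looks like this; the mapping function runs only for keys present in the incoming batch, which is why it avoids updateStateByKey's cost of touching every key every interval. pairs is assumed to be an existing DStream[(String, Int)].

    import org.apache.spark.streaming.{State, StateSpec}

    // running count per key, updated only when the key shows up in a batch
    val updateCount = (key: String, value: Option[Int], state: State[Long]) => {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }

    val counts = pairs.mapWithState(StateSpec.function(updateCount))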

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Cody Koeninger
it works. Thanks! Although changing this is ok for us, i am interested in the why :-) With the old connector this was not a problem nor is it afaik with the pure kafka consumer api 2016-10-11 14:30 GMT+02:00 Cody Koeninger <c...@koeninger.org>: > Just out of curiosity, have you tried

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Cody Koeninger
c: String): >> InputDStream[ConsumerRecord[String, Bytes]] = { >> KafkaUtils.createDirectStream[String, Bytes](ssc, >> LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, >> Bytes](Set(topic), kafkaParams)) >> } >> >>

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-11 Thread Cody Koeninger
ync(offsets); > > java.lang.ClassCastException: > org.apache.spark.streaming.api.java.JavaInputDStream > cannot be cast to org.apache.spark.streaming.kafka010.CanCommitOffsets > at SparkTest.lambda$0(SparkTest.java:103) > > Best regards, > Max > > > 2016-10-10 20:18 GMT+02:00 Cody Koeninger <c...@koeninger.org>:

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
, starting at topic-0 offsets 60, topic-1 offsets 66 Clear as mud? On Mon, Oct 10, 2016 at 5:36 PM, Samy Dindane <s...@dindane.com> wrote: > > > On 10/10/2016 8:14 PM, Cody Koeninger wrote: >> >> Glad it was helpful :) >> >> As far as executors, my e

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread Cody Koeninger
This should give you hints on the necessary cast: http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#tab_java_2 The main ugly thing there is that the java rdd is wrapping the scala rdd, so you need to unwrap one layer via rdd.rdd() If anyone wants to work on a PR to update

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartition.scala >> I am trying to understand how this would work if an executor crashes, so I >> tried making one crash manually, but I noticed it kills the whole job >> instead of creating another executor to resume the task.

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
What is it you're actually trying to accomplish? On Mon, Oct 10, 2016 at 5:26 AM, Samy Dindane wrote: > I managed to make a specific executor crash by using > TaskContext.get.partitionId and throwing an exception for a specific > executor. > > The issue I have now is that the

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> Spark standalone is not Yarn… or secure for that matter… ;-) > >> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Spark streaming helps with aggregation because >> >> A. raw kafka consumers have no built in framework

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
t, query the >> respective table in Cassandra / Postgres. (select .. from data where user = >> ? and date between and and some_field = ?) >> >> How will Spark Streaming help w/ aggregation? Couldn't the data be queried >> from Cassandra / Postgres via the Kafka cons

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because A. raw kafka consumers have no built in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here, I don't have experience), and B. it deals with batches, so you can transactionally

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
a should not be lost. The system should be as fault tolerant as >> possible. >> >> What's the advantage of using Spark for reading Kafka instead of direct >> Kafka consumers? >> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> >> wrote

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
otherwise updates will be > idempotent but not inserts. > > Data should not be lost. The system should be as fault tolerant as possible. > > What's the advantage of using Spark for reading Kafka instead of direct > Kafka consumers? > > On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark direct stream is just fine for this use case. > But why postgres and not cassandra? > Is there anything specific here that i may not be aware? > > Thanks > Deepak > > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> How are y

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle etl failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream feeding postgres. On Thu, Sep 29, 2016 at 10:04

Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-28 Thread Cody Koeninger
uest-7, > sapxm.adserving.log.ad_request-9] for group spark_aggregation_job-kafka010 > 16/09/28 08:16:48 INFO ConsumerCoordinator: Revoking previously assigned > partitions [sapxm.adserving.log.view-3, sapxm.adserving.log.view-4, > sapxm.adserving.log.view-1, sapxm.adserving.log.view-2, >

Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-27 Thread Cody Koeninger
What's the actual stacktrace / exception you're getting related to commit failure? On Tue, Sep 27, 2016 at 9:37 AM, Matthias Niehoff wrote: > Hi everybody, > > i am using the new Kafka Receiver for Spark Streaming for my Job. When > running with old consumer it

Re: Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Cody Koeninger
Do you have a minimal example of how to reproduce the problem, that doesn't depend on Cassandra? On Mon, Sep 26, 2016 at 4:10 PM, Erwan ALLAIN wrote: > Hi > > I'm working with > - Kafka 0.8.2 > - Spark Streaming (2.0) direct input stream. > - cassandra 3.0 > > My batch

Re: Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Cody Koeninger
Either artifact should work with 0.10 brokers. The 0.10 integration has more features but is still marked experimental. On Sep 26, 2016 3:41 AM, "Haopu Wang" wrote: > Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is > compatible with Kafka 0.8.2.1."

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Cody Koeninger
Spark alone isn't going to solve this problem, because you have no reliable way of making sure a given worker has a consistent shard of the messages seen so far, especially if there's an arbitrary amount of delay between duplicate messages. You need a DHT or something equivalent. On Sep 24, 2016

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread Cody Koeninger
ct Approach (No >> receivers) in Spark streaming. >> >> I am not sure if I have that leverage to upgrade at this point, but do you >> know if Spark 1.6.1 to Spark 2.0 jump is smooth usually or does it involve >> lot of hick-ups. >> Also is there a migratio

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread Cody Koeninger
Do you have the ability to try using Spark 2.0 with the streaming-kafka-0-10 connector? I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it would be good to rule that out. On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual . wrote: > Hello, > > I am trying to

Re: Continuous warning while consuming using new kafka-spark010 API

2016-09-20 Thread Cody Koeninger
-dev +user That warning pretty much means what it says - the consumer tried to get records for the given partition / offset, and couldn't do so after polling the kafka broker for X amount of time. If that only happens when you put additional load on Kafka via producing, the first thing I'd do is

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Cody Koeninger
reduceByKey. Does this work on _all_ elements > in the stream, as they're coming in, or is this a transformation per > RDD/micro batch? I assume the latter, otherwise it would be more akin to > updateStateByKey, right? > > On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger

Re: Spark kafka integration issues

2016-09-14 Thread Cody Koeninger
for the details, > much appreciated. > > http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ > > On Tue, Sep 13, 2016 at 8:14 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> 1. see >> http://spark.apache.org/docs/lat

Re: KafkaUtils.createDirectStream() with kafka topic expanded

2016-09-13 Thread Cody Koeninger
That version of createDirectStream doesn't handle partition changes. You can work around it by starting the job again. The spark 2.0 consumer for kafka 0.10 should handle partition changes via SubscribePattern. On Tue, Sep 13, 2016 at 7:13 PM, vinay gupta wrote: >

Re: Spark kafka integration issues

2016-09-13 Thread Cody Koeninger
1. see http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers look for HasOffsetRange. If you really want the info per-message rather than per-partition, createRDD has an overload that takes a messageHandler from MessageAndMetadata to
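
For the per-message case, the messageHandler hook looks roughly like this on the 0.8 direct stream (the createRDD overload mentioned above takes the same kind of handler). kafkaParams, topic and starting offsets are placeholders, and ssc is an existing StreamingContext.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")           // placeholder
    val fromOffsets = Map(TopicAndPartition("example-topic", 0) -> 0L)        // placeholder starting offsets

    // R is a tuple carrying the metadata you care about along with the payload
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
        (String, Int, Long, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.partition, mmd.offset, mmd.message()))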

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
han 1 partition based on size? Or is it something that the "source" > (SocketStream, KafkaStream etc.) decides? > > On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> A micro batch is an RDD. >> >> An RDD has partitions, s

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
A micro batch is an RDD. An RDD has partitions, so different executors can work on different partitions concurrently. Don't think of that as multiple micro-batches within a time slot. It's one RDD within a time slot, with multiple partitions. On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie
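
Put concretely (stream is any existing DStream):

    stream.foreachRDD { (rdd, time) =>
      // one RDD per batch interval; its partitions are the units of parallelism
      println(s"batch at $time: one RDD with ${rdd.getNumPartitions} partitions")
      rdd.foreachPartition { part =>
        part.foreach(record => ())  // one task per partition, run concurrently across executors
      }
    }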

Re: Why is Spark getting Kafka data out from port 2181 ?

2016-09-10 Thread Cody Koeninger
Are you using the receiver based stream? On Sep 10, 2016 15:45, "Eric Ho" wrote: > I notice that some Spark programs would contact something like 'zoo1:2181' > when trying to suck data out of Kafka. > > Does the kafka data actually transported out over this port ? > >

Re: spark streaming kafka connector questions

2016-09-10 Thread Cody Koeninger
Hard to say without seeing the code, but if you do multiple actions on an Rdd without caching, the Rdd will be computed multiple times. On Sep 10, 2016 2:43 AM, "Cheng Yi" wrote: After some investigation, the problem i see is liked caused by a filter and union of the

Re: Streaming Backpressure with Multiple Streams

2016-09-09 Thread Cody Koeninger
Does the same thing happen if you're only using direct stream plus back pressure, not the receiver stream? On Sep 9, 2016 6:41 PM, "Jeff Nadler" wrote: > Maybe this is a pretty esoteric implementation, but I'm seeing some bad > behavior with backpressure plus multiple Kafka

Re: spark streaming kafka connector questions

2016-09-08 Thread Cody Koeninger
- If you're seeing repeated attempts to process the same message, you should be able to look in the UI or logs and see that a task has failed. Figure out why that task failed before chasing other things.
- You're not using the latest version, the latest version is for spark 2.0. There are two

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
ask how big an overhead is that? > > It happens intermittently and each time it happens retry is successful. > > Srikanth > > On Wed, Sep 7, 2016 at 3:55 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> That's not what I would have expected to happen with a

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
k.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) > at java.util.concurrent.ThreadPoolExecuto

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
dy. Setting poll timeout helped. >> Our network is fine but brokers are not fully provisioned in test cluster. >> But there isn't enough load to max out on broker capacity. >> Curious that kafkacat running on the same node doesn't have any issues. >> >> Srikanth >>

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Cody Koeninger
The restart doesn't have to be all that > silent. It requires us to set a flag. I thought auto.offset.reset is that > flag. > But there isn't much I can do at this point given that retention has cleaned > things up. > The app has to start. Let admins address the data loss on the side. >

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Cody Koeninger
tion instead of honoring > auto.offset.reset. > It isn't a corner case where retention expired after driver created a batch. > Its easily reproducible and consistent. > > On Tue, Sep 6, 2016 at 3:34 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> You don't

Re: Q: Multiple spark streaming app, one kafka topic, same consumer group

2016-09-06 Thread Cody Koeninger
In general, see the material linked from https://github.com/koeninger/kafka-exactly-once if you want a better understanding of the direct stream. For spark-streaming-kafka-0-8, the direct stream doesn't really care about consumer group, since it uses the simple consumer. For the 0.10 version,

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-06 Thread Cody Koeninger
on was why I got the exception instead of it using > auto.offset.reset on restart? > > > > > On Tue, Sep 6, 2016 at 10:48 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> If you leave enable.auto.commit set to true, it will commit offsets to >>

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-06 Thread Cody Koeninger
> enable.auto.commit = true > auto.offset.reset = latest > > Srikanth > > On Sat, Sep 3, 2016 at 8:59 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Seems like you're confused about the purpose of that line of code, it >> applies to executors, not the driver.

Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-03 Thread Cody Koeninger
r > temporarily excluding partitions is there any way I can supply > topic-partition info on the fly at the beginning of every pull dynamically. > Will streaminglistener be of any help? > > On Fri, Sep 2, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org> > wrote: > >

Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread Cody Koeninger
If you just want to pause the whole stream, just stop the app and then restart it when you're ready. If you want to do some type of per-partition manipulation, you're going to need to write some code. The 0.10 integration makes the underlying kafka consumer pluggable, so you may be able to wrap

Re: how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?

2016-09-01 Thread Cody Koeninger
Why not just use different files for Kafka? Nothing else in Spark should be using those Kafka configuration parameters. On Thu, Sep 1, 2016 at 3:26 AM, Eric Ho wrote: > I'm interested in what I should put into the trustStore file, not just for > Spark but also for Kafka

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
; Warden Ave > Markham, ON L6G 1C7 > Canada > > > > - Original message - > From: Cody Koeninger <c...@koeninger.org> > To: Eric Ho <e...@analyticsmd.com> > Cc: "user@spark.apache.org" <user@spark.apache.org> > Subject: Re: Spark to

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
Encryption is only available in spark-streaming-kafka-0-10, not 0-8. You enable it the same way you enable it for the Kafka project's new consumer, by setting kafka configuration parameters appropriately. http://kafka.apache.org/documentation.html#security_ssl On Wed, Aug 31, 2016 at 2:03 AM,
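
For concreteness, the relevant consumer properties go straight into the kafkaParams map passed to the 0-10 direct stream. Paths, passwords and the port are placeholders, and the keystore entries only matter if the broker requires client authentication.

    import org.apache.kafka.common.serialization.StringDeserializer

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9093",  // the broker's SSL listener
      "security.protocol" -> "SSL",
      "ssl.truststore.location" -> "/etc/kafka/client.truststore.jks",
      "ssl.truststore.password" -> "changeit",
      "ssl.keystore.location" -> "/etc/kafka/client.keystore.jks",
      "ssl.keystore.password" -> "changeit",
      "ssl.key.password" -> "changeit",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "ssl-example-group"
    )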

Re: Model abstract class in spark ml

2016-08-31 Thread Cody Koeninger
http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/ On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen wrote: > Weird, I recompiled Spark with a similar change to Model and it seemed > to work but maybe I missed a step in there. > > On Wed, Aug 31, 2016 at 6:33 AM,

Re: Kafka message metadata with Dstreams

2016-08-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/api/java/index.html messageHandler receives a kafka MessageAndMetadata object. Alternatively, if you just need metadata information on a per-partition basis, you can use HasOffsetRanges

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Cody Koeninger
stable scenarios (e.g. > advertised hostname failures on EMR). > > Maelstrom will work I believe even for Spark 1.3 and Kafka 0.8.2.1 (and of > course with the latest Kafka 0.10 as well) > > > On Wed, Aug 24, 2016 at 9:49 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >>

Re: Spark 2.0 with Kafka 0.10 exception

2016-08-23 Thread Cody Koeninger
You can set that poll timeout higher with spark.streaming.kafka.consumer.poll.ms but half a second is fairly generous. I'd try to take a look at what's going on with your network or kafka broker during that time. On Tue, Aug 23, 2016 at 4:44 PM, Srikanth wrote: > Hello,
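
That setting is a plain Spark conf entry; the value below is just an example, not a recommendation.

    val conf = new org.apache.spark.SparkConf()
      // how long an executor's cached consumer will poll before the
      // "Failed to get records ... after polling" assertion fires
      .set("spark.streaming.kafka.consumer.poll.ms", "10000")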

Re: Maelstrom: Kafka integration with Spark

2016-08-23 Thread Cody Koeninger
Were you aware that the spark 2.0 / kafka 0.10 integration also reuses kafka consumer instances on the executors? On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim wrote: > Hi, > > I have released the first version of a new Kafka integration with Spark > that we use in the

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Cody Koeninger
See https://github.com/koeninger/kafka-exactly-once On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed" wrote: > Hi Experts, > > I am looking for some information on how to acheive zero data loss while > working with kafka and Spark. I have searched online and blogs have >

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread Cody Koeninger
both fault tolerance and > application code/config changes without checkpointing? Is there anything > else which checkpointing gives? I might be missing something. > > > Regards, > Chandan > > > On Thu, Aug 18, 2016 at 8:27 PM, Cody Koeninger <c...@koeninger.org>

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread Cody Koeninger
Checkpointing is not kafka-specific. It encompasses metadata about the application. You can't re-use a checkpoint if your application has changed. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash

Re: Rebalancing when adding kafka partitions

2016-08-16 Thread Cody Koeninger
streaming-kafka-0-10-assembly?? > > Srikanth > > On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Hrrm, that's interesting. Did you try with subscribe pattern, out of >> curiosity? >> >> I haven't tested repartitioning

Re: read kafka offset from spark checkpoint

2016-08-15 Thread Cody Koeninger
No, you really shouldn't rely on checkpoints if you can't afford to reprocess from the beginning of your retention (or lose data and start from the latest messages). If you're in a real bind, you might be able to get something out of the serialized data in the checkpoint, but it'd probably be

Re: Rebalancing when adding kafka partitions

2016-08-12 Thread Cody Koeninger
logger.info(s"rdd has ${rdd.getNumPartitions} partitions.") > > > Should I be setting some parameter/config? Is the doc for new integ > available? > > Thanks, > Srikanth > > On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >

Re: KafkaUtils.createStream not picking smallest offset

2016-08-12 Thread Cody Koeninger
e same consumer group name. But this is not working though . Somehow > createstream is picking the offset from some where other than > /consumers/ from zookeeper > > > Sent from Samsung Mobile. > > > > > > > > > Original message

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
16/08/10 18:16:44 INFO JobScheduler: Added jobs for time 1470833204000 ms > 16/08/10 18:16:45 INFO JobScheduler: Added jobs for time 1470833205000 ms > 16/08/10 18:16:46 INFO JobScheduler: Added jobs for time 1470833206000 ms > 16/08/10 18:16:47 INFO JobScheduler: Added jobs for time 1470833

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
Those logs you're posting are from right after your failure, they don't include what actually went wrong when attempting to read json. Look at your logs more carefully. On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi" wrote: > Hi Siva, > > With below code, it is stuck
