titions) are being handled by the executors at any given time?
>
> Thanks,
>
> Ivan
>
> On Sat, Nov 12, 2016 at 1:25 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> You should not be getting consumer churn on executors at all, that's
>> the whole
s and stages are all successful and even the speculation
> is turned off.
>
> On Sat, Nov 12, 2016 at 9:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Are you certain you aren't getting any failed tasks or other errors?
>> Output actions like foreach aren't ex
c and a consumer
>> group that is made from the topic (each RDD uses a different topic and
>> group), process the data and write to Parquet files.
>>
>> Per my Nov 10th post, we still get polling timeouts unless the poll.ms is
>> set to something like 10 seconds. We also
Are you certain you aren't getting any failed tasks or other errors?
Output actions like foreach aren't exactly once and will be retried on
failures.
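Because output actions are at-least-once, the usual defense is to make the write itself idempotent. A minimal sketch of that pattern (the `Record` type, the `MyStore` client, and its `upsert` method are hypothetical placeholders, not a real library):

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type and store; the point is that replaying the
// same write after a task retry leaves the store unchanged.
case class Record(id: String, payload: String)

class MyStore {
  private val data = scala.collection.concurrent.TrieMap[String, String]()
  def upsert(k: String, v: String): Unit = data.put(k, v) // keyed write: replay is a no-op
  def close(): Unit = ()
}
object MyStore { def connect(): MyStore = new MyStore }

def saveIdempotently(stream: DStream[Record]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // one connection per partition, not per record
      val store = MyStore.connect()
      try {
        records.foreach(r => store.upsert(r.id, r.payload))
      } finally store.close()
    }
  }
}
```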
On Nov 12, 2016 06:36, "dev loper" wrote:
> Dear fellow Spark Users,
>
> My Spark Streaming application (Spark 2.0 , on AWS
e()
>> }
>>
>> I am not sure why the concurrent issue is there as I have tried to debug
>> and also looked at the KafkaConsumer code as well, but everything looks
>> like it should not occur. The thing to figure out is why when running in
>> parallel do
The basic structured streaming source for Kafka is already committed to
master, build it and try it out.
If you're already using Kafka I don't really see much point in trying to
put Akka in between it and Spark.
On Nov 10, 2016 02:25, "vincent gromakowski"
wrote:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
specifically
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
Have you set enable.auto.commit to false?
The new consumer stores offsets in kafka, so the idea of specifically
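The pattern from the storing-offsets docs linked above looks like this: disable auto-commit in the consumer params, then commit the batch's offset ranges yourself after your output action succeeds (broker address, group id, and topic are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",            // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",                         // placeholder
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean) // commit manually instead
)

// `stream` is a direct stream created elsewhere with these params
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process and write the batch first ...
  // only then record progress, so a crash replays rather than skips
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```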
I may be misunderstanding, but you need to take each kafka message,
and turn it into multiple items in the transformed rdd?
so something like (pseudocode):
stream.flatMap { message =>
val items = new ArrayBuffer
var parser = null
message.split("\n").foreach { line =>
if // it's a
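The truncated pseudocode above can be fleshed out as a pure-Scala sketch: one message expands into several items, with a small piece of parser state (the last header seen) carried across the lines of a single message. The `H|` header marker is an assumption for illustration:

```scala
// Expand one multi-line message into (header, detail) pairs.
def explode(message: String): Seq[(String, String)] = {
  val items = scala.collection.mutable.ArrayBuffer[(String, String)]()
  var currentHeader: String = null
  message.split("\n").foreach { line =>
    if (line.startsWith("H|"))          // it's a header: remember it
      currentHeader = line.drop(2)
    else                                // it's a detail: emit with current header
      items += ((currentHeader, line))
  }
  items.toSeq
}

// In the stream, hypothetically: stream.flatMap(record => explode(record.value()))
```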
ed to set poll.ms to 30 seconds and still get the issue.
> Something is not right here and it just does not seem right. As I mentioned with the
> streaming application, with Spark 1.6 and Kafka 0.8.x we never saw this
> issue. We have been running the same basic logic for over a year now without
>
ce
> spark.task.maxFailures is reached? Has anyone else seen this behavior of a
> streaming application skipping over failed microbatches?
>
> Thanks,
> Sean
>
>
>> On Nov 4, 2016, at 2:48 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> So basicall
;hw...@qilinsoft.com> wrote:
> Cody, thanks for the response. Do you think it's a Spark issue or Kafka
> issue? Can you please let me know the jira ticket number?
>
> -Original Message-
> From: Cody Koeninger [mailto:c...@koeninger.org]
> Sent: November 4, 2016 22:35
>
e since Spark now requires one to assign or subscribe to a topic in
> order to even update the offsets. In 0.8.2.x you did not have to worry about
> that. This approach limits your exposure to duplicate data since idempotent
> records are not entirely possible in our scenario. At least
So just to be clear, the answers to my questions are
- you are not using different group ids, you're using the same group
id everywhere
- you are committing offsets manually
Right?
If you want to eliminate network or kafka misbehavior as a source,
tune poll.ms upwards even higher.
You must
- are you using different group ids for the different streams?
- are you manually committing offsets?
- what are the values of your kafka-related settings?
On Fri, Nov 4, 2016 at 12:20 PM, vonnagy wrote:
> I am getting the issues using Spark 2.0.1 and Kafka 0.10. I have two jobs,
That's not what I would expect from the underlying kafka consumer, no.
But this particular case (no matching topics, then add a topic after
SubscribePattern stream starts) actually isn't part of unit tests for
either the DStream or the structured stream.
I'll make a jira ticket.
On Thu, Nov 3,
t time windowed
> aggregation for several weeks now.
>
> On Tue, Nov 1, 2016 at 6:18 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Look at the resolved subtasks attached to that ticket you linked.
>> Some of them are unresolved, but basic functionality is there.
Look at the resolved subtasks attached to that ticket you linked.
Some of them are unresolved, but basic functionality is there.
On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande
wrote:
> Hi Michael,
>
> Thanks for the reply.
>
> The following link says there is a open
>
> The reason I ask is that it simply looks strange to me that Spark will have
> to shuffle each time my input stream and "state" stream during the
> mapWithState operation when I now for sure that those two streams will
> always share same keys and will not need ac
If you call a transformation on an rdd using the same partitioner as that
rdd, no shuffle will occur. KafkaRDD doesn't have a partitioner, there's
no consistent partitioning scheme that works for all kafka uses. You can
wrap each kafkardd with an rdd that has a custom partitioner that you write
to operationalize
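One way to sketch that wrapper: an RDD that passes its parent's partitions and data through unchanged but claims a `Partitioner`, so a following transformation keyed the same way skips the shuffle. This is an illustration, not Spark API; you yourself must guarantee the keys really are laid out as the partitioner claims, or results will be silently wrong:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Pass-through wrapper that asserts a known partitioning on `prev`
// (e.g. a KafkaRDD whose messages are already keyed per partition).
class KnownPartitioningRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)], part: Partitioner)
  extends RDD[(K, V)](prev) {

  // advertise the partitioner so co-partitioned ops avoid a shuffle
  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, ctx: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, ctx) // no data movement, just delegation
}
```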
> topic creation. Not a big deal but now we'll have to make sure we execute
> the 'create-topics' type of task or shell script at install time.
>
> This seems like a Kafka doc issue potentially, to explain what exactly one
> can expect from the auto.create.topics.en
application is running and the files are created right. But as soon as I
>>> restart the application, it goes back to fromOffset as 0. Any thoughts?
>>>
>>> regards
>>> Sunita
>>>
>>> On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind <sunitarv...@gmail.c
t;Done reading offsets from ZooKeeper. Took " +
>> (System.currentTimeMillis() - start))
>>
>> Some(offsets)
>> case None =>
>> LogHandler.log.info("No offsets found in ZooKeeper. Took " +
>> (System.currentTimeMillis() - start
Kafka consumers should be backwards compatible with kafka brokers, so
at the very least you should be able to use the
streaming-spark-kafka-0-10 to do what you're talking about.
On Tue, Oct 25, 2016 at 4:30 AM, Prabhu GS wrote:
> Hi,
>
> I would like to know if the same
al.ms can be set differently.
> I'll leave it to you on how to add this to docs!
>
>
> On Thu, Oct 20, 2016 at 1:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Right on, I put in a PR to make a note of that in the docs.
>>
>> On Thu, Oct 20, 2016 at 1
gain, I know what I have to do :)
>
> On Fri, Oct 21, 2016 at 5:05 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> 0. If your processing time is regularly greater than your batch
>> interval you're going to have problems anyway. Investigate this more,
>> set m
0. If your processing time is regularly greater than your batch
interval you're going to have problems anyway. Investigate this more,
set maxRatePerPartition, something.
1. That's personally what I tend to do.
2. Why are you relying on checkpoints if you're storing offset state
in the database?
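If the database is the source of truth for offsets, checkpoints are redundant for recovery: read the stored offsets at startup and assign them directly, roughly as the storing-offsets docs show (here `loadOffsetsFromDb` is a hypothetical helper, and `ssc`/`kafkaParams` are assumed to exist as in earlier snippets):

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

// Hypothetical: returns the last committed offset per partition
// from your own transactional store.
val fromOffsets: Map[TopicPartition, Long] = loadOffsetsFromDb()

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](
    fromOffsets.keys.toList,  // exactly these partitions
    kafkaParams,
    fromOffsets               // start each one where the DB says
  )
)
```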
Right on, I put in a PR to make a note of that in the docs.
On Thu, Oct 20, 2016 at 12:13 PM, Srikanth <srikanth...@gmail.com> wrote:
> Yeah, setting those params helped.
>
> On Wed, Oct 19, 2016 at 1:32 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>
Iterator.next(KafkaRDD.scala:193)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$21.next(Iterator.scala:838)
>
>
> On Wed, Sep 7, 2016 at 3:55 PM, Cod
> OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
> CanCommitOffsets cco = (CanCommitOffsets) directKafkaStream.dstream();
> cco.commitAsync(offsets);
>
> I tried setting "max.poll.records" to 1000 but this did not help.
>
> Any idea what could be wrong?
>
utor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.AssertionError: assertion failed: Failed to get records for
> spark-executor-null mytopic2 0 2 after polling for 512
> ==
>
> -Original Message-
> From: Cody Koeninger [mailto:c...@koeninger.org]
> Sent: 201
I can't be sure, no.
On Fri, Oct 14, 2016 at 3:06 AM, Julian Keppel
<juliankeppel1...@gmail.com> wrote:
> Okay, thank you! Can you say, when this feature will be released?
>
> 2016-10-13 16:29 GMT+02:00 Cody Koeninger <c...@koeninger.org>:
>>
>> As Sean said, it'
fka.maxRatePerPartition", "10")
> conf.set("spark.streaming.backpressure.enabled", "true")
>
> That's not normal, is it? Do you notice anything odd in my logs?
>
> Thanks a lot.
>
>
>
> On 10/12/2016 07:31 PM, Cody Koeninger w
As Sean said, it's unreleased. If you want to try it out, build spark
http://spark.apache.org/docs/latest/building-spark.html
The easiest way to include the jar is probably to use mvn install to
put it in your local repository, then link it in your application's
mvn or sbt build file as
response.
>
>
>
> So Kafka direct stream actually has consumer on both the driver and
> executor? Can you please provide more details? Thank you very much!
>
>
>
> ________
>
> From: Cody Koeninger [mailto:c...@koeninger.org]
> Sent:
rg/e730492453.png
notice the cutover point
On Wed, Oct 12, 2016 at 11:00 AM, Samy Dindane <s...@dindane.com> wrote:
> I am 100% sure.
>
> println(conf.get("spark.streaming.backpressure.enabled")) prints true.
>
>
> On 10/12/2016 05:48 PM, Cody Koeninger w
gt; Do you have any idea about why aren't backpressure working? How to debug
> this?
>
>
> On 10/11/2016 06:08 PM, Cody Koeninger wrote:
>>
>> http://spark.apache.org/docs/latest/configuration.html
>>
>> "This rate is upper bounded by the values
, are you talking about
> repartition using partitionBy()?
>
>
> On 10/11/2016 01:23 AM, Cody Koeninger wrote:
>>
>> Repartition almost always involves a shuffle.
>>
>> Let me see if I can explain the recovery stuff...
>>
>> Say you start with two kafka part
its set to none for the executors, because otherwise they wont do exactly
what the driver told them to do.
you should be able to set up the driver consumer to determine batches
however you want, though.
On Wednesday, October 12, 2016, Haopu Wang wrote:
> Hi,
>
>
>
> I want
http://spark.apache.org/docs/latest/configuration.html
"This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see
below)."
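So for backpressure to have an effective ceiling, both settings are typically used together; a sketch of the relevant configuration (the rate values are examples, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // let the rate estimator adapt batch sizes to processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // hard per-partition cap for the direct stream; backpressure
  // stays at or below this ceiling
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // receiver-based streams use this cap instead
  .set("spark.streaming.receiver.maxRate", "1000")
```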
On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane wrote:
> Hi,
>
> Is it
involving mapWithState
> of course :) I'm just wondering why it doesn't support this use case yet.
>
> On Tue, Oct 11, 2016 at 3:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> They're telling you not to use the old function because it's linear on the
>> total
They're telling you not to use the old function because it's linear on the
total number of keys, not keys in the batch, so it's slow.
But if that's what you really want, go ahead and do it, and see if it
performs well enough.
On Oct 11, 2016 6:28 AM, "DandyDev" wrote:
Hi
it works. Thanks!
Although changing this is OK for us, I am interested in the why :-) With
the old connector this was not a problem, nor is it AFAIK with the pure
Kafka consumer API.
2016-10-11 14:30 GMT+02:00 Cody Koeninger <c...@koeninger.org>:
> Just out of curiosity, have you tried
c: String):
>> InputDStream[ConsumerRecord[String, Bytes]] = {
>> KafkaUtils.createDirectStream[String, Bytes](ssc,
>> LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String,
>> Bytes](Set(topic), kafkaParams))
>> }
>>
>>
ync(offsets);
>
> java.lang.ClassCastException:
> org.apache.spark.streaming.api.java.JavaInputDStream
> cannot be cast to org.apache.spark.streaming.kafka010.CanCommitOffsets
> at SparkTest.lambda$0(SparkTest.java:103)
>
> Best regards,
> Max
>
>
> 2016-10-10 20:18 GMT+02:00 Cody Koeninger <c...@koeninger.org>:
, starting
at topic-0 offsets 60, topic-1 offsets 66
Clear as mud?
On Mon, Oct 10, 2016 at 5:36 PM, Samy Dindane <s...@dindane.com> wrote:
>
>
> On 10/10/2016 8:14 PM, Cody Koeninger wrote:
>>
>> Glad it was helpful :)
>>
>> As far as executors, my e
This should give you hints on the necessary cast:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#tab_java_2
The main ugly thing there is that the java rdd is wrapping the scala
rdd, so you need to unwrap one layer via rdd.rdd()
If anyone wants to work on a PR to update
-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartition.scala
>> I am trying to understand how this would work if an executor crashes, so I
>> tried making one crash manually, but I noticed it kills the whole job
>> instead of creating another executor to resume the task.
What is it you're actually trying to accomplish?
On Mon, Oct 10, 2016 at 5:26 AM, Samy Dindane wrote:
> I managed to make a specific executor crash by using
> TaskContext.get.partitionId and throwing an exception for a specific
> executor.
>
> The issue I have now is that the
> Spark standalone is not Yarn… or secure for that matter… ;-)
>
>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Spark streaming helps with aggregation because
>>
>> A. raw kafka consumers have no built in framework
t, query the
>> respective table in Cassandra / Postgres. (select .. from data where user =
>> ? and date between and and some_field = ?)
>>
>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>> from Cassandra / Postgres via the Kafka cons
Spark streaming helps with aggregation because
A. raw kafka consumers have no built in framework for shuffling
amongst nodes, short of writing into an intermediate topic (I'm not
touching Kafka Streams here, I don't have experience), and
B. it deals with batches, so you can transactionally
a should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org>
>> wrote
otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger &l
Spark direct stream is just fine for this use case.
> But why postgres and not cassandra?
> Is there anything specific here that i may not be aware?
>
> Thanks
> Deepak
>
> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> How are y
How are you going to handle etl failures? Do you care about lost /
duplicated data? Are your writes idempotent?
Absent any other information about the problem, I'd stay away from
cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
feeding postgres.
On Thu, Sep 29, 2016 at 10:04
uest-7,
> sapxm.adserving.log.ad_request-9] for group spark_aggregation_job-kafka010
> 16/09/28 08:16:48 INFO ConsumerCoordinator: Revoking previously assigned
> partitions [sapxm.adserving.log.view-3, sapxm.adserving.log.view-4,
> sapxm.adserving.log.view-1, sapxm.adserving.log.view-2,
>
What's the actual stacktrace / exception you're getting related to
commit failure?
On Tue, Sep 27, 2016 at 9:37 AM, Matthias Niehoff
wrote:
> Hi everybody,
>
> i am using the new Kafka Receiver for Spark Streaming for my Job. When
> running with old consumer it
Do you have a minimal example of how to reproduce the problem, that
doesn't depend on Cassandra?
On Mon, Sep 26, 2016 at 4:10 PM, Erwan ALLAIN wrote:
> Hi
>
> I'm working with
> - Kafka 0.8.2
> - Spark Streaming (2.0) direct input stream.
> - cassandra 3.0
>
> My batch
Either artifact should work with 0.10 brokers. The 0.10 integration has
more features but is still marked experimental.
On Sep 26, 2016 3:41 AM, "Haopu Wang" wrote:
> Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is
> compatible with Kafka 0.8.2.1."
Spark alone isn't going to solve this problem, because you have no reliable
way of making sure a given worker has a consistent shard of the messages
seen so far, especially if there's an arbitrary amount of delay between
duplicate messages. You need a DHT or something equivalent.
On Sep 24, 2016
ct Approach (No
>> receivers) in Spark streaming.
>>
>> I am not sure if I have that leverage to upgrade at this point, but do you
>> know if Spark 1.6.1 to Spark 2.0 jump is smooth usually or does it involve
>> lot of hick-ups.
>> Also is there a migratio
Do you have the ability to try using Spark 2.0 with the
streaming-kafka-0-10 connector?
I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it
would be good to rule that out.
On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual . wrote:
> Hello,
>
> I am trying to
-dev +user
That warning pretty much means what it says - the consumer tried to
get records for the given partition / offset, and couldn't do so after
polling the kafka broker for X amount of time.
If that only happens when you put additional load on Kafka via
producing, the first thing I'd do is
reduceByKey. Does this work on _all_ elements
> in the stream, as they're coming in, or is this a transformation per
> RDD/micro batch? I assume the latter, otherwise it would be more akin to
> updateStateByKey, right?
>
> On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger
for the details,
> much appreciated.
>
> http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
>
> On Tue, Sep 13, 2016 at 8:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> 1. see
>> http://spark.apache.org/docs/lat
That version of createDirectStream doesn't handle partition changes.
You can work around it by starting the job again.
The spark 2.0 consumer for kafka 0.10 should handle partition changes
via SubscribePattern.
On Tue, Sep 13, 2016 at 7:13 PM, vinay gupta
wrote:
>
1. see
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
look for HasOffsetRange. If you really want the info per-message
rather than per-partition, createRDD has an overload that takes a
messageHandler from MessageAndMetadata to
han 1 partition based on size? Or is it something that the "source"
> (SocketStream, KafkaStream etc.) decides?
>
> On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> A micro batch is an RDD.
>>
>> An RDD has partitions, s
A micro batch is an RDD.
An RDD has partitions, so different executors can work on different
partitions concurrently.
Don't think of that as multiple micro-batches within a time slot.
It's one RDD within a time slot, with multiple partitions.
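In code, the one-RDD-per-interval model looks like this (a sketch; `stream` is assumed to be a direct stream created elsewhere):

```scala
// Called once per batch interval with that interval's single RDD.
stream.foreachRDD { rdd =>
  // For the direct stream there is one RDD partition per Kafka
  // partition; partitions run as concurrent tasks on the executors.
  println(s"batch has ${rdd.getNumPartitions} partitions")
  rdd.foreachPartition { records =>
    // executed in parallel, one task per partition
    records.foreach(_ => ())
  }
}
```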
On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie
Are you using the receiver based stream?
On Sep 10, 2016 15:45, "Eric Ho" wrote:
> I notice that some Spark programs would contact something like 'zoo1:2181'
> when trying to suck data out of Kafka.
>
> Does the kafka data actually transported out over this port ?
>
>
Hard to say without seeing the code, but if you do multiple actions on an
RDD without caching, the RDD will be computed multiple times.
On Sep 10, 2016 2:43 AM, "Cheng Yi" wrote:
After some investigation, the problem i see is liked caused by a filter and
union of the
Does the same thing happen if you're only using direct stream plus back
pressure, not the receiver stream?
On Sep 9, 2016 6:41 PM, "Jeff Nadler" wrote:
> Maybe this is a pretty esoteric implementation, but I'm seeing some bad
> behavior with backpressure plus multiple Kafka
- If you're seeing repeated attempts to process the same message, you
should be able to look in the UI or logs and see that a task has
failed. Figure out why that task failed before chasing other things
- You're not using the latest version, the latest version is for spark
2.0. There are two
ask how big an overhead is that?
>
> It happens intermittently and each time it happens retry is successful.
>
> Srikanth
>
> On Wed, Sep 7, 2016 at 3:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> That's not what I would have expected to happen with a
k.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecuto
dy. Setting poll timeout helped.
>> Our network is fine but brokers are not fully provisioned in test cluster.
>> But there isn't enough load to max out on broker capacity.
>> Curious that kafkacat running on the same node doesn't have any issues.
>>
>> Srikanth
The restart doesn't have to be all that
> silent. It requires us to set a flag. I thought auto.offset.reset is that
> flag.
> But there isn't much I can do at this point given that retention has cleaned
> things up.
> The app has to start. Let admins address the data loss on the side.
>
tion instead of honoring
> auto.offset.reset.
> It isn't a corner case where retention expired after driver created a batch.
> Its easily reproducible and consistent.
>
> On Tue, Sep 6, 2016 at 3:34 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> You don't
In general, see the material linked from
https://github.com/koeninger/kafka-exactly-once if you want a better
understanding of the direct stream.
For spark-streaming-kafka-0-8, the direct stream doesn't really care
about consumer group, since it uses the simple consumer. For the 0.10
version,
on was why I got the exception instead of it using
> auto.offset.reset on restart?
>
>
>
>
> On Tue, Sep 6, 2016 at 10:48 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> If you leave enable.auto.commit set to true, it will commit offsets to
>>
> enable.auto.commit = true
> auto.offset.reset = latest
>
> Srikanth
>
> On Sat, Sep 3, 2016 at 8:59 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Seems like you're confused about the purpose of that line of code, it
>> applies to executors, not the driver.
r
> temporarily excluding partitions is there any way I can supply
> topic-partition info on the fly at the beginning of every pull dynamically.
> Will streaminglistener be of any help?
>
> On Fri, Sep 2, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>
If you just want to pause the whole stream, just stop the app and then
restart it when you're ready.
If you want to do some type of per-partition manipulation, you're
going to need to write some code. The 0.10 integration makes the
underlying kafka consumer pluggable, so you may be able to wrap
Why not just use different files for Kafka? Nothing else in Spark
should be using those Kafka configuration parameters.
On Thu, Sep 1, 2016 at 3:26 AM, Eric Ho wrote:
> I'm interested in what I should put into the trustStore file, not just for
> Spark but also for Kafka
; Warden Ave
> Markham, ON L6G 1C7
> Canada
>
>
>
> - Original message -
> From: Cody Koeninger <c...@koeninger.org>
> To: Eric Ho <e...@analyticsmd.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Spark to
Encryption is only available in spark-streaming-kafka-0-10, not 0-8.
You enable it the same way you enable it for the Kafka project's new
consumer, by setting kafka configuration parameters appropriately.
http://kafka.apache.org/documentation.html#security_ssl
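Concretely, the SSL settings go straight into the kafka params you pass to the direct stream. A sketch using the standard Kafka client config keys (paths, passwords, and the SSL listener port are placeholders):

```scala
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9093",  // broker's SSL listener, assumed port
  "security.protocol" -> "SSL",
  // truststore holding the broker's CA certificate
  "ssl.truststore.location" -> "/path/to/kafka.client.truststore.jks",
  "ssl.truststore.password" -> "changeit",
  // keystore is only needed if the broker requires client authentication
  "ssl.keystore.location" -> "/path/to/kafka.client.keystore.jks",
  "ssl.keystore.password" -> "changeit"
)
```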
On Wed, Aug 31, 2016 at 2:03 AM,
http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/
On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen wrote:
> Weird, I recompiled Spark with a similar change to Model and it seemed
> to work but maybe I missed a step in there.
>
> On Wed, Aug 31, 2016 at 6:33 AM,
http://spark.apache.org/docs/latest/api/java/index.html
messageHandler receives a kafka MessageAndMetadata object.
Alternatively, if you just need metadata information on a
per-partition basis, you can use HasOffsetRanges
stable scenarios (e.g.
> advertised hostname failures on EMR).
>
> Maelstrom will work I believe even for Spark 1.3 and Kafka 0.8.2.1 (and of
> course with the latest Kafka 0.10 as well)
>
>
> On Wed, Aug 24, 2016 at 9:49 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
You can set that poll timeout higher with
spark.streaming.kafka.consumer.poll.ms
but half a second is fairly generous. I'd try to take a look at
what's going on with your network or kafka broker during that time.
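For reference, raising that timeout is a one-line config change (the 10-second value is just an example):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // how long an executor's cached consumer waits in poll() for records
  // before failing the task; raise only if the broker is genuinely slow
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")
```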
On Tue, Aug 23, 2016 at 4:44 PM, Srikanth wrote:
> Hello,
Were you aware that the spark 2.0 / kafka 0.10 integration also reuses
kafka consumer instances on the executors?
On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim wrote:
> Hi,
>
> I have released the first version of a new Kafka integration with Spark
> that we use in the
See
https://github.com/koeninger/kafka-exactly-once
On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed"
wrote:
> Hi Experts,
>
> I am looking for some information on how to acheive zero data loss while
> working with kafka and Spark. I have searched online and blogs have
>
both fault tolerance and
> application code/config changes without checkpointing? Is there anything
> else which checkpointing gives? I might be missing something.
>
>
> Regards,
> Chandan
>
>
> On Thu, Aug 18, 2016 at 8:27 PM, Cody Koeninger <c...@koeninger.org>
Checkpointing is not kafka-specific. It encompasses metadata about the
application. You can't re-use a checkpoint if your application has changed.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code
On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash
streaming-kafka-0-10-assembly??
>
> Srikanth
>
> On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Hrrm, that's interesting. Did you try with subscribe pattern, out of
>> curiosity?
>>
>> I haven't tested repartitioning
No, you really shouldn't rely on checkpoints if you cant afford to
reprocess from the beginning of your retention (or lose data and start
from the latest messages).
If you're in a real bind, you might be able to get something out of
the serialized data in the checkpoint, but it'd probably be
logger.info(s"rdd has ${rdd.getNumPartitions} partitions.")
>
>
> Should I be setting some parameter/config? Is the doc for new integ
> available?
>
> Thanks,
> Srikanth
>
> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>
e same consumer group name. But this is not working though . Somehow
> createstream is picking the offset from some where other than
> /consumers/ from zookeeper
>
>
> Sent from Samsung Mobile.
>
>
>
>
>
>
>
>
> Original message
16/08/10 18:16:44 INFO JobScheduler: Added jobs for time 1470833204000 ms
> 16/08/10 18:16:45 INFO JobScheduler: Added jobs for time 1470833205000 ms
> 16/08/10 18:16:46 INFO JobScheduler: Added jobs for time 1470833206000 ms
> 16/08/10 18:16:47 INFO JobScheduler: Added jobs for time 1470833
Those logs you're posting are from right after your failure, they don't
include what actually went wrong when attempting to read json. Look at your
logs more carefully.
On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi"
wrote:
> Hi Siva,
>
> With below code, it is stuck