to start the stream.
On Tue, Sep 8, 2015 at 1:19 PM, Dan Dutrow <dan.dut...@gmail.com> wrote:
> Yes, but one implementation checks those flags and the other one doesn't.
> I would think they should be consistent.
>
> On Tue, Sep 8, 2015 at 1:32 PM Cody Koeninger <c...
The direct stream just makes a spark partition per kafka partition, so if
those partitions are not getting evenly distributed among executors,
something else is probably wrong with your configuration.
If you replace the kafka stream with a dummy rdd created with e.g.
sc.parallelize, what happens?
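For illustration, a minimal sketch of that test, assuming an existing SparkContext sc and a 5 second batch (all names below are placeholders, not from the thread):

    import scala.collection.mutable
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    // Replace the Kafka stream with a queue of dummy RDDs of comparable size.
    val dummy = mutable.Queue(sc.parallelize(1 to 100000, 8))
    val stream = ssc.queueStream(dummy)
    stream.foreachRDD { rdd =>
      // If this work is still unevenly distributed across executors,
      // the problem is in the cluster/config, not the Kafka stream.
      rdd.map(_ * 2).count()
    }
    ssc.start()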
The answer already given is correct. You shouldn't doubt this, because
you've already seen the shuffle data change accordingly.
On Fri, Sep 4, 2015 at 11:25 AM, Shushant Arora
wrote:
> But Kafka stream has underlying RDD which consists of offset ranges only -
> so
No, there isn't a partitioner for KafkaRDD (KafkaRDD may not even be a pair
rdd, for instance).
It sounds to me like if it's a self-join, you should be able to do it in a
single mapPartition operation.
On Wed, Sep 2, 2015 at 3:06 PM, Chen Song wrote:
> I have a stream
Yeah, in general if you're changing the jar you can't recover the
checkpoint.
If you're just changing parameters, why not externalize those in a
configuration file so your jar doesn't change? I tend to stick even my
app-specific parameters in an external spark config so everything is in one place.
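As a sketch of that approach (the property names below are hypothetical; keys need a spark. prefix so spark-submit --properties-file passes them through):

    import org.apache.spark.SparkConf

    // spark-submit --properties-file myapp.conf, with e.g.
    //   spark.myapp.kafka.brokers=broker1:9092,broker2:9092
    //   spark.myapp.kafka.topics=events
    val conf = new SparkConf()
    val brokers = conf.get("spark.myapp.kafka.brokers", "localhost:9092")
    val topics  = conf.get("spark.myapp.kafka.topics", "events").split(",").toSet

Changing either value then only means editing the config file, not rebuilding the jar.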
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726076#comment-14726076 ]
Cody Koeninger commented on SPARK-10320:
You would supply a function, similar to the way
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725719#comment-14725719 ]
Cody Koeninger commented on SPARK-10320:
So if you're changing topics in an event handler, it's
20 batches .
> Will that be an issue?
> And is there any setting in directkafkastream to enforce not calling
> further compute() methods after a threshold of scheduling queue size, say
> (50 batches)? Once queue size comes back below the threshold, call compute
> and enqueue the
do this in compute method since compute would have been called at
> current batch queue time - but condition is set at previous batch run time.
>
>
> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> It's at the time compute() gets called, which should b
It's at the time compute() gets called, which should be near the time the
batch should have been queued.
On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora
wrote:
> Hi
>
> In spark streaming 1.3 with kafka- when does driver bring latest offsets
> of this run - at start of
f offsets
> being checkpointed at end of each batch.
>
> Will it be possible then to reset the offset.
>
> On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> No, if you start arbitrarily messing around with offset ranges after
>> compute is
t Can I have infinite retries instead of
> killing the app? What should be the value of retries (-1) will work or
> something else ?
>
> On Thu, Aug 27, 2015 at 6:46 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Your kafka broker died or you otherwise had a rebalan
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724074#comment-14724074 ]
Cody Koeninger commented on SPARK-10320:
" If you restart the job and specify a new o
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cody Koeninger updated SPARK-10320:
---
Summary: Kafka Support new topic subscriptions without requiring restart of
the streaming
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724113#comment-14724113 ]
Cody Koeninger commented on SPARK-10320:
It's possible this might be solvable with a user
I wrote some code for this a while back, pretty sure it didn't need access
to anything private in the decision tree / random forest model. If people
want it added to the api I can put together a PR.
I think it's important to have separately parseable operators / operands
though. E.g
Yeah, the direct api uses the simple consumer
On Fri, Aug 28, 2015 at 1:32 PM, Cassa L lcas...@gmail.com wrote:
Hi I am using below Spark jars with Direct Stream API.
spark-streaming-kafka_2.10
When I look at its pom.xml, Kafka libraries that its pulling in is
If you all can say a little more about what your requirements are, maybe we
can get a jira together.
I think the easiest way to deal with this currently is to start a new job
before stopping the old one, which should prevent latency problems.
On Thu, Aug 27, 2015 at 9:24 AM, Sudarshan Kadambi
didn't work for me.
rdd.mapPartitions(partitionOfRecords => {
  DBConnectionInit()
  val results = partitionOfRecords.map(..)
  DBConnection.commit()
})
Best regards,
Ahmed Atef Nawwar
Data Management Big Data Consultant
On Thu, Aug 27, 2015 at 4:16 PM, Cody Koeninger c
.
On Wed, Aug 26, 2015 at 9:32 PM, Cody Koeninger c...@koeninger.org
wrote:
see http://kafka.apache.org/documentation.html#consumerconfigs
fetch.message.max.bytes
in the kafka params passed to the constructor
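For illustration, a sketch of passing that setting through kafkaParams to the direct stream (broker list, topic and size are example values; ssc is assumed to exist):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      // Raise the maximum fetch size per partition request (example: 8 MB).
      "fetch.message.max.bytes" -> (8 * 1024 * 1024).toString
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))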
On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora
shushantaror...@gmail.com wrote
Kill the job in the middle of a batch, look at the worker logs to see which
offsets were being processed, verify the messages for those offsets are
read when you start the job back up
On Thu, Aug 27, 2015 at 10:14 AM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi!
I have enabled check
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717342#comment-14717342 ]
Cody Koeninger commented on SPARK-10320:
As I said on the list, the best way
there is a solution to force the action.
I can not use foreachPartition because I need to reuse the new RDD after
some maps.
On Thu, Aug 27, 2015 at 5:11 PM, Cody Koeninger c...@koeninger.org
wrote:
Map is lazy. You need an actual action, or nothing will happen. Use
foreachPartition, or do
= partitionOfRecords.map(..)
DBConnection.commit()
results.foreach(row => {})
results
})
On Thu, Aug 27, 2015 at 10:18 PM, Cody Koeninger c...@koeninger.org
wrote:
You need to return an iterator from the closure you provide to
mapPartitions
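Putting the two replies together, a sketch of the corrected closure (rdd here is the RDD being processed; DBConnectionInit, DBConnection and writeToDb are placeholders carried over from, or added to, the original snippet):

    val written = rdd.mapPartitions { partitionOfRecords =>
      DBConnectionInit()
      // Force the lazy map before committing, otherwise nothing has run yet.
      val results = partitionOfRecords.map(writeToDb).toList
      DBConnection.commit()
      // The closure must return an Iterator.
      results.iterator
    }
    // mapPartitions is itself lazy; an action is still needed to trigger it.
    written.count()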
On Thu, Aug 27, 2015 at 1:42 PM, Ahmed Nawar ahmed.na
On Wed, Aug 26, 2015 at 7:12 AM, Cody Koeninger c...@koeninger.org
wrote:
The first thing that stands out to me is
createOnError = true
Are you sure the checkpoint is actually loading, as opposed to failing
and starting the job anyway? There should be info lines that look like
INFO
On Wed, Aug 26, 2015 at 10:51 AM, Cody Koeninger c...@koeninger.org
wrote:
If your data only changes every few days, why not restart the job every
few days, and just broadcast the data?
Or you can keep a local per-jvm cache with an expiry (e.g. guava cache)
to avoid many mysql reads
see http://kafka.apache.org/documentation.html#consumerconfigs
fetch.message.max.bytes
in the kafka params passed to the constructor
On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora shushantaror...@gmail.com
wrote:
what's the default buffer in spark streaming 1.3 for kafka messages.
Say In
)
}
}
ssc
}
On Tue, Aug 25, 2015 at 12:07 PM, Cody Koeninger c...@koeninger.org
wrote:
Sounds like something's not set up right... can you post a minimal code
example that reproduces the issue?
On Tue, Aug 25, 2015 at 1:40 PM, Susan Zhang suchenz...@gmail.com
wrote:
Yeah. All
That argument takes a function from MessageAndMetadata to whatever you want
your stream to contain.
See
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala#L57
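A minimal sketch of such a handler, assuming ssc, kafkaParams and fromOffsets already exist (the tuple shape is just an example):

    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Turn each Kafka message into (topic, key, value).
    val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String, String)](ssc, kafkaParams, fromOffsets, handler)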
On Wed, Aug 26, 2015 at 7:55 AM, Deepesh Maheshwari
Does the first batch after restart contain all the messages received while
the job was down?
On Tue, Aug 25, 2015 at 12:53 PM, suchenzang suchenz...@gmail.com wrote:
Hello,
I'm using direct spark streaming (from kafka) with checkpointing, and
everything works well until a restart. When I
Are you actually losing messages then?
On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang suchenz...@gmail.com wrote:
No; first batch only contains messages received after the second job
starts (messages come in at a steady rate of about 400/second).
On Tue, Aug 25, 2015 at 11:07 AM, Cody
Sounds like something's not set up right... can you post a minimal code
example that reproduces the issue?
On Tue, Aug 25, 2015 at 1:40 PM, Susan Zhang suchenz...@gmail.com wrote:
Yeah. All messages are lost while the streaming job was down.
On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger c
It doesn't matter if shuffling occurs. Just update ZK from the driver,
inside the foreachRDD, after all your dynamodb updates are done. Since
you're just doing it for monitoring purposes, that should be fine.
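A rough sketch of that driver-side update, assuming zkClient is an org.I0Itec.zkclient.ZkClient created on the driver and the znode layout below is hypothetical:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... do the DynamoDB updates for this batch first ...
      // This block runs on the driver, so zkClient can be used directly.
      ranges.foreach { r =>
        val path = s"/consumers/myapp/offsets/${r.topic}/${r.partition}"
        zkClient.writeData(path, r.untilOffset.toString) // monitoring only; path assumed to exist
      }
    }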
On Mon, Aug 24, 2015 at 12:11 PM, suchenzang suchenz...@gmail.com wrote:
Forgot to
I'd start off by trying to simplify that closure - you don't need the
transform step, or currOffsetRanges to be scoped outside of it. Just do
everything in foreachRDD. Likewise, it looks like zkClient is also scoped
outside of the closure passed to foreachRDD
i.e. you have
zkClient = new
Each spark partition will contain messages only from a single kafka
topic/partition. Use hasOffsetRanges to tell which kafka topic/partition
it's from. See the docs
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
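For illustration, a sketch of that lookup (offsetRanges is indexed by Spark partition, so the partition id selects the matching Kafka topic/partition):

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val r = ranges(TaskContext.get.partitionId)
        // Every record in this Spark partition came from r.topic / r.partition.
        iter.foreach(record => ())
      }
    }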
On Sun, Aug 23, 2015 at 10:56 AM, Spark Enthusiast
happen that you are spending
too much time on disk io etc.
On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:
Sounds like that's happening consistently, not an occasional network
problem?
Look at the Kafka broker logs
Make sure you've configured the correct kafka broker hosts / ports (note
that direct stream does not use zookeeper host / port).
Make sure that host / port is reachable from your driver
In general you cannot guarantee which node an RDD will be processed on.
The preferred location for a kafkardd is the kafka leader for that
partition, if they're deployed on the same machine. If you want to try to
override that behavior, the method is getPreferredLocations
But even in that case,
I'm not clear on your question, can you rephrase it? Also, are you talking
about createStream or createDirectStream?
On Thu, Aug 20, 2015 at 9:48 PM, Gaurav Agarwal gaurav130...@gmail.com
wrote:
Hello
Regarding Spark Streaming and Kafka Partitioning
When i send message on kafka topic with
, Shushant Arora
shushantaror...@gmail.com wrote:
But KafkaRDD[K, V, U, T, R] is not a subclass of RDD[R]; as per java,
generic inheritance is not supported, so a derived class cannot return a
different generic typed subclass from an overridden method.
On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger c
the same error did not come while extending
DirectKafkaInputDStream from InputDStream? Since the new return type
Option[KafkaRDD[K,
V, U, T, R]] is not a subclass of Option[RDD[T]], it should have
failed?
On Tue, Aug 18, 2015 at 7:28 PM, Cody Koeninger c...@koeninger.org
wrote
[][] but it's expecting
scala.Option<org.apache.spark.rdd.RDD<byte[][]>> from the derived class. Is
there something wrong with the code?
On Mon, Aug 17, 2015 at 7:08 PM, Cody Koeninger c...@koeninger.org
wrote:
Look at the definitions of the java-specific
KafkaUtils.createDirectStream methods (the ones
looking forward to it :)
On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger c...@koeninger.org
wrote:
The solution you found is also in the docs:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Java uses an atomic reference because Java doesn't allow you to close over
non-final references.
I'm not clear on your other question.
On Tue, Aug 18, 2015 at 3:43 AM, Petr Novak
I wouldn't expect a kafka producer to be serializable at all... among other
things, it has a background thread
On Tue, Aug 18, 2015 at 4:55 PM, Shenghua(Daniel) Wan wansheng...@gmail.com
wrote:
Hi,
Did anyone see java.util.ConcurrentModificationException when using
broadcast variables?
I
instead of Function1 ?
On Thu, Aug 13, 2015 at 12:16 AM, Cody Koeninger c...@koeninger.org
wrote:
I'm not aware of an existing api per se, but you could create your own
subclass of the DStream that returns None for compute() under certain
conditions.
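There is no such API in Spark itself; a rough sketch of what that subclass could look like (names are made up):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Duration, Time}
    import org.apache.spark.streaming.dstream.DStream

    class SkippableDStream[T: ClassTag](parent: DStream[T], runAt: Time => Boolean)
      extends DStream[T](parent.context) {

      override def dependencies: List[DStream[_]] = List(parent)
      override def slideDuration: Duration = parent.slideDuration

      // Returning None means no job is generated for that batch time.
      override def compute(validTime: Time): Option[RDD[T]] =
        if (runAt(validTime)) parent.compute(validTime) else None
    }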
On Wed, Aug 12, 2015 at 1:03 PM
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697098#comment-14697098 ]
Cody Koeninger commented on SPARK-9947:
---
You already have access to offsets and can
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697166#comment-14697166 ]
Cody Koeninger commented on SPARK-9947:
---
You can't re-use checkpoint data across
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697193#comment-14697193 ]
Cody Koeninger commented on SPARK-9947:
---
Didn't you already say that you were saving
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697188#comment-14697188 ]
Cody Koeninger commented on SPARK-6249:
---
If you want an api that has imprecise
stopped it cannot
be restarted, but I also haven't found a way to stop the app
programmatically.
The batch duration will probably be around 1-10 seconds. I think this is
small enough to not make it a batch job?
Thanks again
On Thu, Aug 13, 2015 at 10:15 PM, Cody Koeninger c...@koeninger.org
Use your email client to send a message to the mailing list from the email
address you used to subscribe?
The message you just sent reached the list
On Fri, Aug 14, 2015 at 9:36 AM, dutrow dan.dut...@gmail.com wrote:
How do I get beyond the This post has NOT been accepted by the mailing
list
I don't entirely agree with that assessment. Not paying for extra cores to
run receivers was about as important as delivery semantics, as far as
motivations for the api.
As I said in the jira tickets on the topic, if you want to use the direct
api and save offsets to ZK, you can. The right way
You'll resume and re-process the rdd that didn't finish
On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Our additional question on checkpointing is basically the logistics of it
--
At which point does the data get written into checkpointing? Is it
written
The current kafka stream implementation assumes the set of topics doesn't
change during operation.
You could either take a crack at writing a subclass that does what you
need; stop/start; or if your batch duration isn't too small, you could run
it as a series of RDDs (using the existing
It's just to limit the maximum number of records a given executor needs to
deal with in a given batch.
Typical usage would be if you're starting a stream from the beginning of a
kafka log, or after a long downtime, and don't want ALL of the messages in
the first batch.
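The relevant knob is spark.streaming.kafka.maxRatePerPartition; a sketch with example numbers (10000 msgs/sec per partition and a 5 second batch caps each Kafka partition at 50000 messages per batch):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("my-stream")
      // Maximum messages per second read from each Kafka partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")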
On Thu, Aug 13, 2015 at
Access the offsets using HasOffsetRanges, save them in your datastore,
provide them as the fromOffsets argument when starting the stream.
See https://github.com/koeninger/kafka-exactly-once
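A sketch of that pattern, where loadOffsets/saveOffsets are hypothetical helpers backed by your own datastore and kafkaParams/ssc are assumed to exist:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // Offsets recorded by the previous run, keyed by topic/partition.
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String)](ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process and write results for this batch ...
      saveOffsets(ranges) // persist untilOffset per topic/partition, ideally with the results
    }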
On Thu, Aug 13, 2015 at 3:53 PM, Stephen Durfey sjdur...@gmail.com wrote:
When deploying a spark
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695939#comment-14695939 ]
Cody Koeninger commented on SPARK-6249:
---
Regarding streaming stats, those should
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695929#comment-14695929 ]
Cody Koeninger commented on SPARK-6249:
---
https://github.com/koeninger/kafka-exactly
will just ignore already
processed events by accessing the counter of the failed task. Is it directly
possible to access an accumulator on a per-task basis without writing to hdfs
or hbase?
On Tue, Aug 11, 2015 at 3:15 AM, Cody Koeninger c...@koeninger.org
wrote:
http://spark.apache.org/docs/latest
[ https://issues.apache.org/jira/browse/SPARK-9780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681886#comment-14681886 ]
Cody Koeninger commented on SPARK-9780:
---
Makes sense, traveling currently but I'll
That looks like it's during recovery from a checkpoint, so it'd be driver
memory not executor memory.
How big is the checkpoint directory that you're trying to restore from?
On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
We're getting the below error.
the original checkpointing directory :( Thanks for the clarification
on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).
On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
wrote:
That looks like it's during recovery from a checkpoint, so it'd be
driver
There's no long-running receiver pushing blocks of messages, so
blockInterval isn't relevant.
Batch interval is what matters.
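The batch interval is fixed when the StreamingContext is created, e.g. (5 seconds is just an example; conf is an existing SparkConf):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(5))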
On Mon, Aug 10, 2015 at 12:52 PM, allonsy luke1...@gmail.com wrote:
Hi everyone,
I recently started using the new Kafka direct approach.
Now, as far as I
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
https://www.youtube.com/watch?v=fXnNEq1v3VA
On Mon, Aug 10, 2015 at 4:32 PM, Shushant
tolerance. Why does Spark persist whole RDD's of data? Shouldn't it be
sufficient to just persist the offsets, to know where to resume from?
Thanks.
On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org
wrote:
You need to keep a certain number of rdds around for checkpointing
of these costs?
On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger c...@koeninger.org
wrote:
The rdd is indeed defined by mostly just the offsets / topic partitions.
On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
You need to keep a certain number of rdds
For direct stream questions:
https://github.com/koeninger/kafka-exactly-once
Yes, it is used in production.
For general spark streaming question:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
On Mon, Aug 10, 2015 at 7:51 AM, Mohit Durgapal durgapalmo...@gmail.com
[ https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679380#comment-14679380 ]
Cody Koeninger commented on SPARK-9476:
---
The stacktraces you posted don't contain
Those jobs will still be created for each valid time, they just may not
have many messages in them
On Mon, Aug 3, 2015 at 11:11 PM, Shushant Arora shushantaror...@gmail.com
wrote:
1.In spark 1.3(Non receiver) - If my batch interval is 1 sec and I don't
set
Have you tried using the console consumer to see if anything is actually
getting published to that topic?
On Tue, Aug 4, 2015 at 11:45 AM, narendra narencs...@gmail.com wrote:
My application takes Twitter4j tweets and publishes those to a topic in
Kafka. Spark Streaming subscribes to that
Just to be clear, did you rebuild your job against spark 1.4.1 as well as
upgrading the cluster?
On Mon, Aug 3, 2015 at 8:36 AM, Netwaver wanglong_...@163.com wrote:
Hi All,
I have a spark streaming + kafka program written by Scala, it
works well on Spark 1.3.1, but after I migrate
Show us the relevant code
On Fri, Jul 31, 2015 at 12:16 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I've instrumented checkpointing per the programming guide and I can tell
that Spark Streaming is creating the checkpoint directories but I'm not
seeing any content being created in
[ https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649123#comment-14649123 ]
Cody Koeninger commented on SPARK-9476:
---
Those all look like info or warn level log
Cody Koeninger created SPARK-9475:
-
Summary: Consistent hadoop config for external/*
Key: SPARK-9475
URL: https://issues.apache.org/jira/browse/SPARK-9475
Project: Spark
Issue Type: Sub-task
Cody Koeninger created SPARK-9473:
-
Summary: Consistent hadoop config for SQL
Key: SPARK-9473
URL: https://issues.apache.org/jira/browse/SPARK-9473
Project: Spark
Issue Type: Sub-task
Cody Koeninger created SPARK-9472:
-
Summary: Consistent hadoop config for streaming
Key: SPARK-9472
URL: https://issues.apache.org/jira/browse/SPARK-9472
Project: Spark
Issue Type: Sub-task
Cody Koeninger created SPARK-9474:
-
Summary: Consistent hadoop config for core
Key: SPARK-9474
URL: https://issues.apache.org/jira/browse/SPARK-9474
Project: Spark
Issue Type: Sub-task
Just so I'm clear, the difference in timing you're talking about is this:
15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at
MetricsSpark.scala:67, took 60.391761 s
15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at
MetricsSpark.scala:67, took 0.531323 s
Are
You can't use checkpoints across code upgrades. That may or may not change
in the future, but for now that's a limitation of spark checkpoints
(regardless of whether you're using Kafka).
Some options:
- Start up the new job on a different cluster, then kill the old job once
it's caught up to
.
On Thu, Jul 30, 2015 at 9:29 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:
I have three topics with one partition each. So each job runs about
one topic.
2015-07-30 16:20 GMT+02:00 Cody Koeninger c...@koeninger.org:
Just so I'm clear, the difference in timing you're talking about
is valid and it has data I can see data in it using another kafka
consumer.
On Jul 30, 2015 7:31 AM, Cody Koeninger c...@koeninger.org wrote:
The last time someone brought this up on the mailing list, the issue
actually was that the topic(s) didn't exist in Kafka at the time the spark
job was running.
On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das t...@databricks.com wrote:
There is a known issue that Kafka cannot return leader
[ https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646043#comment-14646043 ]
Cody Koeninger commented on SPARK-9434:
---
Dmitry, the nabble link you posted
[ https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646113#comment-14646113 ]
Cody Koeninger commented on SPARK-9434:
---
If you want to submit a PR with doc changes
@Ashwin you don't need to append the topic to your data if you're using the
direct stream. You can get the topic from the offset range, see
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
(search for offsetRange)
If you're using the receiver based stream, you'll need to
You don't have to use some other package in order to get access to the
offsets.
Shushant, have you read the available documentation at
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
or watched
Yes, you need to follow the documentation. Configure your stream,
including the transformations made to it, inside the getOrCreate function.
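A sketch of the documented pattern, where checkpointDir, buildKafkaStream and process are placeholders for your own values and code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-stream")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)
      // The stream and every transformation/output on it must be set up here,
      // not after getOrCreate returns, or recovery from the checkpoint breaks.
      val stream = buildKafkaStream(ssc)
      stream.foreachRDD(rdd => rdd.foreach(process))
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()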
On Tue, Jul 28, 2015 at 3:14 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:
I'm using SparkStreaming and I want to configure checkpoint to manage
24, 2015 at 3:32 PM, Cody Koeninger c...@koeninger.org
wrote:
It's really a question of whether you need access to the
MessageAndMetadata, or just the key / value from the message.
If you just need the key/value, dstream map is fine.
In your case, since you need to be able to control
Use foreachPartition and batch the writes
On Sat, Jul 25, 2015 at 9:14 AM, nib...@free.fr wrote:
Hello,
I am new user of Spark, and need to know what could be the best practice
to do the following scenario :
- Spark Streaming receives XML messages from Kafka
- Spark transforms each message
Well... there are only 2 hard problems in computer science: naming things,
cache invalidation, and off-by-one errors.
The direct stream implementation isn't asking you to commit anything.
It's asking you to provide a starting point for the stream on startup.
Because offset ranges are inclusive
/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L805
After processing read ConsumerRecord, commit expects me to submit not
offset 0 but offset 1...
Kind regards,
Stevo Slavic
On Fri, Jul 24, 2015 at 6:31 PM, Cody Koeninger c...@koeninger.org
wrote:
Well
method (convert each message within this handler) or dstream.foreachRDD.map
(map rdd) or dstream.map.foreachRDD (map dstream) ?
Thank you for your help Cody.
Regards,
Nicolas PHUNG
On Tue, Jul 21, 2015 at 4:53 PM, Cody Koeninger c...@koeninger.org
wrote:
Yeah, I'm referring to that api
Yes, look at KafkaUtils.createRDD
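For illustration, a sketch of a batch re-read of stored offsets with createRDD (topic name and offsets are made up; sc and kafkaParams are assumed to exist):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Re-read exactly the messages whose offsets were recorded for retry:
    // topic, partition, fromOffset (inclusive), untilOffset (exclusive).
    val ranges = Array(OffsetRange("failed-posts", 0, 1200, 1201))
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)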
On Wed, Jul 22, 2015 at 11:17 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Thanks !
I am using spark streaming 1.3 , And if some post fails because of any
reason, I will store the offset of that message in another kafka topic. I
want to read these
, Jul 20, 2015 at 7:08 PM, Cody Koeninger c...@koeninger.org
wrote:
Yeah, in the function you supply for the messageHandler parameter to
createDirectStream, catch the exception and do whatever makes sense for
your application.
On Mon, Jul 20, 2015 at 11:58 AM, Nicolas Phung nicolas.ph
[ https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635179#comment-14635179 ]
Cody Koeninger commented on SPARK-9215:
---
I left some comments in the doc.
Is Chris
,
Nicolas PHUNG
On Thu, Jul 16, 2015 at 7:08 PM, Cody Koeninger c...@koeninger.org
wrote:
Not exactly the same issue, but possibly related:
https://issues.apache.org/jira/browse/KAFKA-1196
On Thu, Jul 16, 2015 at 12:03 PM, Cody Koeninger c...@koeninger.org
wrote:
Well, working backwards
or KafkaUtils.createDirectStream ?
Regards,
Nicolas PHUNG
On Mon, Jul 20, 2015 at 3:46 PM, Cody Koeninger c...@koeninger.org
wrote:
I'd try logging the offsets for each message, see where problems start,
then try using the console consumer starting at those offsets and see if
you can reproduce the problem