Questions about Spark's Kafka integration should probably be directed to
the Spark user mailing list, not this one. I don't monitor the Kafka mailing
lists as closely, for instance.
For the direct stream, Spark doesn't keep any state regarding offsets,
unless you enable checkpointing. Have you read
Questions about Spark-kafka integration are better directed to the Spark
user mailing list.
I'm not 100% sure what you're asking. The Spark createDirectStream API
will not store any offsets internally unless you enable checkpointing.
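To spell that out: with checkpointing off, offsets live only in your application, so the app has to track and commit them itself. A minimal, stdlib-only sketch of the consumer config stance this implies (all broker/group values here are placeholders, not from the original thread):

```java
import java.util.Properties;

public class DirectStreamStyleConfig {
    // Sketch: with auto-commit disabled, nothing is stored broker-side
    // unless the application commits offsets explicitly, which mirrors
    // how the direct stream keeps no offset state without checkpointing.
    static Properties build(String brokers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers); // placeholder address
        props.setProperty("group.id", groupId);          // placeholder group
        props.setProperty("enable.auto.commit", "false"); // app owns offsets
        return props;
    }

    public static void main(String[] args) {
        Properties p = build("broker1:9092", "example-group");
        System.out.println("enable.auto.commit=" + p.getProperty("enable.auto.commit"));
    }
}
```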
On Sun, Nov 1, 2015 at 10:26 PM, Charan Ganga Phani
That looks like the OOM is in the driver, when getting partition metadata
to create the direct stream. In that case, executor memory allocation
doesn't matter.
Allocate more driver memory, or put a profiler on it to see what's taking
up heap.
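For reference, driver heap is set at submit time; something like the following (4g is just an example, and the class/jar names are placeholders):

```
# more driver heap for fetching partition metadata at startup
spark-submit --driver-memory 4g --class com.example.MyStream my-stream.jar
```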
On Thu, Sep 24, 2015 at 3:51 PM, Sourabh Chandak
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Thanks,
> Sourabh
>
> On Thu, Sep 24, 2015 at 2:04 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> That looks like the OOM is in the driver, when getting partition metadata
>> to create the direct stream.
Yeah, the direct API uses the simple consumer.
On Fri, Aug 28, 2015 at 1:32 PM, Cassa L lcas...@gmail.com wrote:
Hi, I am using the below Spark jar with the direct stream API:
spark-streaming-kafka_2.10
When I look at its pom.xml, the Kafka library that it's pulling in is
Spark's direct stream kafka integration should take advantage of data
locality if you're running Spark executors on the same nodes as Kafka
brokers.
On Wed, Nov 25, 2015 at 9:50 AM, Dave Ariens wrote:
> I just finished reading up on Kafka Connect<
>
Spark specific questions are better directed to the Spark user list.
Spark will retry failed tasks automatically up to a configurable number of
times. The direct stream will retry failures on the driver up to a
configurable number of times.
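The relevant knobs, as far as I know (names and defaults may differ by version, double-check the docs):

```
spark.task.maxFailures=4            # task-level retries on executors
spark.streaming.kafka.maxRetries=1  # driver-side retries for offset requests
```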
See
Also, if you actually want to use Kafka, you're much better off with a
replication factor greater than 1, so you get leader re-election.
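e.g. in the broker config (or per topic at creation time; the values here are just common choices, tune for your cluster):

```
default.replication.factor=3
min.insync.replicas=2
```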
On Fri, Nov 20, 2015 at 9:20 AM, Cody Koeninger <c...@koeninger.org> wrote:
> Spark specific questions are better directed to the Spark user list.
.@gmail.com> wrote:
>
>> In order to do anything meaningful with the consumer itself in the rebalance
>> callback (e.g. commit offsets), you would need to hold on to the consumer
>> reference; admittedly it sounds a bit awkward, but by design we chose to
>> not enforce it in the interface itself.
>>
>> Guozhang
>
> On Wed, Mar 9, 2016 at 3:39 PM, Cody Koeninger <c...@koeninger.org> wrote:
out some specific change on the consumer API, please
> feel free to create a new KIP with the detailed motivation and proposed
> modifications.
>
> Guozhang
>
> On Fri, Mar 11, 2016 at 12:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Is there a KIP or Jira relat
en the seek since position() will block to get
> the new offset.
>
> -Jason
>
> On Mon, Mar 14, 2016 at 2:37 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Sorry, by metadata I also meant the equivalent of the old
>> OffsetRequest api, which partitionsFor
s going to get all the partitions assigned
> to itself (i.e. you are only running a single instance).
>
> Guozhang
>
>
> On Wed, Mar 9, 2016 at 6:22 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Another unfortunate thing about ConsumerRebalanceListener is t
Using the 0.9 consumer, I would like to start consuming at the
beginning or end, without specifying auto.offset.reset.
This does not seem to be possible:
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString("kafka.brokers"),
"key.deserializer" ->
> me know if you have any other ideas.
>
> Guozhang
>
> On Wed, Mar 9, 2016 at 12:25 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Yeah, I think I understood what you were saying. What I'm saying is
>> that if there were a way to just fetch metadata without
> @Override
> public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
>   consumer.seekToBeginning(partitions.toArray(new TopicPartition[0]));
> }
> };
>
> consumer.subscribe(topics, listener);
>
> On
>> agree that new partitions should be consumed automatically). I guess we can
>> continue this discussion on the spark list then :-)
>>
>> Thanks
>> Mansi.
>>
>> On Thu, Mar 10, 2016 at 7:43 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
to the consumer. Seems like this makes it unnecessarily
awkward to serialize or provide a 0 arg constructor for the listener.
On Wed, Mar 9, 2016 at 7:28 AM, Cody Koeninger <c...@koeninger.org> wrote:
> I thought about ConsumerRebalanceListener, but seeking to the
> beginning any time there's
> On Wed, Mar 9, 2016 at 2:11 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Filed https://issues.apache.org/jira/browse/KAFKA-3370.
> >
> > On Wed, Mar 9, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
> >
> >> That sounds like
> consuming
> messages. We'd probably want to understand why those are insufficient
> before considering new APIs.
>
> -Jason
>
> On Mon, Mar 14, 2016 at 12:17 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Regarding the rebalance listener, in the case of the spark
>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org>
>> wrote
Spark streaming helps with aggregation because
A. raw kafka consumers have no built in framework for shuffling
amongst nodes, short of writing into an intermediate topic (I'm not
touching Kafka Streams here, I don't have experience), and
B. it deals with batches, so you can transactionally
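On B, the usual point (my gloss, not Cody's exact code) is that a batch's results and its offset range can commit in a single database transaction, so a replayed batch either fully commits or leaves no trace. A sketch with hypothetical table names; real code would use PreparedStatement parameters, not string concatenation:

```java
import java.util.ArrayList;
import java.util.List;

public class TransactionalBatchSketch {
    // Builds the statements for one micro-batch: results plus the batch's
    // offset range go into ONE transaction. Table/column names are made up.
    static List<String> batchStatements(String topic, int partition,
                                        long fromOffset, long untilOffset,
                                        List<String> rows) {
        List<String> stmts = new ArrayList<>();
        stmts.add("BEGIN");
        for (String row : rows) {
            stmts.add("INSERT INTO results (payload) VALUES ('" + row + "')");
        }
        // Storing the offset range alongside the results is what makes the
        // batch atomic: rows and offsets commit together, or not at all.
        stmts.add(String.format(
            "INSERT INTO kafka_offsets (topic, part, from_off, until_off) "
                + "VALUES ('%s', %d, %d, %d)",
            topic, partition, fromOffset, untilOffset));
        stmts.add("COMMIT");
        return stmts;
    }

    public static void main(String[] args) {
        batchStatements("events", 0, 100L, 102L, List.of("a", "b"))
            .forEach(System.out::println);
    }
}
```

On restart you read the last stored offset range back out of the same database and resume from there.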
t, query the
>> respective table in Cassandra / Postgres. (select .. from data where user =
>> ? and date between ? and ? and some_field = ?)
>>
>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>> from Cassandra / Postgres via the Kafka consumer?
How are you going to handle etl failures? Do you care about lost /
duplicated data? Are your writes idempotent?
Absent any other information about the problem, I'd stay away from
cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
feeding postgres.
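On the idempotence question: one common approach (my sketch, not from the thread) is to key the sink table on the Kafka partition/offset pair, so a redelivery after a task retry overwrites instead of duplicating. Simulated here with a plain map standing in for the table:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdempotentWriteSketch {
    // The (partition, offset) pair acts as the primary key, so writing the
    // same record twice is an upsert, not a duplicate row. In Postgres the
    // equivalent would be INSERT ... ON CONFLICT on that key.
    private final Map<String, String> table = new LinkedHashMap<>();

    public void write(int partition, long offset, String value) {
        table.put(partition + ":" + offset, value); // upsert by natural key
    }

    public int rowCount() { return table.size(); }

    public static void main(String[] args) {
        IdempotentWriteSketch sink = new IdempotentWriteSketch();
        sink.write(0, 100, "a");
        sink.write(0, 101, "b");
        sink.write(0, 100, "a"); // redelivery after a retry: no duplicate
        System.out.println(sink.rowCount()); // prints 2
    }
}
```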
On Thu, Sep 29, 2016 at 10:04
Spark direct stream is just fine for this use case.
> But why postgres and not cassandra?
> Is there anything specific here that i may not be aware?
>
> Thanks
> Deepak
>
> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> How are you going to handle etl failures?
otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Spark standalone is not Yarn… or secure for that matter… ;-)
>
>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Spark streaming helps with aggregation because
>>
>> A. raw kafka consumers have no built in framework for shuffling
Can anyone clarify what (other than the known cases of compaction or
transactions) could be causing non-contiguous offsets?
That sounds like a potential defect, given that I ran billions of
messages a day through kafka 0.8.x series for years without seeing
that.
On Tue, Jan 23, 2018 at 3:35 PM,
https://issues.apache.org/jira/browse/SPARK-19680
and
https://issues.apache.org/jira/browse/KAFKA-3370
has a good explanation.
Verify that it works correctly with auto offset set to latest, to rule
out other issues.
Then try providing explicit starting offsets reasonably near the
beginning of
I can't speak for committers, but my guess is it's more likely for
DStreams in general to stop being supported before that particular
integration is removed.
On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote:
> Thanks Ted.
>
> I see createDirectStream is
Here's one reason I might want to be able to tell whether a given
offset is a transactional marker:
https://issues.apache.org/jira/browse/SPARK-24720
Alternatively, is there any efficient way to tell what the offset of
the last actual record in a topicpartition is (i.e. like endOffsets)
On Thu,