Re: Regarding the Kafka offset management issue in Direct Stream Approach.

2015-10-26 Thread Cody Koeninger
Questions about Spark's Kafka integration should probably be directed to the Spark user mailing list, not this one; I don't monitor the Kafka mailing lists as closely, for instance. For the direct stream, Spark doesn't keep any state regarding offsets unless you enable checkpointing. Have you read
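A minimal sketch of what enabling checkpointing looks like with the 0.8 direct stream (checkpoint directory, broker address, and topic name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import kafka.serializer.StringDecoder

    val checkpointDir = "/tmp/checkpoint"  // placeholder path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("direct-checkpoint"), Seconds(5))
      ssc.checkpoint(checkpointDir)  // offsets are persisted here along with other state
      val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")  // placeholder broker
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("example-topic"))
      stream.foreachRDD { rdd => () }  // real processing goes here
      ssc
    }

    // On restart, offsets come back from the checkpoint instead of Kafka/ZK
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()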

Re: Regarding the Kafka offset management issue in Direct Stream Approach.

2015-11-06 Thread Cody Koeninger
Questions about Spark-Kafka integration are better directed to the Spark user mailing list. I'm not 100% sure what you're asking. The Spark createDirectStream API will not store any offsets internally unless you enable checkpointing. On Sun, Nov 1, 2015 at 10:26 PM, Charan Ganga Phani

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Cody Koeninger
That looks like the OOM is in the driver, when getting partition metadata to create the direct stream. In that case, executor memory allocation doesn't matter. Allocate more driver memory, or put a profiler on it to see what's taking up heap. On Thu, Sep 24, 2015 at 3:51 PM, Sourabh Chandak
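In practice that means raising the heap at launch, e.g. spark-submit --driver-memory 4g ...; spark.executor.memory has no effect on a driver-side OOM.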

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-25 Thread Cody Koeninger
$.main(SparkSubmit.scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > Thanks, > Sourabh > > On Thu, Sep 24, 2015 at 2:04 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> That looks like the OOM is in the driver, when getting pa

Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cody Koeninger
Yeah, the direct API uses the simple consumer. On Fri, Aug 28, 2015 at 1:32 PM, Cassa L <lcas...@gmail.com> wrote: Hi, I am using below Spark jars with Direct Stream API: spark-streaming-kafka_2.10. When I look at its pom.xml, the Kafka libraries that it's pulling in is
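For what it's worth, the later spark-streaming-kafka-0-10 integration uses the new consumer, which does support SSL through ordinary consumer configs; a sketch, with placeholder addresses and paths:

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9093",                    // placeholder SSL listener
      "security.protocol" -> "SSL",
      "ssl.truststore.location" -> "/path/to/truststore.jks",  // placeholder
      "ssl.truststore.password" -> "changeit",                 // placeholder
      "key.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "group.id" -> "example-group"                            // placeholder
    )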

Re: Kafka Connect and Spark/Storm Comparisons

2015-11-25 Thread Cody Koeninger
Spark's direct stream Kafka integration should take advantage of data locality if you're running Spark executors on the same nodes as Kafka brokers. On Wed, Nov 25, 2015 at 9:50 AM, Dave Ariens wrote: > I just finished reading up on Kafka Connect
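The later 0-10 integration makes this explicit via a location strategy; a sketch (ssc, topics, and kafkaParams assumed to be defined elsewhere):

    import org.apache.spark.streaming.kafka010._

    // PreferBrokers schedules each partition's tasks on the executor running
    // on the same host as that partition's leader, when there is one
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferBrokers,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))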

Re: Couldn't find leaders for Set([TOPICNNAME,0]) when using Apache Spark.

2015-11-20 Thread Cody Koeninger
Spark-specific questions are better directed to the Spark user list. Spark will retry failed tasks automatically up to a configurable number of times. The direct stream will retry failures on the driver up to a configurable number of times. See
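The knobs in question (values here are illustrative):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.task.maxFailures", "8")            // executor task retries
      .set("spark.streaming.kafka.maxRetries", "3")  // driver-side retries in the 0.8 direct stream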

Re: Couldn't find leaders for Set([TOPICNNAME,0]) when using Apache Spark.

2015-11-20 Thread Cody Koeninger
Also, if you actually want to use Kafka, you're much better off with a replication factor greater than 1, so you get leader re-election. On Fri, Nov 20, 2015 at 9:20 AM, Cody Koeninger <c...@koeninger.org> wrote: > Spark specific questions are better directed to the Spark user list.
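That is, create topics with something like kafka-topics.sh --create ... --replication-factor 3, so that a single broker failure triggers leader re-election instead of making the partition unavailable.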

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-10 Thread Cody Koeninger
.@gmail.com> wrote: > >> In order to do anything meaningful with the consumer itself in the rebalance >> callback (e.g. commit offsets), you would need to hold on to the consumer >> reference; admittedly it sounds a bit awkward, but by design we choose to >> not enforce it in

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-10 Thread Cody Koeninger
> callback (e.g. commit offsets), you would need to hold on to the consumer > reference; admittedly it sounds a bit awkward, but by design we choose to > not enforce it in the interface itself. > > Guozhang > > On Wed, Mar 9, 2016 at 3:39 PM, Cody Koeninger <c...@koeninger.org> w

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-14 Thread Cody Koeninger
out some specific change on the consumer API, please > feel free to create a new KIP with the detailed motivation and proposed > modifications. > > Guozhang > > On Fri, Mar 11, 2016 at 12:28 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Is there a KIP or Jira relat

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-15 Thread Cody Koeninger
en the seek since position() will block to get > the new offset. > > -Jason > > On Mon, Mar 14, 2016 at 2:37 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Sorry, by metadata I also meant the equivalent of the old >> OffsetRequest api, which partitionsFor
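The pattern being described, roughly (assuming a consumer built as in the thread-starter message, against a placeholder topic):

    import org.apache.kafka.common.TopicPartition

    val tp = new TopicPartition("example-topic", 0)
    consumer.assign(java.util.Arrays.asList(tp))
    consumer.seekToBeginning(tp)       // 0.9 API: lazily records the seek
    val start = consumer.position(tp)  // blocks until the seek resolves to a real offset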

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
s going to get all the partitions assigned > to itself (i.e. you are only running a single instance). > > Guozhang > > > On Wed, Mar 9, 2016 at 6:22 AM, Cody Koeninger <c...@koeninger.org> wrote: > >> Another unfortunate thing about ConsumerRebalanceListener is t

seekToBeginning doesn't work without auto.offset.reset

2016-03-08 Thread Cody Koeninger
Using the 0.9 consumer, I would like to start consuming at the beginning or end, without specifying auto.offset.reset. This does not seem to be possible: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> conf.getString("kafka.brokers"), "key.deserializer" ->

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
> me know if you have any other ideas. > > Guozhang > > On Wed, Mar 9, 2016 at 12:25 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Yeah, I think I understood what you were saying. What I'm saying is >> that if there were a way to just fetch metadata without

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
@Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) { consumer.seekToBeginning(partitions.toArray(new TopicPartition[0])); } }; consumer.subscribe(topics, listener); On
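Pulling the thread's workaround together, a self-contained sketch in Scala (broker address, group id, and topic are placeholders); the key point is that the listener closes over the consumer reference:

    import java.util.{Collection => JCollection, Properties}
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder
    props.put("group.id", "example-group")            // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)

    val listener = new ConsumerRebalanceListener {
      override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit = {
        // runs inside poll(), after assignment but before fetching,
        // so the seek takes effect without needing auto.offset.reset
        consumer.seekToBeginning(partitions.asScala.toSeq: _*)  // 0.9 varargs signature
      }
      override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit = ()
    }

    consumer.subscribe(List("example-topic").asJava, listener)
    val records = consumer.poll(1000)  // first poll triggers the rebalance and the seek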

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-11 Thread Cody Koeninger
ee >> that new partitions should be consumed automatically). I guess we can >> continue this discussion on the spark list then :-) >> >> Thanks >> Mansi. >> >> On Thu, Mar 10, 2016 at 7:43 AM, Cody Koeninger <c...@koeninger.org> >> wrote:

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
to the consumer. Seems like this makes it unnecessarily awkward to serialize or provide a 0-arg constructor for the listener. On Wed, Mar 9, 2016 at 7:28 AM, Cody Koeninger <c...@koeninger.org> wrote: > I thought about ConsumerRebalanceListener, but seeking to the > beginning any time there's

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
> On Wed, Mar 9, 2016 at 2:11 PM, Guozhang Wang <wangg...@gmail.com> wrote: > > > Filed https://issues.apache.org/jira/browse/KAFKA-3370. > > > > On Wed, Mar 9, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org> > wrote: > > > >> That sounds like

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-14 Thread Cody Koeninger
nsuming > messages. We'd probably want to understand why those are insufficient > before considering new APIs. > > -Jason > > On Mon, Mar 14, 2016 at 12:17 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Regarding the rebalance listener, in the case of the spark >

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
a should not be lost. The system should be as fault tolerant as >> possible. >> >> What's the advantage of using Spark for reading Kafka instead of direct >> Kafka consumers? >> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> >> wrote

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark Streaming helps with aggregation because A. raw Kafka consumers have no built-in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here; I don't have experience with it), and B. it deals with batches, so you can transactionally
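A sketch of the transactional pattern this is pointing at, using the 0-10 integration: write the aggregates and the offset ranges to Postgres in one transaction (the stream, connection URL, and table layout are assumed):

    import java.sql.DriverManager
    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // grab offsets before any transformation; only the direct stream's RDD has them
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val counts = rdd.map(r => (r.key, 1L)).reduceByKey(_ + _).collect()  // assumes keyed records

      val conn = DriverManager.getConnection("jdbc:postgresql://localhost/db")  // placeholder URL
      try {
        conn.setAutoCommit(false)
        // ... upsert counts into a results table ...
        // ... upsert (topic, partition, untilOffset) from offsetRanges into an offsets table ...
        conn.commit()  // results and offsets commit or roll back together
      } finally {
        conn.close()
      }
    }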

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
t, query the >> respective table in Cassandra / Postgres. (select .. from data where user = >> ? and date between ? and ? and some_field = ?) >> >> How will Spark Streaming help w/ aggregation? Couldn't the data be queried >> from Cassandra / Postgres via the Kafka cons

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle ETL failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream feeding Postgres. On Thu, Sep 29, 2016 at 10:04

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark direct stream is just fine for this use case. > But why Postgres and not Cassandra? > Is there anything specific here that I may not be aware of? > > Thanks > Deepak > > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> How are y

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
otherwise updates will be > idempotent but not inserts. > > Data should not be lost. The system should be as fault tolerant as possible. > > What's the advantage of using Spark for reading Kafka instead of direct > Kafka consumers? > > On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger &l

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> Spark standalone is not Yarn… or secure for that matter… ;-) > >> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Spark streaming helps with aggregation because >> >> A. raw kafka consumers have no built in framework

Re: Contiguous Offsets on non-compacted topics

2018-01-24 Thread Cody Koeninger
Can anyone clarify what (other than the known cases of compaction or transactions) could be causing non-contiguous offsets? That sounds like a potential defect, given that I ran billions of messages a day through the Kafka 0.8.x series for years without seeing it. On Tue, Jan 23, 2018 at 3:35 PM,

Re: org.apache.kafka.clients.consumer.OffsetOutOfRangeException

2018-02-12 Thread Cody Koeninger
https://issues.apache.org/jira/browse/SPARK-19680 and https://issues.apache.org/jira/browse/KAFKA-3370 have a good explanation. Verify that it works correctly with auto.offset.reset set to latest, to rule out other issues. Then try providing explicit starting offsets reasonably near the beginning of
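A sketch of supplying explicit starting offsets with the 0-10 integration (ssc, topics, and kafkaParams assumed to be defined; the topic and offset value are placeholders, chosen safely inside the retained range):

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val fromOffsets = Map(new TopicPartition("example-topic", 0) -> 12345L)

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, fromOffsets))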

Re: KafkaUtils.createStream(..) is removed for API

2018-02-19 Thread Cody Koeninger
I can't speak for committers, but my guess is it's more likely for DStreams in general to stop being supported before that particular integration is removed. On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote: > Thanks Ted. > > I see createDirectStream is

Re: Viewing transactional markers in client

2018-08-04 Thread Cody Koeninger
Here's one reason I might want to be able to tell whether a given offset is a transactional marker: https://issues.apache.org/jira/browse/SPARK-24720 Alternatively, is there any efficient way to tell what the offset of the last actual record in a topic-partition is (i.e. like endOffsets)? On Thu,
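For context, the endOffsets call mentioned (consumer here is an assumed KafkaConsumer, topic is a placeholder):

    import org.apache.kafka.common.TopicPartition

    val tp = new TopicPartition("example-topic", 0)
    val end = consumer.endOffsets(java.util.Arrays.asList(tp)).get(tp)
    // With a transactional producer, the slot at end - 1 may be a commit/abort
    // marker rather than a real record -- the gap SPARK-24720 describes.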