Re: Different Kafka createDirectStream implementations

2015-09-08 Thread Cody Koeninger
to start the stream. On Tue, Sep 8, 2015 at 1:19 PM, Dan Dutrow <dan.dut...@gmail.com> wrote: > Yes, but one implementation checks those flags and the other one doesn't. > I would think they should be consistent. > > On Tue, Sep 8, 2015 at 1:32 PM Cody Koeninger <c...

Re: [spark-streaming] New directStream API reads topic's partitions sequentially. Why?

2015-09-04 Thread Cody Koeninger
The direct stream just makes a spark partition per kafka partition, so if those partitions are not getting evenly distributed among executors, something else is probably wrong with your configuration. If you replace the kafka stream with a dummy rdd created with e.g. sc.parallelize, what happens?
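A minimal sketch of that experiment, assuming a StreamingContext `ssc` and the same downstream work as the real job (the 12-partition count mirroring the kafka topic is an assumption):

    // Hedged sketch: swap the kafka stream for a queueStream-backed dummy RDD
    // to see whether uneven executor distribution persists without kafka.
    import scala.collection.mutable
    val rdd = ssc.sparkContext.parallelize(1 to 1000000, 12) // 12 partitions, like the topic
    val dummyStream = ssc.queueStream(mutable.Queue(rdd))
    dummyStream.foreachRDD(r => println(r.count())) // check task distribution in the UI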

Re: repartition on direct kafka stream

2015-09-04 Thread Cody Koeninger
The answer already given is correct. You shouldn't doubt this, because you've already seen the shuffle data change accordingly. On Fri, Sep 4, 2015 at 11:25 AM, Shushant Arora wrote: > But Kafka stream has underlyng RDD which consists of offsets reanges only- > so

Re: Kafka Direct Stream join without data shuffle

2015-09-02 Thread Cody Koeninger
No, there isn't a partitioner for KafkaRDD (KafkaRDD may not even be a pair rdd, for instance). It sounds to me like if it's a self-join, you should be able to do it in a single mapPartition operation. On Wed, Sep 2, 2015 at 3:06 PM, Chen Song wrote: > I have a stream
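A rough illustration of the single-mapPartitions self-join; `keyOf` is a hypothetical key extractor, and this assumes the records to be joined already sit in the same partition:

    // Hedged sketch: per-partition self-join, no shuffle.
    val joined = stream.mapPartitions { iter =>
      val records = iter.toList
      records.groupBy(keyOf).valuesIterator.flatMap { group =>
        for (a <- group; b <- group) yield (a, b) // all pairs sharing a key
      }
    }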

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-02 Thread Cody Koeninger
Yeah, in general if you're changing the jar you can't recover the checkpoint. If you're just changing parameters, why not externalize those in a configuration file so your jar doesn't change? I tend to stick even my app-specific parameters in an external spark config so everything is in one
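A sketch of that setup, assuming values are supplied via spark-submit's --properties-file or --conf (the `spark.myapp.*` key is a made-up example):

    import org.apache.spark.SparkConf

    // App-specific settings live in the external spark config, not the jar.
    val conf = new SparkConf()
    val topic = conf.get("spark.myapp.topic", "default-topic") // hypothetical key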

[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-09-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726076#comment-14726076 ] Cody Koeninger commented on SPARK-10320: You would supply a function, similar to the way

[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-09-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725719#comment-14725719 ] Cody Koeninger commented on SPARK-10320: So if you're changing topics in an event handler, it's

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
20 batches. > Will that be an issue? > And is there any setting in directkafkastream to enforce not to call > further computes() method after a threshold of scheduling queue size say > (50 batches). Once queue size comes back to less than threshold call compute > and enqueue the

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
do this in compute method since compute would have been called at > current batch queue time - but condition is set at previous batch run time. > > > On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> It's at the time compute() gets called, which should b

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
It's at the time compute() gets called, which should be near the time the batch should have been queued. On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora wrote: > Hi > > In spark streaming 1.3 with kafka- when does driver bring latest offsets > of this run - at start of

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
f offsets > being checkpointed at end of each batch. > > Will it be possible then to reset the offset. > > On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> No, if you start arbitrarily messing around with offset ranges after >> compute is

Re: spark streaming 1.3 kafka topic error

2015-08-31 Thread Cody Koeninger
t Can I have infinite retries instead of > killing the app? What should be the value of retries (-1) will work or > something else ? > > On Thu, Aug 27, 2015 at 6:46 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> Your kafka broker died or you otherwise had a rebalan

[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724074#comment-14724074 ] Cody Koeninger commented on SPARK-10320: " If you restart the job and specify a new o

[jira] [Updated] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-10320: --- Summary: Kafka Support new topic subscriptions without requiring restart of the streaming

[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724113#comment-14724113 ] Cody Koeninger commented on SPARK-10320: It's possible this might be solvable with a user

Re: Feedback: Feature request

2015-08-28 Thread Cody Koeninger
I wrote some code for this a while back, pretty sure it didn't need access to anything private in the decision tree / random forest model. If people want it added to the api I can put together a PR. I think it's important to have separately parseable operators / operands though. E.g

Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cody Koeninger
Yeah, the direct api uses the simple consumer On Fri, Aug 28, 2015 at 1:32 PM, Cassa L lcas...@gmail.com wrote: Hi I am using below Spark jars with Direct Stream API. spark-streaming-kafka_2.10 When I look at its pom.xml, Kafka libraries that its pulling in is

Re: Adding Kafka topics to a running streaming context

2015-08-27 Thread Cody Koeninger
If you all can say a little more about what your requirements are, maybe we can get a jira together. I think the easiest way to deal with this currently is to start a new job before stopping the old one, which should prevent latency problems. On Thu, Aug 27, 2015 at 9:24 AM, Sudarshan Kadambi

Re: spark streaming 1.3 kafka topic error

2015-08-27 Thread Cody Koeninger
didn't work for me. rdd.mapPartitions(partitionOfRecords => { DBConnectionInit() val results = partitionOfRecords.map(..) DBConnection.commit() }) Best regards, Ahmed Atef Nawwar Data Management Big Data Consultant On Thu, Aug 27, 2015 at 4:16 PM, Cody Koeninger c

Re: spark streaming 1.3 kafka buffer size

2015-08-27 Thread Cody Koeninger
. On Wed, Aug 26, 2015 at 9:32 PM, Cody Koeninger c...@koeninger.org wrote: see http://kafka.apache.org/documentation.html#consumerconfigs fetch.message.max.bytes in the kafka params passed to the constructor On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora shushantaror...@gmail.com wrote

Re: Writing test case for spark streaming checkpointing

2015-08-27 Thread Cody Koeninger
Kill the job in the middle of a batch, look at the worker logs to see which offsets were being processed, verify the messages for those offsets are read when you start the job back up On Thu, Aug 27, 2015 at 10:14 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi! I have enables check

[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717342#comment-14717342 ] Cody Koeninger commented on SPARK-10320: As I said on the list, the best way

Re: Commit DB Transaction for each partition

2015-08-27 Thread Cody Koeninger
there is solution to force action. I can not use foreachPartition because i need reuse the new RDD after some maps. On Thu, Aug 27, 2015 at 5:11 PM, Cody Koeninger c...@koeninger.org wrote: Map is lazy. You need an actual action, or nothing will happen. Use foreachPartition, or do

Re: Commit DB Transaction for each partition

2015-08-27 Thread Cody Koeninger
= partitionOfRecords.map(..) DBConnection.commit() results.foreach(row => {}) results }) On Thu, Aug 27, 2015 at 10:18 PM, Cody Koeninger c...@koeninger.org wrote: You need to return an iterator from the closure you provide to mapPartitions On Thu, Aug 27, 2015 at 1:42 PM, Ahmed Nawar ahmed.na
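A sketch of the corrected shape; `DbConnection` and `transform` are hypothetical stand-ins:

    // Hedged sketch: the closure must return an Iterator, and an action is
    // still needed before any of this runs.
    val results = rdd.mapPartitions { partitionOfRecords =>
      val conn = DbConnection.open()                        // hypothetical
      val out = partitionOfRecords.map(transform).toVector  // materialize before commit
      conn.commit()
      out.iterator
    }
    results.count() // action that triggers execution; later maps can reuse `results`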

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Cody Koeninger
On Wed, Aug 26, 2015 at 7:12 AM, Cody Koeninger c...@koeninger.org wrote: The first thing that stands out to me is createOnError = true Are you sure the checkpoint is actually loading, as opposed to failing and starting the job anyway? There should be info lines that look like INFO

Re: JDBC Streams

2015-08-26 Thread Cody Koeninger
On Wed, Aug 26, 2015 at 10:51 AM, Cody Koeninger c...@koeninger.org wrote: If your data only changes every few days, why not restart the job every few days, and just broadcast the data? Or you can keep a local per-jvm cache with an expiry (e.g. guava cache) to avoid many mysql reads

Re: spark streaming 1.3 kafka buffer size

2015-08-26 Thread Cody Koeninger
see http://kafka.apache.org/documentation.html#consumerconfigs fetch.message.max.bytes in the kafka params passed to the constructor On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora shushantaror...@gmail.com wrote: whats the default buffer in spark streaming 1.3 for kafka messages. Say In
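For reference, a sketch of passing that setting through the kafka params to the Spark 1.3 direct stream (broker address and topic are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",                // placeholder
      "fetch.message.max.bytes" -> (8 * 1024 * 1024).toString) // raise the per-fetch cap
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))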

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Cody Koeninger
) } } ssc } On Tue, Aug 25, 2015 at 12:07 PM, Cody Koeninger c...@koeninger.org wrote: Sounds like something's not set up right... can you post a minimal code example that reproduces the issue? On Tue, Aug 25, 2015 at 1:40 PM, Susan Zhang suchenz...@gmail.com wrote: Yeah. All

Re: Custom Offset Management

2015-08-26 Thread Cody Koeninger
That argument takes a function from MessageAndMetadata to whatever you want your stream to contain. See https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala#L57 On Wed, Aug 26, 2015 at 7:55 AM, Deepesh Maheshwari
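Roughly, that variant looks like this; the starting offsets and the tuple built by the handler are placeholder examples:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L) // placeholder
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, Long, String)](
      ssc, kafkaParams, fromOffsets,
      // messageHandler: keep whatever metadata the stream should carry
      (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.offset, mmd.message))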

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Does the first batch after restart contain all the messages received while the job was down? On Tue, Aug 25, 2015 at 12:53 PM, suchenzang suchenz...@gmail.com wrote: Hello, I'm using direct spark streaming (from kafka) with checkpointing, and everything works well until a restart. When I

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Are you actually losing messages then? On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang suchenz...@gmail.com wrote: No; first batch only contains messages received after the second job starts (messages come in at a steady rate of about 400/second). On Tue, Aug 25, 2015 at 11:07 AM, Cody

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Sounds like something's not set up right... can you post a minimal code example that reproduces the issue? On Tue, Aug 25, 2015 at 1:40 PM, Susan Zhang suchenz...@gmail.com wrote: Yeah. All messages are lost while the streaming job was down. On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger c

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread Cody Koeninger
It doesn't matter if shuffling occurs. Just update ZK from the driver, inside the foreachRDD, after all your dynamodb updates are done. Since you're just doing it for monitoring purposes, that should be fine. On Mon, Aug 24, 2015 at 12:11 PM, suchenzang suchenz...@gmail.com wrote: Forgot to
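In outline, with `updateZk` as a hypothetical helper that writes to ZooKeeper:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... dynamodb updates for this batch ...
      ranges.foreach { r =>
        updateZk(r.topic, r.partition, r.untilOffset) // hypothetical, monitoring only
      }
    }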

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread Cody Koeninger
I'd start off by trying to simplify that closure - you don't need the transform step, or currOffsetRanges to be scoped outside of it. Just do everything in foreachRDD. Likewise, it looks like zkClient is also scoped outside of the closure passed to foreachRDD, i.e. you have zkClient = new

Re: How to parse multiple event types using Kafka

2015-08-23 Thread Cody Koeninger
Each spark partition will contain messages only from a single kafka topic/partition. Use HasOffsetRanges to tell which kafka topic/partition it's from. See the docs http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Sun, Aug 23, 2015 at 10:56 AM, Spark Enthusiast
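The documented pattern, sketched; the cast must happen on the stream's own RDD, before any shuffle, since spark partition i corresponds to offsetRanges(i):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val tagged = stream.transform { rdd =>
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        iter.map(record => (offsets(i).topic, record)) // tag each message with its topic
      }
    }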

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Cody Koeninger
happen that you are spending too much time on disk io etc. On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote: Sounds like that's happening consistently, not an occasional network problem? Look at the Kafka broker logs Make sure you've configured the correct kafka broker

Re: spark streaming 1.3 kafka error

2015-08-21 Thread Cody Koeninger
Sounds like that's happening consistently, not an occasional network problem? Look at the Kafka broker logs Make sure you've configured the correct kafka broker hosts / ports (note that direct stream does not use zookeeper host / port). Make sure that host / port is reachable from your driver

Re: Kafka Spark Partition Mapping

2015-08-20 Thread Cody Koeninger
In general you cannot guarantee which node an RDD will be processed on. The preferred location for a kafkardd is the kafka leader for that partition, if they're deployed on the same machine. If you want to try to override that behavior, the method is getPreferredLocations But even in that case,

Re: spark kafka partitioning

2015-08-20 Thread Cody Koeninger
I'm not clear on your question, can you rephrase it? Also, are you talking about createStream or createDirectStream? On Thu, Aug 20, 2015 at 9:48 PM, Gaurav Agarwal gaurav130...@gmail.com wrote: Hello Regarding Spark Streaming and Kafka Partitioning When i send message on kafka topic with

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Cody Koeninger
, Shushant Arora shushantaror...@gmail.com wrote: But KafkaRDD[K, V, U, T, R] is not subclass of RDD[R] as per java generic inheritance is not supported so derived class cannot return different generic typed subclass from overridden method. On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger c

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Cody Koeninger
the same error did not come while extending DirectKafkaInputDStream from InputDStream ? Since new return type Option[KafkaRDD[K, V, U, T, R]] is not subclass of Option[RDD[T]] so it should have been failed? On Tue, Aug 18, 2015 at 7:28 PM, Cody Koeninger c...@koeninger.org wrote

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Cody Koeninger
[][] but its expecting scala.Option<org.apache.spark.rdd.RDD<byte[][]>> from derived class. Is there something wrong with code? On Mon, Aug 17, 2015 at 7:08 PM, Cody Koeninger c...@koeninger.org wrote: Look at the definitions of the java-specific KafkaUtils.createDirectStream methods (the ones

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Cody Koeninger
looking forward to it :) On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger c...@koeninger.org wrote: The solution you found is also in the docs: http://spark.apache.org/docs/latest/streaming-kafka-integration.html Java uses an atomic reference because Java doesn't allow you to close over non

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Cody Koeninger
The solution you found is also in the docs: http://spark.apache.org/docs/latest/streaming-kafka-integration.html Java uses an atomic reference because Java doesn't allow you to close over non-final references. I'm not clear on your other question. On Tue, Aug 18, 2015 at 3:43 AM, Petr Novak

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Cody Koeninger
I wouldn't expect a kafka producer to be serializable at all... among other things, it has a background thread On Tue, Aug 18, 2015 at 4:55 PM, Shenghua(Daniel) Wan wansheng...@gmail.com wrote: Hi, Did anyone see java.util.ConcurrentModificationException when using broadcast variables? I

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-17 Thread Cody Koeninger
instead of Function1 ? On Thu, Aug 13, 2015 at 12:16 AM, Cody Koeninger c...@koeninger.org wrote: I'm not aware of an existing api per se, but you could create your own subclass of the DStream that returns None for compute() under certain conditions. On Wed, Aug 12, 2015 at 1:03 PM

[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697098#comment-14697098 ] Cody Koeninger commented on SPARK-9947: --- You already have access to offsets and can

[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697166#comment-14697166 ] Cody Koeninger commented on SPARK-9947: --- You can't re-use checkpoint data across

[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697193#comment-14697193 ] Cody Koeninger commented on SPARK-9947: --- Didn't you already say that you were saving

[jira] [Commented] (SPARK-6249) Get Kafka offsets from consumer group in ZK when using direct stream

2015-08-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697188#comment-14697188 ] Cody Koeninger commented on SPARK-6249: --- If you want an api that has imprecise

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-14 Thread Cody Koeninger
stopped it cannot be restarted, but I also haven't found a way to stop the app programmatically. The batch duration will probably be around 1-10 seconds. I think this is small enough to not make it a batch job? Thanks again On Thu, Aug 13, 2015 at 10:15 PM, Cody Koeninger c...@koeninger.org

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Cody Koeninger
Use your email client to send a message to the mailing list from the email address you used to subscribe? The message you just sent reached the list On Fri, Aug 14, 2015 at 9:36 AM, dutrow dan.dut...@gmail.com wrote: How do I get beyond the This post has NOT been accepted by the mailing list

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Cody Koeninger
I don't entirely agree with that assessment. Not paying for extra cores to run receivers was about as important as delivery semantics, as far as motivations for the api. As I said in the jira tickets on the topic, if you want to use the direct api and save offsets to ZK, you can. The right way

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Cody Koeninger
You'll resume and re-process the rdd that didn't finish On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Our additional question on checkpointing is basically the logistics of it -- At which point does the data get written into checkpointing? Is it written

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-13 Thread Cody Koeninger
The current kafka stream implementation assumes the set of topics doesn't change during operation. You could either take a crack at writing a subclass that does what you need; stop/start; or if your batch duration isn't too small, you could run it as a series of RDDs (using the existing

Re: spark.streaming.maxRatePerPartition parameter: what are the benefits?

2015-08-13 Thread Cody Koeninger
It's just to limit the maximum number of records a given executor needs to deal with in a given batch. Typical usage would be if you're starting a stream from the beginning of a kafka log, or after a long downtime, and don't want ALL of the messages in the first batch. On Thu, Aug 13, 2015 at
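Set on the spark conf, e.g. to cap a catch-up batch:

    // At most 10000 messages per second from each kafka partition.
    conf.set("spark.streaming.kafka.maxRatePerPartition", "10000")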

Re: Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Cody Koeninger
Access the offsets using HasOffsetRanges, save them in your datastore, provide them as the fromOffsets argument when starting the stream. See https://github.com/koeninger/kafka-exactly-once On Thu, Aug 13, 2015 at 3:53 PM, Stephen Durfey sjdur...@gmail.com wrote: When deploying a spark
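In outline; `saveToStore` and `loadFromStore` are hypothetical datastore calls:

    import kafka.common.TopicAndPartition
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd, then persist the ranges transactionally ...
      saveToStore(ranges) // hypothetical
    }
    // On restart, rebuild the starting point (untilOffset is exclusive,
    // so it is exactly the next offset to read):
    val fromOffsets: Map[TopicAndPartition, Long] =
      loadFromStore().map { r => // hypothetical
        TopicAndPartition(r.topic, r.partition) -> r.untilOffset
      }.toMap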

[jira] [Commented] (SPARK-6249) Get Kafka offsets from consumer group in ZK when using direct stream

2015-08-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695939#comment-14695939 ] Cody Koeninger commented on SPARK-6249: --- Regarding streaming stats, those should

[jira] [Commented] (SPARK-6249) Get Kafka offsets from consumer group in ZK when using direct stream

2015-08-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695929#comment-14695929 ] Cody Koeninger commented on SPARK-6249: --- https://github.com/koeninger/kafka-exactly

Re: avoid duplicate due to executor failure in spark stream

2015-08-12 Thread Cody Koeninger
will just ignore already processed events by accessing counter of failed task. Is it directly possible to access accumulator per task basis without writing to hdfs or hbase. On Tue, Aug 11, 2015 at 3:15 AM, Cody Koeninger c...@koeninger.org wrote: http://spark.apache.org/docs/latest

[jira] [Commented] (SPARK-9780) In case of invalid initialization of KafkaDirectStream, NPE is thrown

2015-08-11 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681886#comment-14681886 ] Cody Koeninger commented on SPARK-9780: --- Makes sense, traveling currently but I'll

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
That looks like it's during recovery from a checkpoint, so it'd be driver memory not executor memory. How big is the checkpoint directory that you're trying to restore from? On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: We're getting the below error.

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
the original checkpointing directory :( Thanks for the clarification on spark.driver.memory, I'll keep testing (at 2g things seem OK for now). On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org wrote: That looks like it's during recovery from a checkpoint, so it'd be driver

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Cody Koeninger
There's no long-running receiver pushing blocks of messages, so blockInterval isn't relevant. Batch interval is what matters. On Mon, Aug 10, 2015 at 12:52 PM, allonsy luke1...@gmail.com wrote: Hi everyone, I recently started using the new Kafka direct approach. Now, as far as I

Re: avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations https://www.youtube.com/watch?v=fXnNEq1v3VA On Mon, Aug 10, 2015 at 4:32 PM, Shushant

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
tolerance. Why does Spark persist whole RDD's of data? Shouldn't it be sufficient to just persist the offsets, to know where to resume from? Thanks. On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org wrote: You need to keep a certain number of rdds around for checkpointing

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
of these costs? On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger c...@koeninger.org wrote: The rdd is indeed defined by mostly just the offsets / topic partitions. On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: You need to keep a certain number of rdds

Re: spark-kafka directAPI vs receivers based API

2015-08-10 Thread Cody Koeninger
For direct stream questions: https://github.com/koeninger/kafka-exactly-once Yes, it is used in production. For general spark streaming question: http://spark.apache.org/docs/latest/streaming-programming-guide.html On Mon, Aug 10, 2015 at 7:51 AM, Mohit Durgapal durgapalmo...@gmail.com

[jira] [Commented] (SPARK-9476) Kafka stream loses leader after 2h of operation

2015-08-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679380#comment-14679380 ] Cody Koeninger commented on SPARK-9476: --- The stacktraces you posted don't contain

Re: spark streaming max receiver rate doubts

2015-08-04 Thread Cody Koeninger
Those jobs will still be created for each valid time, they just may not have many messages in them On Mon, Aug 3, 2015 at 11:11 PM, Shushant Arora shushantaror...@gmail.com wrote: 1.In spark 1.3(Non receiver) - If my batch interval is 1 sec and I don't set

Re: No Twitter Input from Kafka to Spark Streaming

2015-08-04 Thread Cody Koeninger
Have you tried using the console consumer to see if anything is actually getting published to that topic? On Tue, Aug 4, 2015 at 11:45 AM, narendra narencs...@gmail.com wrote: My application takes Twitter4j tweets and publishes those to a topic in Kafka. Spark Streaming subscribes to that

Re: spark streaming program failed on Spark 1.4.1

2015-08-03 Thread Cody Koeninger
Just to be clear, did you rebuild your job against spark 1.4.1 as well as upgrading the cluster? On Mon, Aug 3, 2015 at 8:36 AM, Netwaver wanglong_...@163.com wrote: Hi All, I have a spark streaming + kafka program written by Scala, it works well on Spark 1.3.1, but after I migrate

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Cody Koeninger
Show us the relevant code On Fri, Jul 31, 2015 at 12:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in

[jira] [Commented] (SPARK-9476) Kafka stream loses leader after 2h of operation

2015-07-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649123#comment-14649123 ] Cody Koeninger commented on SPARK-9476: --- Those all look like info or warn level log

[jira] [Created] (SPARK-9475) Consistent hadoop config for external/*

2015-07-30 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-9475: - Summary: Consistent hadoop config for external/* Key: SPARK-9475 URL: https://issues.apache.org/jira/browse/SPARK-9475 Project: Spark Issue Type: Sub-task

[jira] [Created] (SPARK-9473) Consistent hadoop config for SQL

2015-07-30 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-9473: - Summary: Consistent hadoop config for SQL Key: SPARK-9473 URL: https://issues.apache.org/jira/browse/SPARK-9473 Project: Spark Issue Type: Sub-task

[jira] [Created] (SPARK-9472) Consistent hadoop config for streaming

2015-07-30 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-9472: - Summary: Consistent hadoop config for streaming Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task

[jira] [Created] (SPARK-9474) Consistent hadoop config for core

2015-07-30 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-9474: - Summary: Consistent hadoop config for core Key: SPARK-9474 URL: https://issues.apache.org/jira/browse/SPARK-9474 Project: Spark Issue Type: Sub-task

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
Just so I'm clear, the difference in timing you're talking about is this: 15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at MetricsSpark.scala:67, took 60.391761 s 15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at MetricsSpark.scala:67, took 0.531323 s Are

Re: Upgrade of Spark-Streaming application

2015-07-30 Thread Cody Koeninger
You can't use checkpoints across code upgrades. That may or may not change in the future, but for now that's a limitation of spark checkpoints (regardless of whether you're using Kafka). Some options: - Start up the new job on a different cluster, then kill the old job once it's caught up to

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
. On Thu, Jul 30, 2015 at 9:29 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I have three topics with one partition each topic. So each jobs run about one topics. 2015-07-30 16:20 GMT+02:00 Cody Koeninger c...@koeninger.org: Just so I'm clear, the difference in timing you're talking about

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Cody Koeninger
is valid and it has data I can see data in it using another kafka consumer. On Jul 30, 2015 7:31 AM, Cody Koeninger c...@koeninger.org wrote: The last time someone brought this up on the mailing list, the issue actually was that the topic(s) didn't exist in Kafka at the time the spark job

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Cody Koeninger
The last time someone brought this up on the mailing list, the issue actually was that the topic(s) didn't exist in Kafka at the time the spark job was running. On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das t...@databricks.com wrote: There is a known issue that Kafka cannot return leader

[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646043#comment-14646043 ] Cody Koeninger commented on SPARK-9434: --- Dmitry, the nabble link you posted

[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646113#comment-14646113 ] Cody Koeninger commented on SPARK-9434: --- If you want to submit a PR with doc changes

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-29 Thread Cody Koeninger
@Ashwin you don't need to append the topic to your data if you're using the direct stream. You can get the topic from the offset range, see http://spark.apache.org/docs/latest/streaming-kafka-integration.html (search for offsetRange) If you're using the receiver based stream, you'll need to

Re: spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Cody Koeninger
You don't have to use some other package in order to get access to the offsets. Shushant, have you read the available documentation at http://spark.apache.org/docs/latest/streaming-kafka-integration.html https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md or watched

Re: Checkpoints in SparkStreaming

2015-07-28 Thread Cody Koeninger
Yes, you need to follow the documentation. Configure your stream, including the transformations made to it, inside the getOrCreate function. On Tue, Jul 28, 2015 at 3:14 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I'm using SparkStreaming and I want to configure checkpoint to manage
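The shape the docs call for, roughly; `conf` and `checkpointDir` are assumed to be defined elsewhere:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)
      // define the stream AND every transformation on it in here
      ssc
    }
    // Recovers from the checkpoint if one exists, else builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)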

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-28 Thread Cody Koeninger
24, 2015 at 3:32 PM, Cody Koeninger c...@koeninger.org wrote: It's really a question of whether you need access to the MessageAndMetadata, or just the key / value from the message. If you just need the key/value, dstream map is fine. In your case, since you need to be able to control

Re: Best practice for transforming and storing from Spark to Mongo/HDFS

2015-07-25 Thread Cody Koeninger
Use foreachPartition and batch the writes On Sat, Jul 25, 2015 at 9:14 AM, nib...@free.fr wrote: Hello, I am new user of Spark, and need to know what could be the best practice to do the following scenario : - Spark Streaming receives XML messages from Kafka - Spark transforms each message
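Sketched with a hypothetical `MongoWriter`; the batch size of 500 is arbitrary:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val writer = MongoWriter.open()           // hypothetical, one connection per partition
        partition.grouped(500).foreach { batch => // write in batches, not per record
          writer.insertMany(batch)
        }
        writer.close()
      }
    }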

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Cody Koeninger
Well... there are only 2 hard problems in computer science: naming things, cache invalidation, and off-by-one errors. The direct stream implementation isn't asking you to commit anything. It's asking you to provide a starting point for the stream on startup. Because offset ranges are inclusive

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Cody Koeninger
/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L805 After processing read ConsumerRecord, commit expects me to submit not offset 0 but offset 1... Kind regards, Stevo Slavic On Fri, Jul 24, 2015 at 6:31 PM, Cody Koeninger c...@koeninger.org wrote: Well

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-24 Thread Cody Koeninger
method (convert each message within this handler) or dstream.foreachRDD.map (map rdd) or dstream.map.foreachRDD (map dstream) ? Thank you for your help Cody. Regards, Nicolas PHUNG On Tue, Jul 21, 2015 at 4:53 PM, Cody Koeninger c...@koeninger.org wrote: Yeah, I'm referring to that api

Re: user threads in executors

2015-07-22 Thread Cody Koeninger
Yes, look at KafkaUtils.createRDD On Wed, Jul 22, 2015 at 11:17 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks ! I am using spark streaming 1.3 , And if some post fails because of any reason, I will store the offset of that message in another kafka topic. I want to read these
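For the replay use case, roughly; topic name and offsets are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Re-read a specific slice of the retry topic as a batch RDD.
    val retries = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams,
      Array(OffsetRange("retry-topic", 0, startOffset, endOffset)))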

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-21 Thread Cody Koeninger
, Jul 20, 2015 at 7:08 PM, Cody Koeninger c...@koeninger.org wrote: Yeah, in the function you supply for the messageHandler parameter to createDirectStream, catch the exception and do whatever makes sense for your application. On Mon, Jul 20, 2015 at 11:58 AM, Nicolas Phung nicolas.ph

[jira] [Commented] (SPARK-9215) Implement WAL-free Kinesis receiver that give at-least once guarantee

2015-07-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635179#comment-14635179 ] Cody Koeninger commented on SPARK-9215: --- I left some comments in the doc. Is Chris

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-20 Thread Cody Koeninger
, Nicolas PHUNG On Thu, Jul 16, 2015 at 7:08 PM, Cody Koeninger c...@koeninger.org wrote: Not exactly the same issue, but possibly related: https://issues.apache.org/jira/browse/KAFKA-1196 On Thu, Jul 16, 2015 at 12:03 PM, Cody Koeninger c...@koeninger.org wrote: Well, working backwards

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-20 Thread Cody Koeninger
or KafkaUtils.createDirectStream ? Regards, Nicolas PHUNG On Mon, Jul 20, 2015 at 3:46 PM, Cody Koeninger c...@koeninger.org wrote: I'd try logging the offsets for each message, see where problems start, then try using the console consumer starting at those offsets and see if you can reproduce the problem
