There isn't an easy way of guaranteeing exactly-once delivery semantics when producing to Kafka (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-27+-+Conditional+Publish). If there's only one logical consumer of the intermediate state, I wouldn't write it back to Kafka; I'd just keep it in a single Spark job.
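Concretely, the single-job approach just chains the logic components as transformations on one DStream. A rough sketch using the Kafka 0.8 direct stream API; the topic name, broker list and the parse/enrich/score functions are placeholders of my own, not anything from your setup:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SingleJobPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("single-job-pipeline")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("events")

    val raw = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Each "logic component" is just another transformation on the same DStream;
    // nothing is written back to Kafka in between.
    val parsed   = raw.map { case (_, value) => parse(value) }
    val enriched = parsed.map(enrich)
    val scored   = enriched.filter(_.nonEmpty).map(score)

    scored.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach(sink)) // final output only
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Placeholder logic components
  def parse(s: String): String  = s
  def enrich(s: String): String = s
  def score(s: String): String  = s
  def sink(s: String): Unit     = println(s)
}

Everything between the source and the final sink stays inside the one job, so the duplicate-message question only arises at the final output.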
If the intermediate state is useful in its own right (multiple consumers), then sure, write it to Kafka; just be aware of the possibility of duplicate messages.

On Tue, Sep 29, 2015 at 6:42 AM, Arttii <a.topch...@reply.de> wrote:
> Hi,
>
> So I am working on a project where we might end up having a bunch of
> decoupled logic components that have to run inside Spark Streaming. We are
> using Kafka as the source of streaming data.
> My first question is: is it better to chain these logic components together
> by applying transforms to a single RDD, or by transforming and writing back
> to Kafka and consuming that in another stream that applies more logic? The
> benefit of the second approach is that it is more decoupled.
>
> Another question: is it best practice to have one huge Spark Streaming job
> with a bunch of subscriptions and transform chains, or should I group this
> into a bunch of jobs with some logical partitioning?
>
> Any idea what the performance drawbacks would be in either case? I know this
> is a broadish question, but help would be greatly appreciated.
>
> Arti
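For the write-back-to-Kafka option, a rough sketch of publishing an intermediate DStream with the plain new-producer API (org.apache.kafka.clients, needs kafka-clients 0.8.2+); the topic name, broker list and key scheme are placeholders of my own, and the semantics are at-least-once, so a retried batch can produce the same messages again:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

object KafkaWriteBack {
  def writeToKafka(intermediate: DStream[(String, String)]): Unit = {
    intermediate.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092,broker2:9092")
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        partition.foreach { case (key, value) =>
          // Keying by a stable id gives downstream consumers something to de-duplicate on.
          producer.send(new ProducerRecord[String, String]("intermediate-topic", key, value))
        }
        producer.close()
      }
    }
  }
}

In a real job you'd usually reuse the producer across batches (e.g. a lazily initialized singleton per executor) rather than creating one per partition per batch, but the duplicate-message caveat stays the same either way.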