Glad you got it worked out. That's cool as long as your use case doesn't actually require e.g. partition 0 to always be scheduled to the same executor across different batches.
On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr> wrote: > So it worked quite well with a coalesce, I was able to find an solution to > my problem : Altough not directly handling the executor a good roundaway > was to assign the desired partition to a specific stream through assign > strategy and coalesce to a single partition then repeat the same process > for the remaining topics on different streams and at the end do a an union > of these streams. > > PS : No shuffle was made during the whole thing since the rdd partitions > were collapsed to a single one > > Le 17 mars 2017 8:04 PM, "Michael Armbrust" <mich...@databricks.com> a > écrit : > >> Another option that would avoid a shuffle would be to use assign and >> coalesce, running two separate streams. >> >> spark.readStream >> .format("kafka") >> .option("kafka.bootstrap.servers", "...") >> .option("assign", """{t0: {"0": xxxx}, t1:{"0": xxxxx}}""") >> .load() >> .coalesce(1) >> .writeStream >> .foreach(... code to write to cassandra ...) >> >> spark.readStream >> .format("kafka") >> .option("kafka.bootstrap.servers", "...") >> .option("assign", """{t0: {"1": xxxx}, t1:{"1": xxxxx}}""") >> .load() >> .coalesce(1) >> .writeStream >> .foreach(... code to write to cassandra ...) >> >> On Fri, Mar 17, 2017 at 7:35 AM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr >> > wrote: >> >>> @Cody : Duly noted. >>> @Michael Ambrust : A repartition is out of the question for our project >>> as it would be a fairly expensive operation. We tried looking into >>> targeting a specific executor so as to avoid this extra cost and directly >>> have well partitioned data after consuming the kafka topics. Also we are >>> using Spark streaming to save to the cassandra DB and try to keep shuffle >>> operations to a strict minimum (at best none). As of now we are not >>> entirely pleased with our current performances, that's why I'm doing a >>> kafka topic sharding POC and getting the executor to handle the specificied >>> partitions is central. >>> ᐧ >>> >>> 2017-03-17 9:14 GMT+01:00 Michael Armbrust <mich...@databricks.com>: >>> >>>> Sorry, typo. Should be a repartition not a groupBy. >>>> >>>> >>>>> spark.readStream >>>>> .format("kafka") >>>>> .option("kafka.bootstrap.servers", "...") >>>>> .option("subscribe", "t0,t1") >>>>> .load() >>>>> .repartition($"partition") >>>>> .writeStream >>>>> .foreach(... code to write to cassandra ...) >>>>> >>>> >>> >>> >>> -- >>> *Mind7 Consulting* >>> >>> Sami Ouassaid | Consultant Big Data | sami.ouassa...@mind7.com >>> __ >>> >>> 64 Rue Taitbout, 75009 Paris >>> >> >>