Re: [Spark Streaming+Kafka][How-to]

Cody Koeninger Wed, 22 Mar 2017 07:49:19 -0700

Glad you got it worked out.  That's cool as long as your use case doesn't
actually require e.g. partition 0 to always be scheduled to the same
executor across different batches.


On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr>
wrote:

> So it worked quite well with a coalesce, I was able to find an solution to
> my problem : Altough not directly handling the executor a good roundaway
> was to assign the desired partition to a specific stream through assign
> strategy and coalesce to a single partition then repeat the same process
> for the remaining topics on different streams and at the end do a an union
> of these streams.
>
> PS : No shuffle was made during the whole thing since the rdd partitions
> were collapsed to a single one
>
> Le 17 mars 2017 8:04 PM, "Michael Armbrust" <mich...@databricks.com> a
> écrit :
>
>> Another option that would avoid a shuffle would be to use assign and
>> coalesce, running two separate streams.
>>
>> spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "...")
>>   .option("assign", """{t0: {"0": xxxx}, t1:{"0": xxxxx}}""")
>>   .load()
>>   .coalesce(1)
>>   .writeStream
>>   .foreach(... code to write to cassandra ...)
>>
>> spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "...")
>>   .option("assign", """{t0: {"1": xxxx}, t1:{"1": xxxxx}}""")
>>   .load()
>>   .coalesce(1)
>>   .writeStream
>>   .foreach(... code to write to cassandra ...)
>>
>> On Fri, Mar 17, 2017 at 7:35 AM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr
>> > wrote:
>>
>>> @Cody : Duly noted.
>>> @Michael Ambrust : A repartition is out of the question for our project
>>> as it would be a fairly expensive operation. We tried looking into
>>> targeting a specific executor so as to avoid this extra cost and directly
>>> have well partitioned data after consuming the kafka topics. Also we are
>>> using Spark streaming to save to the cassandra DB and try to keep shuffle
>>> operations to a strict minimum (at best none). As of now we are not
>>> entirely pleased with our current performances, that's why I'm doing a
>>> kafka topic sharding POC and getting the executor to handle the specificied
>>> partitions is central.
>>> ᐧ
>>>
>>> 2017-03-17 9:14 GMT+01:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>>> Sorry, typo.  Should be a repartition not a groupBy.
>>>>
>>>>
>>>>> spark.readStream
>>>>>   .format("kafka")
>>>>>   .option("kafka.bootstrap.servers", "...")
>>>>>   .option("subscribe", "t0,t1")
>>>>>   .load()
>>>>>   .repartition($"partition")
>>>>>   .writeStream
>>>>>   .foreach(... code to write to cassandra ...)
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Mind7 Consulting*
>>>
>>> Sami Ouassaid | Consultant Big Data | sami.ouassa...@mind7.com
>>> __
>>>
>>> 64 Rue Taitbout, 75009 Paris
>>>
>>
>>

Re: [Spark Streaming+Kafka][How-to]

Reply via email to