This will be in 0.8.1 and 0.9.0 as the "repartition()" function.
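For reference, a minimal sketch of what that would look like once `repartition()` is available directly on DStreams (hedged: written against the 0.8.x-era API; `host` and `port` are placeholder values, and the example needs a running Spark cluster and socket source to actually do anything):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: assumes Spark Streaming 0.8.1+/0.9.0 with DStream.repartition()
val ssc = new StreamingContext("local[4]", "RepartitionSketch", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Spread each incoming (single-partition) RDD across 10 partitions
// before the heavy lifting, without an explicit transform() call
val words = lines.repartition(10).flatMap(_.split(" "))
```

This replaces the `transform(rdd => rdd.coalesce(10, shuffle = true))` workaround discussed in the thread below with a single call.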
On Tue, Oct 22, 2013 at 11:25 PM, Patrick Wendell <[email protected]> wrote:

> This is something we should add directly to the streaming API rather than
> requiring a transform() call. In fact, it's one of the really nice things
> about Spark Streaming: you can dynamically change the parallelism at any
> point in the stream declaration by coalescing the input stream. For streams
> this is especially useful, because they may be collected on a single node
> but the heavy lifting requires the resources of the entire cluster.
>
> - Patrick
>
> On Tue, Oct 22, 2013 at 12:17 PM, Aaron Davidson <[email protected]> wrote:
>
>> ---------- Forwarded message ----------
>> From: Aaron Davidson <[email protected]>
>> Date: Tue, Oct 22, 2013 at 12:04 PM
>> Subject: Re: Spark Streaming - How to control the parallelism like storm
>> To: [email protected]
>>
>> As Mark said, flatMap can only parallelize into as many partitions as
>> exist in the incoming RDD, and socketTextStream() only produces 1 RDD at a
>> time. However, you can use the RDD.coalesce() method to split one RDD
>> into multiple partitions (excuse the name; it can be used for shrinking or
>> growing the number of partitions), like so:
>>
>> val lines = ssc.socketTextStream(args(1), args(2).toInt)
>> val partitionedLines = lines.transform(rdd => rdd.coalesce(10, shuffle = true))
>> val words = partitionedLines.flatMap(_.split(" "))
>> ...
>>
>> This splits the incoming text stream into 10 partitions, so flatMap can
>> run up to 10x faster, assuming you have that many worker threads (and
>> ignoring the added latency of repartitioning the RDD across your nodes).
>>
>> On Tue, Oct 22, 2013 at 8:21 AM, Mark Hamstra <[email protected]> wrote:
>>
>>> Not separately at the level of `flatMap` and `map`. The number of
>>> partitions in the RDD those operations are working on determines the
>>> potential parallelism. The number of worker cores available determines
>>> how much of that potential can be actualized.
>>>
>>> On Tue, Oct 22, 2013 at 7:24 AM, Ryan Chan <[email protected]> wrote:
>>>
>>>> In Storm, you can control the number of threads with setSpout/setBolt.
>>>> How do you do the same with Spark Streaming?
>>>>
>>>> e.g.
>>>>
>>>> val lines = ssc.socketTextStream(args(1), args(2).toInt)
>>>> val words = lines.flatMap(_.split(" "))
>>>> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
>>>> wordCounts.print()
>>>> ssc.start()
>>>>
>>>> Sounds like I cannot tell Spark how many threads to use with
>>>> `flatMap` and how many to use with `map`, right?
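Putting the pieces of the thread together, here is a self-contained sketch of the transform()-based workaround as a runnable program (hedged: written against the 0.8.x-era StreamingContext constructor; the object name `ParallelWordCount` and the command-line argument layout are illustrative, not from the original thread):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // for reduceByKey on pair DStreams

object ParallelWordCount {
  def main(args: Array[String]): Unit = {
    // args(0) = Spark master URL, args(1) = hostname, args(2) = port
    val ssc = new StreamingContext(args(0), "ParallelWordCount", Seconds(1))
    val lines = ssc.socketTextStream(args(1), args(2).toInt)

    // Workaround from the thread: each batch from socketTextStream arrives
    // as a single-partition RDD, so grow it to 10 partitions with
    // coalesce(shuffle = true) so downstream ops can use up to 10 cores.
    val partitionedLines = lines.transform(rdd => rdd.coalesce(10, shuffle = true))

    val words = partitionedLines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
  }
}
```

Once `repartition()` ships in 0.8.1/0.9.0, the `transform`/`coalesce` line collapses to `lines.repartition(10)`.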
