This will be in 0.8.1 and 0.9.0 as the "repartition()" function.
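For reference, a minimal sketch of what that would look like once `repartition()` is available directly on DStreams (hedged: written against the 0.8.x-era API; `host` and `port` are placeholder values, and the example needs a running Spark cluster and socket source to actually do anything):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: assumes Spark Streaming 0.8.1+/0.9.0 with DStream.repartition()
val ssc = new StreamingContext("local[4]", "RepartitionSketch", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Spread each incoming (single-partition) RDD across 10 partitions
// before the heavy lifting, without an explicit transform() call
val words = lines.repartition(10).flatMap(_.split(" "))
```

This replaces the `transform(rdd => rdd.coalesce(10, shuffle = true))` workaround discussed in the thread below with a single call.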
On Tue, Oct 22, 2013 at 11:25 PM, Patrick Wendell <[email protected]> wrote:

> This is something we should add directly to the streaming API rather than
> requiring a transform() call. In fact, it's one of the really nice things
> about Spark Streaming: you can dynamically change the parallelism at any
> point in the stream declaration by coalescing the input stream. For streams
> this is especially useful, because they may be collected on a single node
> but the heavy lifting requires the resources of the entire cluster.
>
> - Patrick
>
> On Tue, Oct 22, 2013 at 12:17 PM, Aaron Davidson <[email protected]> wrote:
>
>> ---------- Forwarded message ----------
>> From: Aaron Davidson <[email protected]>
>> Date: Tue, Oct 22, 2013 at 12:04 PM
>> Subject: Re: Spark Streaming - How to control the parallelism like storm
>> To: [email protected]
>>
>> As Mark said, flatMap can only parallelize into as many partitions as
>> exist in the incoming RDD, and socketTextStream() only produces 1 RDD at a
>> time. However, you can use the RDD.coalesce() method to split one RDD
>> into multiple partitions (excuse the name; it can be used for shrinking or
>> growing the number of partitions), like so:
>>
>> val lines = ssc.socketTextStream(args(1), args(2).toInt)
>> val partitionedLines = lines.transform(rdd => rdd.coalesce(10, shuffle = true))
>> val words = partitionedLines.flatMap(_.split(" "))
>> ...
>>
>> This splits the incoming text stream into 10 partitions, so flatMap can
>> run up to 10x faster, assuming you have that many worker threads (and
>> ignoring the added latency of repartitioning the RDD across your nodes).
>>
>> On Tue, Oct 22, 2013 at 8:21 AM, Mark Hamstra <[email protected]> wrote:
>>
>>> Not separately at the level of `flatMap` and `map`. The number of
>>> partitions in the RDD those operations are working on determines the
>>> potential parallelism. The number of worker cores available determines
>>> how much of that potential can be actualized.
>>>
>>> On Tue, Oct 22, 2013 at 7:24 AM, Ryan Chan <[email protected]> wrote:
>>>
>>>> In Storm, you can control the number of threads with setSpout/setBolt.
>>>> How do you do the same with Spark Streaming?
>>>>
>>>> e.g.
>>>>
>>>> val lines = ssc.socketTextStream(args(1), args(2).toInt)
>>>> val words = lines.flatMap(_.split(" "))
>>>> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
>>>> wordCounts.print()
>>>> ssc.start()
>>>>
>>>> Sounds like I cannot tell Spark how many threads to use with
>>>> `flatMap` and how many to use with `map`, right?
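Putting the pieces of the thread together, here is a self-contained sketch of the transform()-based workaround as a runnable program (hedged: written against the 0.8.x-era StreamingContext constructor; the object name `ParallelWordCount` and the command-line argument layout are illustrative, not from the original thread):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // for reduceByKey on pair DStreams

object ParallelWordCount {
  def main(args: Array[String]): Unit = {
    // args(0) = Spark master URL, args(1) = hostname, args(2) = port
    val ssc = new StreamingContext(args(0), "ParallelWordCount", Seconds(1))
    val lines = ssc.socketTextStream(args(1), args(2).toInt)

    // Workaround from the thread: each batch from socketTextStream arrives
    // as a single-partition RDD, so grow it to 10 partitions with
    // coalesce(shuffle = true) so downstream ops can use up to 10 cores.
    val partitionedLines = lines.transform(rdd => rdd.coalesce(10, shuffle = true))

    val words = partitionedLines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
  }
}
```

Once `repartition()` ships in 0.8.1/0.9.0, the `transform`/`coalesce` line collapses to `lines.repartition(10)`.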
