Dear list,

A quick question about Spark Streaming:
Say I have this stage set up in my Spark Streaming cluster:

    batched TCP stream ==> map(expensive computation) ==> reduceByKey

I know I can set the number of tasks for reduceByKey, but I couldn't find a place to specify the parallelism for the input DStream (the sequence of RDDs generated from the TCP stream). Do I need to explicitly call repartition() to split the input RDDs into more partitions?

If that is the case, what mechanism is used to split the stream? A full random repartition of each (K, V) pair (effectively a shuffle), or something more like a rebalance? And what is the default parallelism level for the input stream?

Thank you so much,
-Mo
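To make the distinction I'm asking about concrete, here is a plain-Python sketch (no Spark involved; the helper names are mine, not any Spark API) of the two splitting behaviors as I understand them:

```python
def hash_partition(pairs, num_partitions):
    """Shuffle-style split: each (K, V) pair goes to partition hash(K) % n,
    so all pairs with the same key land together (HashPartitioner-like)."""
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts


def rebalance(pairs, num_partitions):
    """Rebalance-style split: deal pairs out round-robin, ignoring keys,
    so partition sizes stay even but a key's pairs can be scattered."""
    parts = [[] for _ in range(num_partitions)]
    for i, kv in enumerate(pairs):
        parts[i % num_partitions].append(kv)
    return parts


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
print(hash_partition(pairs, 2))  # same-key pairs grouped together
print(rebalance(pairs, 2))       # even sizes, keys may be split
```

Which of these (if either) does repartition() do on the input DStream?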