Thanks Cody,

I still have concerns about this.
What do you mean by saying the Spark direct stream doesn't have a default
partitioner? Could you please explain that a bit more?

When I assign 20 cores to 20 Kafka partitions, I expect each core to work
on one partition. Is that correct?

I still couldn't figure out how the RDD will be partitioned after the
mapToPair function. It would be great if you could briefly explain (or point
me to some documentation; I couldn't find any) how the shuffle works with
mapToPair.

Thank you very much.
On Nov 23, 2015 12:26 AM, "Cody Koeninger" <[email protected]> wrote:

> Spark direct stream doesn't have a default partitioner.
>
> If you know that you want to do an operation on keys that are already
> partitioned by kafka, just use mapPartitions or foreachPartition to avoid a
> shuffle.
>
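A rough plain-Java illustration of the suggestion above (my own sketch with made-up names, not Spark API code): each list below stands in for one RDD partition whose keys Kafka has already co-located, the way a function passed to mapPartitions or foreachPartition would see only that partition's records. The per-key sums can then be computed entirely within the partition, with nothing crossing the network:

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerPartitionSum {
    // Called once per partition, mirroring what a mapPartitions-style
    // function would receive: only this partition's records.
    // No data leaves the partition.
    static Map<String, Double> sumPartition(List<Map.Entry<String, Double>> records) {
        Map<String, Double> totals = new HashMap<>();
        for (Map.Entry<String, Double> r : records) {
            totals.merge(r.getKey(), r.getValue(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Pretend Kafka already routed all "IBM" records to this partition.
        List<Map.Entry<String, Double>> partition0 = List.of(
                new AbstractMap.SimpleEntry<>("IBM", 120.5),
                new AbstractMap.SimpleEntry<>("IBM", 121.0));
        System.out.println(sumPartition(partition0));
    }
}
```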
> On Sat, Nov 21, 2015 at 11:46 AM, trung kien <[email protected]> wrote:
>
>> Hi all,
>>
>> I am having problem of understanding how RDD will be partitioned after
>> calling mapToPair function.
>> Could anyone give me more information about partitioning in this function?
>>
>> I have a simple application doing following job:
>>
>> JavaPairInputDStream<String, String> messages =
>>     KafkaUtils.createDirectStream(...);
>>
>> JavaPairDStream<String, Double> stats = messages
>>     .mapToPair(JSON_DECODE)
>>     .reduceByKey(SUM);
>>
>> saveToDB(stats);
>>
>> I set up 2 workers (each dedicating 20 cores) for this job.
>> My Kafka topic has 40 partitions (I want each core to handle one
>> partition), and the messages sent to the queue are partitioned by the same
>> key as the mapToPair function uses.
>> I'm using the default partitioners of both Kafka and Spark.
>>
>> Ideally, I shouldn't see data shuffled between cores in the mapToPair
>> stage, right?
>> However, in my Spark UI, I see that the "Locality Level" for this stage
>> is "ANY", which means data needs to be transferred.
>> Any comments on this?
>>
>> --
>> Thanks
>> Kien
>>
>
>
