Neither of your code examples invoke a repartitioning. Add in a repartition command.
On Sat, Jun 24, 2017, 11:53 AM Vikash Pareek <vikash.par...@infoobjects.com> wrote: > Hi Vadim, > > Thank you for your response. > > I would like to know how partitioner choose the key, If we look at my > example then following question arises: > 1. In case of rdd1, hash partitioning should calculate hashcode of key > (i.e. *"aa"* in this case), so *all records should go to single partition* > instead of uniform distribution? > 2. In case of rdd2, there is no key value pair so how hash partitoning > going to work i.e. *what is the key* to calculate hashcode? > > > > Best Regards, > > > [image: InfoObjects Inc.] <http://www.infoobjects.com/> > Vikash Pareek > Team Lead *InfoObjects Inc.* > Big Data Analytics > > m: +91 8800206898 a: E5, Jhalana Institutionall Area, Jaipur, Rajasthan > 302004 > w: www.linkedin.com/in/pvikash e: vikash.par...@infoobjects.com > > > > > On Fri, Jun 23, 2017 at 10:38 PM, Vadim Semenov < > vadim.seme...@datadoghq.com> wrote: > >> This is the code that chooses the partition for a key: >> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88 >> >> it's basically `math.abs(key.hashCode % numberOfPartitions)` >> >> On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek < >> vikash.par...@infoobjects.com> wrote: >> >>> I am trying to understand how spark partitoing works. >>> >>> To understand this I have following piece of code on spark 1.6 >>> >>> def countByPartition1(rdd: RDD[(String, Int)]) = { >>> rdd.mapPartitions(iter => Iterator(iter.length)) >>> } >>> def countByPartition2(rdd: RDD[String]) = { >>> rdd.mapPartitions(iter => Iterator(iter.length)) >>> } >>> >>> //RDDs Creation >>> val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), >>> ("aa", >>> 1)), 8) >>> countByPartition(rdd1).collect() >>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1) >>> >>> val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8) >>> countByPartition(rdd2).collect() >>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1) >>> >>> In both the cases data is distributed uniformaly. >>> I do have following questions on the basis of above observation: >>> >>> 1. In case of rdd1, hash partitioning should calculate hashcode of key >>> (i.e. "aa" in this case), so all records should go to single partition >>> instead of uniform distribution? >>> 2. In case of rdd2, there is no key value pair so how hash partitoning >>> going to work i.e. what is the key to calculate hashcode? >>> >>> I have followed @zero323 answer but not getting answer of these. >>> >>> https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work >>> >>> >>> >>> >>> ----- >>> >>> __Vikash Pareek >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-HashPartitioner-distribute-data-in-Spark-tp28785.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >>> --------------------------------------------------------------------- >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>> >>> >> >