Hi, Well as it appears you have 5 entries in your data and 12 cores. The theory is that you run multiple tasks in parallel across multiple cores on a desktop which applies to your case. The statistics is not there to give a meaningful interpretation why Spark decided to put all data in one partition. If an RDD has too many partitions, then task scheduling may take more time than the actual execution time. In summary you just do not have enough statistics to draw a meaningful conclusion.
Try to generate 100,000 rows and run your query and look at the pattern. HTH LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Tue, 16 Mar 2021 at 04:35, Renganathan Mutthiah <renganatha...@gmail.com> wrote: > Hi, > > I have a question with respect to default partitioning in RDD. > > > > > *case class Animal(id:Int, name:String) val myRDD = > session.sparkContext.parallelize( (Array( Animal(1, "Lion"), > Animal(2,"Elephant"), Animal(3,"Jaguar"), Animal(4,"Tiger"), Animal(5, > "Chetah") ) ))Console println myRDD.getNumPartitions * > > I am running the above piece of code in my laptop which has 12 logical > cores. > Hence I see that there are 12 partitions created. > > My understanding is that hash partitioning is used to determine which > object needs to go to which partition. So in this case, the formula would > be: hashCode() % 12 > But when I further examine, I see all the RDDs are put in the last > partition. > > *myRDD.foreachPartition( e => { println("----------"); e.foreach(println) > } )* > > Above code prints the below(first eleven partitions are empty and the last > one has all the objects. The line is separate the partition contents): > ---------- > ---------- > ---------- > ---------- > ---------- > ---------- > ---------- > ---------- > ---------- > ---------- > ---------- > ---------- > Animal(2,Elephant) > Animal(4,Tiger) > Animal(3,Jaguar) > Animal(5,Chetah) > Animal(1,Lion) > > I don't know why this happens. Can you please help. > > Thanks! >