I came across the feature in Spark that allows you to schedule different jobs within a single SparkContext. I want to use this in a program where I map my input RDD (from a text source) into a key/value RDD [K, V], then build a composite-key RDD [(K1, K2), V] and a filtered RDD containing only certain values. The rest of the pipeline calls some statistical methods from MLlib on both RDDs, performs a join, and finally writes the result to disk.
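For reference, this is roughly what the pipeline looks like. The input path, the "k1,k2,value" record layout, the filter threshold, and the use of Statistics.colStats are just placeholders for the real logic:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch"))

        // Parse the text input once; a "k1,k2,value" layout is assumed here.
        val parsed = sc.textFile("hdfs:///input/data.txt").map { line =>
          val Array(k1, k2, v) = line.split(",")
          (k1, k2, v.toDouble)
        }.cache()

        // Key/value RDD [K, V] and composite-key RDD [(K1, K2), V].
        val kv          = parsed.map { case (k1, _, v)  => (k1, v) }
        val compositeKv = parsed.map { case (k1, k2, v) => ((k1, k2), v) }

        // Filtered RDD keeping only some specific values (threshold is illustrative).
        val filtered = kv.filter { case (_, v) => v > 100.0 }

        // MLlib summary statistics on the values of both RDDs.
        val statsA = Statistics.colStats(compositeKv.map { case (_, v) => Vectors.dense(v) })
        val statsB = Statistics.colStats(filtered.map { case (_, v) => Vectors.dense(v) })
        println(s"means: ${statsA.mean}, ${statsB.mean}")

        // Join on K1 and externalize the result to disk.
        val joined = compositeKv.map { case ((k1, _), v) => (k1, v) }.join(filtered)
        joined.saveAsTextFile("hdfs:///output/joined")

        sc.stop()
      }
    }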
I am trying to understand how Spark's internal fair scheduler will handle these operations. I read the job scheduling documentation <https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application>, but the concepts of pools, users and tasks left me more confused. What exactly are pools? Are they groups of related tasks, or are they Linux users pooled into a group? And what are "users" in this context? Do they refer to threads, or to something like SQL context queries? I assume this all relates to how tasks are scheduled within a single SparkContext, but the documentation reads as if it were dealing with multiple applications from different clients and user groups. Can someone please clarify this?
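To make the question concrete, here is a minimal sketch of what I understand "scheduling within an application" to mean, based on the linked docs. The pool names (statsPool, joinPool) are my own; I have assumed spark.scheduler.mode=FAIR and that the pools would be defined in a fairscheduler.xml allocation file:

    import org.apache.spark.{SparkConf, SparkContext}

    object FairSchedulingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("fair-scheduling-sketch")
          .set("spark.scheduler.mode", "FAIR")
        val sc = new SparkContext(conf)

        val rdd = sc.parallelize(1 to 1000000)

        // Each thread submits its own jobs; spark.scheduler.pool is a
        // thread-local property, so jobs land in the pool set by the
        // thread that submitted them.
        val t1 = new Thread(new Runnable {
          def run(): Unit = {
            sc.setLocalProperty("spark.scheduler.pool", "statsPool")
            println(rdd.map(_ * 2).sum())            // job scheduled in "statsPool"
          }
        })
        val t2 = new Thread(new Runnable {
          def run(): Unit = {
            sc.setLocalProperty("spark.scheduler.pool", "joinPool")
            println(rdd.filter(_ % 2 == 0).count())  // job scheduled in "joinPool"
          }
        })
        t1.start(); t2.start()
        t1.join(); t2.join()
        sc.stop()
      }
    }

Is a "pool" here simply a label that groups the jobs submitted under it, and is a "user" just whichever thread (or query) sets that property, rather than an OS-level user?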