I came across the feature in Spark that lets you schedule concurrent jobs
within a single SparkContext. I want to use this feature in a program where
I map my input RDD (from a text source) into a key-value RDD [K,V], then
build a composite-key RDD [(K1,K2),V] and a filtered RDD containing some
specific values. The rest of the pipeline calls some statistical methods
from MLlib on both RDDs, performs a join, and writes the result to disk.
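
For concreteness, here is a stripped-down sketch of the driver logic I have
in mind (the input/output paths, field layout, and filter threshold are just
placeholders for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val sc = new SparkContext(new SparkConf().setAppName("scheduling-question"))

    // split each text line into fields; the (k1, k2, value) layout is assumed
    val rows = sc.textFile("hdfs:///input/data.txt").map(_.split(","))

    val kv          = rows.map(a => (a(0), a(2).toDouble))            // [K,V]
    val compositeKv = rows.map(a => ((a(0), a(1)), a(2).toDouble))    // [(K1,K2),V]
    val filtered    = kv.filter { case (_, v) => v > 100.0 }          // specific values only

    // MLlib summary statistics over the values of both RDDs
    val statsComposite = Statistics.colStats(compositeKv.values.map(v => Vectors.dense(v)))
    val statsFiltered  = Statistics.colStats(filtered.values.map(v => Vectors.dense(v)))
    println(statsComposite.mean + " / " + statsFiltered.mean)

    // join on the simple key and externalize the result
    kv.join(filtered).saveAsTextFile("hdfs:///output/joined")

    sc.stop()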

I am trying to understand how Spark's internal fair scheduler will handle
these operations. I read the job scheduling documentation
<https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application>
but came away more confused about the concepts of pools, users, and tasks.

What exactly are pools? Are they certain 'tasks' that can be grouped
together, or are they Linux users pooled into a group?

What are users in this context? Do they refer to threads, or to something
like SQLContext queries?

I assume this relates to how jobs are scheduled within a single
SparkContext, but reading the documentation makes it sound like we are
dealing with multiple applications with different clients and user groups.
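
To make my current (possibly wrong) understanding concrete: is "scheduling
within an application" just separate threads of my driver program submitting
jobs into named pools against the same SparkContext, roughly like the sketch
below? The pool names here are made up; I understand they can be further
configured in a fairscheduler.xml via spark.scheduler.allocation.file.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-pools-sketch")
      .set("spark.scheduler.mode", "FAIR")   // enable the fair scheduler
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000000)

    // each driver-side thread tags its jobs with a pool via a thread-local property
    def inPool(pool: String)(body: => Unit): Thread = {
      val t = new Thread(new Runnable {
        def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", pool)
          body
        }
      })
      t.start(); t
    }

    val t1 = inPool("stats_pool") { println(data.filter(_ % 2 == 0).count()) }
    val t2 = inPool("join_pool")  { println(data.map(_ * 2).count()) }
    t1.join(); t2.join()

    sc.stop()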

Can someone please clarify this?


