Re: Spark executors resources. Blocking?
I can't speak to Mesos solutions, but for YARN you can define queues in which to run your jobs, and you can customize the amount of resources each queue consumes. When deploying your Spark job, you can specify the queue option to schedule the job to a particular queue. Here are some links for reference:

http://hadoop.apache.org/docs/r2.5.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
http://lucene.472066.n3.nabble.com/Capacity-Scheduler-on-YARN-td4081242.html

From: Luis Guerra
Date: Tuesday, January 13, 2015 at 3:19 AM
To: "Buttler, David", user
Subject: Re: Spark executors resources. Blocking?

Thanks for your answer David,

It is as I thought then. When you write that there could be some approaches to solve this using Yarn or Mesos, can you give any idea about this? Or better yet, is there any site with documentation about this issue? Currently, we are launching our jobs using Yarn, but still we do not know how to properly schedule our jobs to have the highest utilization of our cluster.

Best

On Tue, Jan 13, 2015 at 2:12 AM, Buttler, David wrote:
> Spark has a built-in cluster manager (the Spark standalone cluster), but it
> is not designed for multi-tenancy: either multiple people using the system, or
> multiple tasks sharing resources. It is a first-in, first-out queue of tasks
> where tasks will block until the previous tasks are finished (as you
> described). If you want to have higher utilization of your cluster, then you
> should use either Yarn or Mesos to schedule the system. The same issues will
> come up, but they have a much broader range of approaches that you can take to
> solve the problem.
>
> Dave
>
> From: Luis Guerra [mailto:luispelay...@gmail.com]
> Sent: Monday, January 12, 2015 8:36 AM
> To: user
> Subject: Spark executors resources. Blocking?
>
> Hello all,
>
> I have a naive question regarding how Spark uses the executors in a cluster of
> machines.
> Imagine the scenario in which I do not know the input size of my
> data in execution A, so I set Spark to use 20 nodes (out of my 25, for
> instance). At the same time, I also launch a second execution B, setting Spark
> to use 10 nodes for this.
>
> Assuming a huge input size for execution A, which implies an execution time of
> 30 minutes for example (using all the resources), and a constant execution
> time for B of 10 minutes, then both executions will last for 40 minutes (I
> assume that B cannot be launched until 10 resources are completely available,
> when A finishes).
>
> Now, assuming a very small input size for execution A running for 5 minutes in
> only 2 of the 20 planned resources, I would like execution B to be launched at
> that time, with both executions consuming only 10 minutes (and 12 resources).
> However, as execution A has set Spark to use 20 resources, execution B has to
> wait until A has finished, so the total execution time lasts 15 minutes.
>
> Is this right? If so, how can I solve this kind of scenario? If I am wrong,
> what would be the correct interpretation for this?
>
> Thanks in advance,
>
> Best
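A hedged illustration of the queue setup suggested above: queue names and capacity percentages below are hypothetical examples, not taken from the thread. On the YARN side, queues are defined in capacity-scheduler.xml:

```xml
<!-- capacity-scheduler.xml (fragment); queue names "prod" and "adhoc"
     and the 70/30 split are illustrative assumptions -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,adhoc</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>30</value>
</property>
```

A job is then directed to a particular queue via the spark-submit queue option, e.g. `spark-submit --master yarn --queue adhoc yourjob.py`, so that two concurrent applications draw from separate resource pools rather than blocking each other.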
Re: View executor logs on YARN mode
You can view the logs for the particular containers in the YARN UI if you go to the page for a specific node, and then from the Tools menu on the left, select Local Logs. There should be a userlogs directory which will contain the specific application IDs for each job that you run. Inside the directories for the applications, you can find the specific container logs.

From: lonely Feb
Date: Wednesday, January 14, 2015 at 8:07 AM
To: "user@spark.apache.org"
Subject: View executor logs on YARN mode

Hi all, I sadly found that in YARN mode I cannot view executor logs in the YARN web UI nor in the Spark history web UI. In the YARN web UI I can only view AppMaster logs, and in the Spark history web UI I can only view application metric information. If I want to see whether an executor is doing a full GC, I can only use the "yarn logs" command, which is very unfriendly. Also, the "yarn logs" command requires a containerId option, which I could not find in the YARN web UI either; I need to grep for it in the AppMaster log. I just wonder, how do you handle this situation?
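For reference, the aggregated logs can also be pulled from the command line without knowing container IDs up front; a small sketch (the application ID below is a hypothetical placeholder, and log aggregation must be enabled on the cluster for `yarn logs` to return output):

```shell
# Hypothetical application ID; take the real one from the YARN
# ResourceManager web UI, where it is listed per application
# (unlike container IDs).
APP_ID="application_1421000000000_0001"

# 'yarn logs -applicationId <id>' dumps the aggregated logs of ALL
# containers of the application, so no container ID is needed:
#   yarn logs -applicationId "$APP_ID"
#
# If a single container is of interest, its ID can be grepped out of
# that same dump:
#   yarn logs -applicationId "$APP_ID" | grep -o 'container_[0-9_]*' | sort -u

# Print the base command so this sketch runs without a cluster:
echo "yarn logs -applicationId $APP_ID"
```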
Location of logs in local mode
I'm submitting a script using spark-submit in local mode for testing, and I'm having trouble figuring out where the logs are stored. The documentation indicates that they should be in the work folder in the directory in which Spark lives on my system, but I see no such folder there. I've set the SPARK_LOCAL_DIRS and SPARK_LOG_DIR environment variables in spark-env.sh, but there doesn't seem to be any log output generated in the locations I've specified there either. I'm just using spark-submit with master local; I haven't run any of the standalone cluster scripts, so I'm not sure if there's something I'm missing here as far as a default output location for logging.

Thanks,
Brett
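One likely explanation (a hedged note, not from the thread): the work/ directory and SPARK_LOG_DIR apply to the standalone daemons, while in local mode everything runs inside the single spark-submit JVM, whose log4j output goes to the console unless redirected. A sketch of a conf/log4j.properties that sends it to a file instead; the file path and level are illustrative assumptions:

```
# conf/log4j.properties (sketch) - path and level are illustrative
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark-local.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Alternatively, simply redirecting stderr (`spark-submit --master local myjob.py 2> run.log`) captures the same output without any configuration change.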
How to pass options to KeyConverter using PySpark
I'm running PySpark on YARN, and I'm reading in SequenceFiles for which I have a custom KeyConverter class. My KeyConverter needs to have some configuration options passed to it, but I am unable to find a way to get the options to that class without modifying the Spark source. Is there a currently built-in method by which this can be done? The sequenceFile() method does not take any arguments for the KeyConverter, and the SparkContext doesn't seem to be accessible from my KeyConverter class, so I'm at a loss as to how to accomplish this.

Thanks,
Brett
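A hedged workaround sketch (not an answer from the thread): since the converter runs without access to the SparkContext, one option is to skip the custom KeyConverter entirely, read the keys with the default converters, and do the configurable part of the conversion on the Python side, where options are ordinary function arguments captured in a closure. The parsing logic and delimiter option below are hypothetical stand-ins for whatever the real converter does:

```python
def parse_key(raw_key, delimiter):
    # Hypothetical stand-in for the conversion the custom Java
    # KeyConverter would have performed, now parameterized in Python.
    return tuple(raw_key.split(delimiter))

def read_sequence_file(sc, path, delimiter="\t"):
    # sc: an existing pyspark.SparkContext. Read keys/values with the
    # default converters, then apply the configurable conversion with a
    # plain map(); 'delimiter' is passed through the closure, with no
    # need to reach the converter through Spark's plumbing.
    rdd = sc.sequenceFile(path)
    return rdd.map(lambda kv: (parse_key(kv[0], delimiter), kv[1]))
```

This trades the JVM-side conversion for a Python-side one, which may cost some serialization overhead but keeps all configuration in ordinary Python code.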
Re: Many retries for Python job
According to the web UI, I don't see any executors dying during Stage 2. I looked over the YARN logs and didn't see anything suspicious, but I may not have been looking closely enough. Stage 2 seems to complete just fine; it's just when it enters Stage 3 that the results from the previous stage seem to be missing in many cases and result in FetchFailure errors. I should probably also mention that I have spark.storage.memoryFraction set to 0.2.

From: Sandy Ryza
Date: Friday, November 21, 2014 at 1:41 PM
To: Brett Meyer
Cc: "user@spark.apache.org"
Subject: Re: Many retries for Python job

Hi Brett,

Are you noticing executors dying? Are you able to check the YARN NodeManager logs and see whether YARN is killing them for exceeding memory limits?

-Sandy

On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer wrote:
> I'm running a Python script with spark-submit on top of YARN on an EMR cluster
> with 30 nodes. The script reads in approximately 3.9 TB of data from S3, and
> then does some transformations and filtering, followed by some aggregate
> counts. During Stage 2 of the job, everything looks to complete just fine
> with no executor failures or resubmissions, but when Stage 3 starts up, many
> Stage 2 tasks have to be rerun due to FetchFailure errors. Actually, I
> usually see at least 3-4 retries on Stage 2 before Stage 3 can successfully
> start. The whole application eventually completes, but there is an added
> overhead of an hour or more for all of the retries.
>
> I'm trying to determine why there were FetchFailure exceptions, since anything
> computed in the job that could not fit in the available memory cache should
> by default be spilled to disk for later retrieval.
> I also see some
> "java.net.ConnectException: Connection refused" and "java.io.IOException:
> sendMessageReliably failed without being ACK'd" errors in the logs after a
> CancelledKeyException followed by a ClosedChannelException, but I have no idea
> why the nodes in the EMR cluster would suddenly stop being able to
> communicate.
>
> If anyone has ideas as to why the data needs to be rerun several times in this
> job, please let me know, as I am fairly bewildered by this behavior.
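One hedged avenue for the ACK-timeout symptom (an editorial note, not a confirmed fix from the thread): long GC pauses on an executor can delay shuffle ACKs past the connection timeout, after which the connection is declared dead and the missing shuffle output surfaces as FetchFailure and stage retries. A sketch of spark-defaults.conf settings sometimes adjusted for this in Spark 1.x; the values are illustrative, not tuned recommendations:

```
# spark-defaults.conf (fragment) - illustrative values only
# Tolerate slower ACKs before a connection is declared dead
# ("sendMessageReliably failed without being ACK'd"):
spark.core.connection.ack.wait.timeout   600
# Extra off-heap headroom per executor on YARN (MB), so the NodeManager
# does not kill executors for exceeding their container memory limit:
spark.yarn.executor.memoryOverhead       1024
```

Checking the NodeManager logs for container kills, as suggested above, would help distinguish the GC-pause case from genuine executor loss.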
Many retries for Python job
I'm running a Python script with spark-submit on top of YARN on an EMR cluster with 30 nodes. The script reads in approximately 3.9 TB of data from S3, and then does some transformations and filtering, followed by some aggregate counts. During Stage 2 of the job, everything looks to complete just fine with no executor failures or resubmissions, but when Stage 3 starts up, many Stage 2 tasks have to be rerun due to FetchFailure errors. Actually, I usually see at least 3-4 retries on Stage 2 before Stage 3 can successfully start. The whole application eventually completes, but there is an added overhead of an hour or more for all of the retries.

I'm trying to determine why there were FetchFailure exceptions, since anything computed in the job that could not fit in the available memory cache should by default be spilled to disk for later retrieval. I also see some "java.net.ConnectException: Connection refused" and "java.io.IOException: sendMessageReliably failed without being ACK'd" errors in the logs after a CancelledKeyException followed by a ClosedChannelException, but I have no idea why the nodes in the EMR cluster would suddenly stop being able to communicate.

If anyone has ideas as to why the data needs to be rerun several times in this job, please let me know, as I am fairly bewildered by this behavior.
Failed jobs showing as SUCCEEDED on web UI
I'm running a Python script using spark-submit on YARN in an EMR cluster, and if I have a job that fails due to ExecutorLostFailure, or if I kill the job, it still shows up in the web UI with a FinalStatus of SUCCEEDED. Is this due to PySpark, or is there potentially some other issue with the job failure status not propagating to the logs?
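A hedged workaround sketch (an editorial note, not an answer from the thread): whatever the YARN final status reports, the driver script can at least make its own process exit code reflect failure, so that whatever invoked spark-submit can detect the outcome. The wrapper below is a generic sketch, not Spark-specific API:

```python
import sys

def run_and_exit(job):
    # Run job() and exit the process with a status reflecting the outcome,
    # so a failure is visible to the caller of spark-submit via its exit
    # code even if the YARN FinalStatus still reads SUCCEEDED.
    try:
        job()
    except Exception:
        sys.exit(1)
    sys.exit(0)
```

A driver would then end with `run_and_exit(main)` instead of calling `main()` directly.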