Re: Spark executors resources. Blocking?

2015-01-14 Thread Brett Meyer
I can't speak to Mesos solutions, but for YARN you can define queues in
which to run your jobs, and you can customize the amount of resources each
queue consumes.  When deploying your Spark job, you can specify the --queue
option to schedule the job onto a particular queue.  Here are some links for
reference:

http://hadoop.apache.org/docs/r2.5.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
http://lucene.472066.n3.nabble.com/Capacity-Scheduler-on-YARN-td4081242.html
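
For example, a minimal sketch of submitting to a named queue (the queue name
"analytics" here is just a placeholder for whatever queue you define in
capacity-scheduler.xml):

    spark-submit --master yarn-cluster --queue analytics my_job.py

The capacity and maximum-capacity of each queue are configured in
capacity-scheduler.xml, which the first link above describes.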



From:  Luis Guerra 
Date:  Tuesday, January 13, 2015 at 3:19 AM
To:  "Buttler, David" , user 
Subject:  Re: Spark executors resources. Blocking?

Thanks for your answer, David.

It is as I thought, then. When you write that there are approaches to solve
this using Yarn or Mesos, can you give any pointers? Or better yet, is there
any documentation about this issue? Currently we launch our jobs using Yarn,
but we still do not know how to schedule them properly to get the highest
utilization of our cluster.

Best

On Tue, Jan 13, 2015 at 2:12 AM, Buttler, David  wrote:
> Spark has a built-in cluster manager (the Spark standalone cluster), but it
> is not designed for multi-tenancy: either multiple people using the system, or
> multiple tasks sharing resources.  It is a first in, first out queue of tasks
> where tasks will block until the previous tasks are finished (as you
> described).  If you want to have higher utilization of your cluster, then you
> should use either Yarn or Mesos to schedule the system.  The same issues will
> come up, but they have a much broader range of approaches that you can take to
> solve the problem.
>  
> Dave
>  
> From: Luis Guerra [mailto:luispelay...@gmail.com]
> Sent: Monday, January 12, 2015 8:36 AM
> To: user
> Subject: Spark executors resources. Blocking?
> 
>  
> 
> Hello all,
> 
>  
> 
> I have a naive question regarding how Spark uses the executors in a cluster of
> machines. Imagine a scenario in which I do not know the input size of my data
> for execution A, so I set Spark to use 20 nodes (out of my 25, for instance).
> At the same time, I also launch a second execution B, setting Spark to use 10
> nodes for it.
> 
>  
> 
> Assuming a huge input size for execution A, which implies an execution time of,
> for example, 30 minutes (using all the resources), and a constant execution
> time for B of 10 minutes, both executions together will last 40 minutes (I
> assume that B cannot be launched until 10 resources are completely available,
> i.e. when A finishes).
> 
>  
> 
> Now, assuming a very small input size for execution A, running for 5 minutes on
> only 2 of the 20 planned resources, I would like execution B to be launched at
> that time, so that both executions together take only 10 minutes (and 12
> resources). However, since execution A has told Spark to use 20 resources,
> execution B has to wait until A has finished, so the total execution time is 15
> minutes.
> 
>  
> 
> Is this right? If so, how can I handle this kind of scenario? If I am wrong,
> what would be the correct interpretation?
> 
>  
> 
> Thanks in advance,
> 
>  
> 
> Best







Re: View executor logs on YARN mode

2015-01-14 Thread Brett Meyer
You can view the logs for the particular containers on the YARN UI if you go
to the page for a specific node, and then from the Tools menu on the left,
select Local Logs.  There should be a userlogs directory which will contain
the specific application ids for each job that you run.  Inside the
directories for the applications, you can find the specific container logs.
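
For reference, a rough sketch of where those files live on disk and how to pull
them from the command line (the ids below are placeholders, and the exact base
directory depends on yarn.nodemanager.log-dirs in your configuration):

    # per-node container logs (the default log dir ends in "userlogs")
    <yarn.nodemanager.log-dirs>/application_<appId>/container_<containerId>/stderr

    # with log aggregation enabled, after the application finishes
    yarn logs -applicationId <applicationId>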


From:  lonely Feb 
Date:  Wednesday, January 14, 2015 at 8:07 AM
To:  "user@spark.apache.org" 
Subject:  View executor logs on YARN mode

Hi all, I sadly found that in YARN mode I cannot view executor logs on the YARN
web UI or on the Spark history web UI. On the YARN web UI I can only view the
AppMaster logs, and on the Spark history web UI I can only view application
metric information. If I want to see whether an executor is doing full GCs, I
can only use the "yarn logs" command, which is very unfriendly. By the way, the
"yarn logs" command requires a containerID option which I could not find on the
YARN web UI either; I have to grep for it in the AppMaster log.

I just wonder: how do you handle this situation?






Location of logs in local mode

2015-01-06 Thread Brett Meyer
I'm submitting a script using spark-submit in local mode for testing, and
I'm having trouble figuring out where the logs are stored.  The
documentation indicates that they should be in the work folder in the
directory in which Spark lives on my system, but I see no such folder there.
I've set the SPARK_LOCAL_DIRS and SPARK_LOG_DIR environment variables in
spark-env.sh, but there doesn't seem to be any log output generated in the
locations I've specified there either.  I'm just using spark-submit with
--master local; I haven't run any of the standalone cluster scripts, so I'm
not sure if there's something I'm missing here as far as a default output
location for logging.
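
For reference, this is roughly the setup, with the real paths and script name
replaced by placeholders:

    # spark-env.sh
    export SPARK_LOCAL_DIRS=/path/to/scratch
    export SPARK_LOG_DIR=/path/to/logs

    spark-submit --master local my_script.py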

Thanks,
Brett








How to pass options to KeyConverter using PySpark

2014-12-29 Thread Brett Meyer
I'm running PySpark on YARN, and I'm reading in SequenceFiles for which I
have a custom KeyConverter class.  My KeyConverter needs to have some
configuration options passed to it, but I am unable to find a way to get the
options to that class without modifying the Spark source.  Is there a
built-in way to do this?  The sequenceFile() method does not take any
arguments that get passed along to the KeyConverter, and the SparkContext
doesn't seem to be accessible from my KeyConverter class, so I'm at a loss
as to how to accomplish this.
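
For context, the read looks roughly like this; the path and the converter class
name are placeholders for the real ones:

    # sc is the already-created SparkContext
    # minimal sketch of the current call: the keyConverter parameter only
    # accepts a class name, so there is nowhere to hand it extra options
    rdd = sc.sequenceFile(
        "hdfs:///path/to/input",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.BytesWritable",
        keyConverter="com.example.MyKeyConverter")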

Thanks,
Brett






Re: Many retries for Python job

2014-11-21 Thread Brett Meyer
According to the web UI I don't see any executors dying during Stage 2.  I
looked over the YARN logs and didn't see anything suspicious, but I may not
have been looking closely enough.  Stage 2 seems to complete just fine; it's
just when it enters Stage 3 that the results from the previous stage seem to
be missing in many cases, resulting in FetchFailure errors.  I should
probably also mention that I have spark.storage.memoryFraction set to
0.2.
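
(For reference, that fraction can be set either with --conf on the spark-submit
command line or in the driver's SparkConf; a minimal sketch of the latter, with
the application name made up:)

    from pyspark import SparkConf, SparkContext

    # give the block cache 20% of executor heap instead of the default 60%
    conf = (SparkConf()
            .setAppName("my-job")
            .set("spark.storage.memoryFraction", "0.2"))
    sc = SparkContext(conf=conf)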

From:  Sandy Ryza 
Date:  Friday, November 21, 2014 at 1:41 PM
To:  Brett Meyer 
Cc:  "user@spark.apache.org" 
Subject:  Re: Many retries for Python job

Hi Brett, 

Are you noticing executors dying?  Are you able to check the YARN
NodeManager logs and see whether YARN is killing them for exceeding memory
limits?

-Sandy

On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer wrote:
> I'm running a Python script with spark-submit on top of YARN on an EMR cluster
> with 30 nodes.  The script reads in approximately 3.9 TB of data from S3, and
> then does some transformations and filtering, followed by some aggregate
> counts.  During Stage 2 of the job, everything looks to complete just fine
> with no executor failures or resubmissions, but when Stage 3 starts up, many
> Stage 2 tasks have to be rerun due to FetchFailure errors.  Actually, I
> usually see at least 3-4 retries on Stage 2 before Stage 3 can successfully
> start.  The whole application eventually completes, but there is an addition
> of about 1+ hour overhead for all of the retries.
> 
> I'm trying to determine why there were FetchFailure exceptions, since anything
> computed in the job that could not fit in the available memory cache should be
> by default spilled to disk for further retrieval.  I also see some
> "java.net.ConnectException: Connection refused" and "java.io.IOException:
> sendMessageReliably failed without being ACK'd" errors in the logs after a
> CancelledKeyException followed by a ClosedChannelException, but I have no idea
> why the nodes in the EMR cluster would suddenly stop being able to
> communicate.
> 
> If anyone has ideas as to why the data needs to be rerun several times in this
> job, please let me know as I am fairly bewildered about this behavior.







Many retries for Python job

2014-11-21 Thread Brett Meyer
I'm running a Python script with spark-submit on top of YARN on an EMR
cluster with 30 nodes.  The script reads in approximately 3.9 TB of data
from S3, and then does some transformations and filtering, followed by some
aggregate counts.  During Stage 2 of the job, everything looks to complete
just fine with no executor failures or resubmissions, but when Stage 3
starts up, many Stage 2 tasks have to be rerun due to FetchFailure errors.
Actually, I usually see at least 3-4 retries on Stage 2 before Stage 3 can
successfully start.  The whole application eventually completes, but there
is an addition of about 1+ hour overhead for all of the retries.

I'm trying to determine why there were FetchFailure exceptions, since
anything computed in the job that could not fit in the available memory
cache should be by default spilled to disk for further retrieval.  I also
see some "java.net.ConnectException: Connection refused" and
"java.io.IOException: sendMessageReliably failed without being ACK'd" errors
in the logs after a CancelledKeyException followed by a
ClosedChannelException, but I have no idea why the nodes in the EMR cluster
would suddenly stop being able to communicate.

If anyone has ideas as to why the data needs to be rerun several times in
this job, please let me know as I am fairly bewildered about this behavior.






Failed jobs showing as SUCCEEDED on web UI

2014-11-11 Thread Brett Meyer
I'm running a Python script using spark-submit on YARN in an EMR cluster,
and if I have a job that fails due to ExecutorLostFailure or if I kill the
job, it still shows up on the web UI with a FinalStatus of SUCCEEDED.  Is
this due to PySpark, or is there potentially some other issue with the job
failure status not propagating to the logs?



