Re: Submit many spark applications

2018-05-25 Thread yncxcw
hi, 

Please try reducing the default JVM heap size on the machine you use to
submit applications.

For example:
export _JAVA_OPTIONS="-Xmx512M" 

The submitter, which is itself a JVM, does not need to reserve much memory.
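
A minimal sketch of applying the same cap per submission rather than through a
global export, assuming spark-submit is on the PATH. The helper names
`submitter_env` and `submit_all` and the application paths are hypothetical.

```python
import os
import subprocess


def submitter_env(max_heap_mb=512, base_env=None):
    """Return a copy of the environment that caps the submitter JVM's heap."""
    env = dict(os.environ if base_env is None else base_env)
    env["_JAVA_OPTIONS"] = f"-Xmx{max_heap_mb}M"
    return env


def submit_all(app_files, max_heap_mb=512):
    """Start one spark-submit process per application file (hypothetical paths)."""
    env = submitter_env(max_heap_mb)
    # Each submitter is its own JVM, so the cap applies to every one of them.
    return [subprocess.Popen(["spark-submit", app], env=env) for app in app_files]
```

Passing the environment per process keeps the cap away from other Java
programs running on the same machine.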


Wei 





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: cache OS memory and spark usage of it

2018-04-11 Thread yncxcw
hi, Raúl 

(1) & (2) Yes, the OS needs some memory pressure before it releases the page
cache. For example, suppose your machine has 16GB of RAM in total, and you
read an 8GB file and immediately close it. Now the page cache holds the 8GB
of file data. If you then start a program that requests memory from the OS,
the OS will begin releasing the page cache once your requests go beyond the
remaining 8GB.
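
To watch this happen on a Linux machine, one option is to read /proc/meminfo
before and after the file read. The sketch below is only an illustration; the
function names are hypothetical, and it assumes a kernel recent enough to
report MemAvailable.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def page_cache_summary(info):
    """Return (free, cached, available) in GiB from a parsed meminfo dict."""
    return tuple(
        info.get(key, 0) / (1024 * 1024)
        for key in ("MemFree", "Cached", "MemAvailable")
    )


def read_meminfo(path="/proc/meminfo"):
    """Read and summarize the live values (Linux only)."""
    with open(path) as f:
        return page_cache_summary(parse_meminfo(f.read()))
```

After reading a large file, Cached should grow while MemAvailable stays high,
because the kernel counts reclaimable page cache as available.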

(3) I think you can configure your JVM with a maximum heap size of 14GB
(-Xmx14g) and leave 2GB of memory for the OS. This configuration gives you
memory elasticity. The JVM requests more memory from the OS as new objects
are created, but the heap is bounded by 14GB, so it should not cause
swapping. For example, if your application only needs 8GB of memory, the
remaining 8GB can be used for the page cache, improving your I/O performance.
If instead your application needs 14GB of memory, the JVM will force the OS
to release almost all of the page cache. In that situation your I/O
performance may suffer, but you can hold more data (e.g., RDDs) in your
application.
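
The arithmetic behind that trade-off can be sketched as follows. This is a
simplified model with a hypothetical function name; it ignores JVM off-heap
overhead and whatever the OS itself needs out of the remainder.

```python
def memory_outside_heap_gb(total_ram_gb, max_heap_gb, app_heap_need_gb):
    """GB left outside the JVM heap, shared by the OS and its page cache."""
    if max_heap_gb > total_ram_gb:
        raise ValueError("maximum heap exceeds physical RAM; swapping is likely")
    # The heap never grows past -Xmx, however much the application asks for.
    heap_in_use = min(app_heap_need_gb, max_heap_gb)
    return total_ram_gb - heap_in_use


light = memory_outside_heap_gb(16, 14, 8)    # application needs only 8GB
heavy = memory_outside_heap_gb(16, 14, 14)   # application fills the whole heap
```

With the numbers from this thread, `light` is 8 and `heavy` is 2, matching
the two cases described above.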


Wei






Re: cache OS memory and spark usage of it

2018-04-10 Thread yncxcw
hi, Raúl 

First, most of the OS memory cache is used by the page cache, which the OS
uses for caching recent read/write I/O.

I think OS memory cache should be understood from two different
perspectives. From the perspective of user space (e.g., a Spark application),
it is not used, since Spark does not allocate memory from this part of
memory. From the perspective of the OS, however, it is used, because those
memory pages are already allocated for caching I/O pages. For each I/O
request, the OS allocates memory pages to cache the data, expecting that the
cached pages can be reused in the near future. Recall what happens when you
open a large file with vim/emacs: it is pretty slow the first time, but much
faster if you close the file and immediately reopen it, because the file was
placed in the page cache the first time you opened it.

It is hard for Spark to use this part of memory, because it is managed by
the OS and is transparent to applications. The only thing you can do is keep
allocating memory from the OS (e.g., via malloc()) until the OS senses memory
pressure, at which point it will release page cache on its own to satisfy
your allocation. Also note that the memory available to Spark is bounded by
the maximum JVM heap size, so memory requests from your Spark application
are actually handled by the JVM, not directly by the OS.


Hope this answer can help you!


Wei







Re: Spark production scenario

2018-03-08 Thread yncxcw
hi, Passion

I don't know an exact solution. But yes, the port each executor chooses to
communicate with the driver is random. I wonder whether you could give a
node two Ethernet cards, configuring one for the intranet used by Spark and
the other for the WAN, and then connect the rest of the nodes over the
intranet.

Also, I think you should not use the WAN for Spark data transfer, since the
amount of data moved during a shuffle is huge. You need a high-speed switch
for your cluster.
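
On the random ports: this is not something I have tested for this scenario,
but Spark does document the settings spark.driver.host, spark.driver.port,
spark.blockManager.port and spark.port.maxRetries, which can pin the ports to
a known range. A sketch that builds the matching spark-submit arguments; the
function name, the port numbers and the address are hypothetical.

```python
def port_conf_args(driver_host, driver_port=40000, block_manager_port=40100,
                   max_retries=16):
    """Build spark-submit --conf arguments that pin Spark's network ports."""
    conf = {
        "spark.driver.host": driver_host,            # intranet address of the driver
        "spark.driver.port": driver_port,            # random when left unset
        "spark.blockManager.port": block_manager_port,
        "spark.port.maxRetries": max_retries,        # tries port .. port + maxRetries
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical intranet address for the driver node.
example_args = port_conf_args("10.0.0.1")
```

With fixed start ports, a firewall only needs to allow each start port plus
the retry range on the intranet interface.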

Hope this answer helps!


Wei






Re: Data loss in spark job

2018-02-27 Thread yncxcw
hi, 

Please check whether your OS allows memory overcommit. I suspect this is
caused by your OS forbidding memory overcommit and killing a process when
overcommit is detected (the Spark executor being the one chosen). That would
explain why you receive SIGTERM: the executor fails with that signal and all
of its data is lost.

Please check /proc/sys/vm/overcommit_memory and set it accordingly. This
switch has three settings:

0: The Linux kernel is free to overcommit memory (this is the default). A
   heuristic is applied to decide whether enough memory is available.
1: The Linux kernel will always overcommit memory and never check whether
   enough memory is available. This increases the risk of out-of-memory
   situations, but can also help memory-intensive workloads.
2: The Linux kernel will not overcommit memory. The total committed memory
   is limited to swap plus the percentage of physical RAM defined in
   overcommit_ratio.
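
A small sketch for checking the current setting from a script, assuming a
Linux machine; the function names are hypothetical. The second helper
estimates the commit limit that applies under setting 2, ignoring huge pages.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit, never check",
    2: "no overcommit, limited by overcommit_ratio",
}


def describe_overcommit(raw_value):
    """Map the raw contents of overcommit_memory to (mode, description)."""
    mode = int(raw_value.strip())
    return mode, OVERCOMMIT_MODES.get(mode, "unknown mode")


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read and describe the live setting (Linux only)."""
    with open(path) as f:
        return describe_overcommit(f.read())


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Approximate commit limit under mode 2: swap + ratio percent of RAM."""
    return swap_kb + ram_kb * overcommit_ratio // 100
```

If the executors' combined heap reservations exceed that limit under
setting 2, allocations can fail even while physical memory is still free.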

Another way is simply to reduce the JVM heap size by setting a smaller -Xmx,
which decreases the amount of memory the JVM asks the OS to reserve.

Thanks!

Wei






Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread yncxcw
hi, all

I also noticed this problem. The reason is that YARN accounts for each
executor as only 1 vcore, no matter how many cores you configured, because
by default YARN uses memory as the only metric for resource allocation. This
means YARN will pack as many executors onto each node as the node's free
memory allows.
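
A rough model of that packing behaviour, with hypothetical names and numbers.
It ignores executor memory overhead and YARN's rounding to its minimum
allocation size, so treat it as an illustration only.

```python
def executors_per_node(node_mem_gb, node_vcores, exec_mem_gb, exec_cores,
                       count_vcores=False):
    """Estimate how many executors the scheduler places on one node."""
    by_memory = node_mem_gb // exec_mem_gb
    if not count_vcores:
        # Memory-only accounting: CPU requests do not limit placement.
        return by_memory
    # Memory and vcores both have to fit.
    return min(by_memory, node_vcores // exec_cores)


memory_only = executors_per_node(64, 16, 4, 4)                    # 16 executors
with_vcores = executors_per_node(64, 16, 4, 4, count_vcores=True) # 4 executors
```

On this hypothetical node with 64GB and 16 vcores, memory-only accounting
places 16 executors of 4 cores each, which asks for 64 cores on 16 vcores.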

If you want vcores to be accounted for in resource allocation, you can set
the resource calculator to DominantResourceCalculator, as follows:

Property:
  yarn.scheduler.capacity.resource-calculator

Description:
  The ResourceCalculator implementation to be used to compare Resources in
  the scheduler. The default, i.e.
  org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, only uses
  Memory, while DominantResourceCalculator uses Dominant-resource to compare
  multi-dimensional resources such as Memory, CPU etc. A Java
  ResourceCalculator class name is expected.
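
To check which calculator a cluster is configured with, a sketch like the
following can parse capacity-scheduler.xml. The embedded XML shows the
property as it would appear in that file; the function names are
hypothetical.

```python
import xml.etree.ElementTree as ET

PROPERTY = "yarn.scheduler.capacity.resource-calculator"
DEFAULT_CALCULATOR = "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator"

# Example fragment of capacity-scheduler.xml that enables vcore accounting.
SAMPLE_CAPACITY_SCHEDULER_XML = """\
<configuration>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>
</configuration>
"""


def resource_calculator(xml_text):
    """Return the configured calculator class, or the default if it is unset."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == PROPERTY:
            return prop.findtext("value", "").strip()
    return DEFAULT_CALCULATOR


def counts_vcores(xml_text):
    """True when the scheduler compares CPU as well as memory."""
    return resource_calculator(xml_text).endswith("DominantResourceCalculator")
```

Calling `counts_vcores` on the real file's contents tells you whether
executor cores are being counted at all.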


Please also refer to this article:
https://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/


Thanks!

Wei Chen


