OK, you can see that process 10603 (Worker) is running as the worker/slave, connected to the master at spark://ES01:7077, with its web UI on port 8081 (--webui-port 8081), which you can access through a browser. You also have process 12420 running as SparkSubmit. That is the JVM you submitted for this job:

12420 ? Sl 18:47 java -cp
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit
--master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2
*--executor-memory 4G --num-executors 1 --total-executor-cores 1*
/opt/flowSpark/sparkStream/ForAsk01.py

So I don't really see any issue here.

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 11 May 2016 at 06:25, 李明伟 <kramer2...@126.com> wrote:

> [root@ES01 test]# jps
> 10409 Master
> 12578 CoarseGrainedExecutorBackend
> 24089 NameNode
> 17705 Jps
> 24184 DataNode
> 10603 Worker
> 12420 SparkSubmit
>
> [root@ES01 test]# ps -awx | grep -i spark | grep java
> 10409 ? Sl 1:52 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
> --ip ES01 --port 7077 --webui-port 8080
> 10603 ? Sl 6:50 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
> --webui-port 8081 spark://ES01:7077
> 12420 ? Sl 18:47 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit
> --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2
> --executor-memory 4G --num-executors 1 --total-executor-cores 1
> /opt/flowSpark/sparkStream/ForAsk01.py
> 12578 ? Sl 38:18 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname
> 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url
> spark://Worker@10.79.148.184:52660
>
> At 2016-05-11 13:18:10, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>
> What does jps return?
>
> jps
> 16738 ResourceManager
> 14786 Worker
> 17059 JobHistoryServer
> 12421 QuorumPeerMain
> 9061 RunJar
> 9286 RunJar
> 5190 SparkSubmit
> 16806 NodeManager
> 16264 DataNode
> 16138 NameNode
> 16430 SecondaryNameNode
> 22036 SparkSubmit
> 9557 Jps
> 13240 Kafka
> 2522 Master
>
> and
>
> ps -awx | grep -i spark | grep java
>
> On 11 May 2016 at 03:01, 李明伟 <kramer2...@126.com> wrote:
>
>> Hi Mich
>>
>> From the ps command, I can find four processes. 10409 is the master and
>> 10603 is the worker. 12420 is the driver program and 12578 should be the
>> executor (worker). Am I right?
>> So you mean that 12420 is actually running both the driver and the worker
>> role?
>>
>> [root@ES01 ~]# ps -awx | grep spark | grep java
>> 10409 ? Sl 1:40 java -cp
>> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
>> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
>> --ip ES01 --port 7077 --webui-port 8080
>> 10603 ? Sl 6:00 java -cp
>> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
>> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
>> --webui-port 8081 spark://ES01:7077
>> 12420 ? Sl 6:34 java -cp
>> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
>> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit
>> --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2
>> --executor-memory 4G --num-executors 1 --total-executor-cores 1
>> /opt/flowSpark/sparkStream/ForAsk01.py
>> 12578 ? Sl 13:16 java -cp
>> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
>> -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m
>> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
>> spark://CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname
>> 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url
>> spark://Worker@10.79.148.184:52660
>>
>> At 2016-05-11 09:03:21, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hm,
>>
>> This is standalone mode.
>>
>> When you are running Spark in standalone mode, you only have one worker
>> that lives within the driver JVM process that you start when you start
>> spark-shell or spark-submit.
>>
>> However, since the driver-memory setting encapsulates the JVM, you will need
>> to set the amount of *driver memory* for any non-default value *before
>> starting the JVM, by providing the new value:*
>>
>> ${SPARK_HOME}/bin/spark-submit --driver-memory 5g
>>
>> On 11 May 2016 at 01:22, 李明伟 <kramer2...@126.com> wrote:
>>
>>> I actually provided them in the submit command here:
>>>
>>> nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory
>>> 4G --num-executors 1 --total-executor-cores 1 --conf
>>> "spark.storage.memoryFraction=0.2" ./mycode.py 1>a.log 2>b.log &
>>>
>>> At 2016-05-10 21:19:06, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Hi Mingwei,
>>>
>>> In your Spark conf settings, what are you providing for these parameters?
>>> *Are you capping them?*
>>>
>>> For example
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> // Cap executor memory and total cores in the application's own conf
>>> val conf = new SparkConf().
>>>   setAppName("AppName").
>>>   setMaster("local[2]").
>>>   set("spark.executor.memory", "4G").
>>>   set("spark.cores.max", "2").
>>>   set("spark.driver.allowMultipleContexts", "true")
>>> val sc = new SparkContext(conf)
>>>
>>> I assume you are running in standalone mode, so each worker (aka
>>> slave) grabs all the available cores and allocates the remaining memory on
>>> each host.
>>>
>>> Do not provide new values for these parameters (that is, do not override
>>> them) in
>>>
>>> *${SPARK_HOME}/bin/spark-submit --*
>>>
>>> HTH
>>>
>>> On 10 May 2016 at 03:12, 李明伟 <kramer2...@126.com> wrote:
>>>
>>>> Hi Mich
>>>>
>>>> I added some more info (the spark-env.sh settings and top command
>>>> output) in that thread. Can you help to check, please?
>>>>
>>>> Regards
>>>> Mingwei
>>>>
>>>> At 2016-05-09 23:45:19, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> I had a look at the thread.
>>>>
>>>> This is what you have, which I gather is a standalone box, in other words
>>>> one worker node:
>>>>
>>>> bin/spark-submit --master spark://ES01:7077 --executor-memory 4G
>>>> --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log
>>>>
>>>> But what I don't understand is why it is using 80% of your RAM as opposed
>>>> to 25% of it (4GB/16GB), right?
>>>>
>>>> Where else have you set up these parameters, for example in
>>>> $SPARK_HOME/conf/spark-env.sh?
>>>>
>>>> Can you send the output of /usr/bin/free and top?
>>>>
>>>> HTH
>>>>
>>>> On 9 May 2016 at 16:19, 李明伟 <kramer2...@126.com> wrote:
>>>>
>>>>> Thanks for all the information, guys.
>>>>>
>>>>> I wrote some code to do the test. It does not use a window, so it only
>>>>> calculates data for each batch interval. I set the interval to 30 seconds
>>>>> and also reduced the size of the data to about 30 000 lines of CSV.
>>>>> That means my code should do its calculation on 30 000 lines of CSV in
>>>>> 30 seconds. I think it is not a very heavy workload, but my Spark
>>>>> Streaming code still crashes.
>>>>>
>>>>> I sent another post to the user list here:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
>>>>>
>>>>> Is it possible for you to have a look, please? Much appreciated.
>>>>>
>>>>> At 2016-05-09 17:49:22, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>>>>
>>>>> Please see the inline comments.
>>>>>
>>>>> On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com>
>>>>> wrote:
>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> So if I create a Spark Streaming job then
>>>>>>
>>>>>> 1. The streams will always need to be cached? They cannot be stored
>>>>>> in persistent storage?
>>>>>
>>>>> You don't need to cache the stream explicitly if you don't have a
>>>>> specific requirement; Spark will do it for you depending on the
>>>>> streaming source (Kafka or socket).
>>>>>
>>>>>> 2. The cached stream data will be distributed among all nodes of
>>>>>> Spark, among the executors
>>>>>> 3. As I understand it, each Spark worker node has one executor that
>>>>>> includes a cache. So the streaming data is distributed among these
>>>>>> worker node caches. For example, if I have 4 worker nodes, each cache
>>>>>> will have a quarter of the data (this assumes that the cache size
>>>>>> among worker nodes is the same.)
>>>>>
>>>>> Ideally, it will be distributed evenly across the executors; this is
>>>>> also a target for tuning. Normally it depends on several conditions like
>>>>> receiver distribution and partition distribution.
>>>>>
>>>>>> The issue arises if the amount of streaming data does not fit into
>>>>>> these 4 caches. Will the job crash?
>>>>>>
>>>>>> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> No, each executor only stores part of the data in memory (it depends on
>>>>>> how the partitions are distributed and how many receivers you have).
>>>>>>
>>>>>> For WindowedDStream, it will obviously cache the data in memory; from
>>>>>> my understanding you don't need to call cache() again.
>>>>>>
>>>>>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> So if I have 10 GB of streaming data coming in, does it require 10 GB of
>>>>>> memory in each node?
>>>>>>
>>>>>> Also, in that case why do we need to use
>>>>>>
>>>>>> dstream.cache()
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> It depends on how you write the Spark application; normally, if the data
>>>>>> is already on persistent storage, there's no need to put it into memory.
>>>>>> The reason why Spark Streaming data has to be stored in memory is that a
>>>>>> streaming source is not a persistent source, so you need to have a place
>>>>>> to store the data.
>>>>>>
>>>>>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:
>>>>>>
>>>>>> Thanks.
>>>>>> What if I use batch calculation instead of stream computing? Do I
>>>>>> still need that much memory? For example, if the 24 hour data set is 100
>>>>>> GB, do I also need 100 GB of RAM to do the one-time batch calculation?
>>>>>>
>>>>>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>>>>>
>>>>>> For window-related operators, Spark Streaming will cache the data
>>>>>> in memory within this window. In your case your window size is up to 24
>>>>>> hours, which means data has to be in the executor's memory for more than
>>>>>> 1 day; this may introduce several problems when memory is not enough.
>>>>>>
>>>>>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> OK, terms for Spark Streaming:
>>>>>>
>>>>>> "Batch interval" is the basic interval at which the system will
>>>>>> receive the data in batches.
>>>>>> This is the interval set when creating a StreamingContext. For
>>>>>> example, if you set the batch interval to 300 seconds, then any input
>>>>>> DStream will generate RDDs of received data at 300-second intervals.
>>>>>> A window operator is defined by two parameters -
>>>>>> - WindowDuration / WindowsLength - the length of the window
>>>>>> - SlideDuration / SlidingInterval - the interval at which the window
>>>>>> will slide or move forward
>>>>>>
>>>>>> OK, so your batch interval is 5 minutes. That is the rate at which
>>>>>> messages are coming in from the source.
>>>>>>
>>>>>> Then you have these two params:
>>>>>>
>>>>>> // window length - the duration of the window below, which must be a
>>>>>> // multiple of the batch interval n in => StreamingContext(sparkConf,
>>>>>> // Seconds(n))
>>>>>> val windowLength = x = m * n
>>>>>> // sliding interval - the interval at which the window operation is
>>>>>> // performed, in other words data is collected within this "previous
>>>>>> // interval"
>>>>>> val slidingInterval = y, with x/y = even number
>>>>>>
>>>>>> Both the window length and the slidingInterval duration must be
>>>>>> multiples of the batch interval, as received data is divided into batches
>>>>>> of duration "batch interval".
>>>>>>
>>>>>> If you want to collect 1 hour of data then windowLength = 12 * 5 * 60
>>>>>> seconds
>>>>>> If you want to collect 24 hours of data then windowLength = 24 * 12 * 5
>>>>>> * 60
>>>>>>
>>>>>> Your sliding window should be set to the batch interval = 5 * 60 seconds.
>>>>>> In other words, that is where the aggregates and summaries for your
>>>>>> report come from.
>>>>>>
>>>>>> What is your data source here?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com>
>>>>>> wrote:
>>>>>>
>>>>>> We have some stream data that needs to be calculated and are
>>>>>> considering using Spark Streaming to do it.
>>>>>>
>>>>>> We need to generate three kinds of reports. The reports are based on
>>>>>>
>>>>>> 1. The last 5 minutes of data
>>>>>> 2. The last 1 hour of data
>>>>>> 3. The last 24 hours of data
>>>>>>
>>>>>> The frequency of the reports is 5 minutes.
>>>>>>
>>>>>> After reading the docs, the most obvious way to solve this seems to be
>>>>>> to set up a Spark stream with a 5 minute interval and two windows, of
>>>>>> 1 hour and 1 day.
>>>>>>
>>>>>> But I am worried that windows of one day and one hour may be too big. I
>>>>>> do not have much experience with Spark Streaming, so what is the window
>>>>>> length in your environment?
>>>>>>
>>>>>> Are there any official docs talking about this?
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
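
The window arithmetic discussed in the thread (a 5 minute batch interval with reports over 5 minutes, 1 hour and 24 hours) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the names BATCH_INTERVAL_S, REPORT_WINDOWS_S and window_params are hypothetical and do not come from the thread, and no Spark API is called. The resulting durations, in seconds, are the kind of values one would pass as the window and slide durations of a windowed DStream.

```python
# Hypothetical helper, not taken from the thread: checks that a window length
# and a sliding interval are both multiples of the batch interval.

BATCH_INTERVAL_S = 5 * 60  # batch interval used in the thread: 5 minutes

# Report windows from the original question, in seconds.
REPORT_WINDOWS_S = {
    "5 minutes": 5 * 60,
    "1 hour": 12 * 5 * 60,        # 12 batches of 5 minutes
    "24 hours": 24 * 12 * 5 * 60,  # 288 batches of 5 minutes
}


def window_params(window_s, slide_s, batch_s=BATCH_INTERVAL_S):
    """Return (batches per window, batches per slide).

    Raises ValueError if either duration is not a whole multiple of the
    batch interval, which is the constraint described in the thread.
    """
    if window_s % batch_s != 0 or slide_s % batch_s != 0:
        raise ValueError(
            "window length and sliding interval must be multiples "
            "of the batch interval"
        )
    return window_s // batch_s, slide_s // batch_s


if __name__ == "__main__":
    for name, window_s in REPORT_WINDOWS_S.items():
        # Reports are produced every 5 minutes, so slide = batch interval.
        batches, slides = window_params(window_s, BATCH_INTERVAL_S)
        print(name, window_s, "seconds =", batches, "batches")
```

With a 5 minute slide, the 24 hour window spans 288 batches, which is why the replies above point out that the executors have to hold more than a day of data in memory.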