Re: how to use spark.mesos.constraints
Hi,

I am also trying to use spark.mesos.constraints, but it gives me the same error: "Initial job has not accepted any resources". I suspect that I need to start an additional service, such as ./sbin/start-mesos-shuffle-service.sh. Am I correct?

Thanks,
Jia

On Tue, Dec 1, 2015 at 5:14 PM, rarediel wrote:

> I am trying to add Mesos constraints to the spark-submit command in my
> Marathon file. I am also setting spark.mesos.coarse=true.
>
> Here is an example of a constraint I am trying to set:
>
> --conf spark.mesos.constraint=cpus:2
>
> I want to use the constraints to control the number of executors that are
> created, so I can control the total memory of my Spark job.
>
> I've tried many variations of resource constraints, but no matter which
> resource, number, range, etc. I use, I always get the error "Initial
> job has not accepted any resources; check your cluster UI...". My cluster
> has the available resources. Are there any examples I can look at where
> people use resource constraints?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-use-spark-mesos-constraints-tp25541.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
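[Editor's note] My reading of the Spark-on-Mesos docs is that spark.mesos.constraints (note the plural) matches against Mesos agent *attributes*, not resources, so a constraint like cpus:2 can never match any offer, and every offer gets declined — which produces exactly the "Initial job has not accepted any resources" message. As far as I know, the Mesos shuffle service is only required for dynamic allocation, not for plain coarse-grained mode. A minimal sketch, assuming the agents were started with a hypothetical attribute rack:us-east; the executor count and memory are then capped with the usual properties rather than with constraints:

```shell
# Hypothetical agent setup: mesos-slave ... --attributes="rack:us-east"
spark-submit \
  --master mesos://<zk-or-master-url> \
  --conf spark.mesos.coarse=true \
  --conf spark.mesos.constraints="rack:us-east" \
  --conf spark.cores.max=8 \
  --conf spark.executor.memory=4g \
  my_job.py
```

Here <zk-or-master-url> and my_job.py are placeholders; spark.cores.max and spark.executor.memory together bound the job's total memory footprint, which seems to be what the original poster actually wanted.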
Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?
Hi Peng,

I got exactly the same error! My shuffle data is also very large. Have you figured out a way to solve it?

Thanks,
Jia

On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng pc...@uow.edu.au wrote:

I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120G RAM in total); the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (thus requires repartitioning), and each row has around 100k of data after serialization. The job always gets stuck in repartitioning; namely, it constantly hits the following errors and retries:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer
org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...

I've tried to identify the problem, but both memory and disk consumption of the machines throwing these errors are below 50%. I've also tried different configurations, including:

- let driver/executor memory use 60% of total memory
- let Netty prioritize the JVM shuffling buffer
- increase the shuffle streaming buffer to 128m
- use KryoSerializer and max out all buffers
- increase the shuffle memoryFraction to 0.4

But none of them works. The small job always triggers the same series of errors and maxes out its retries (up to 1000 times). How do I troubleshoot this kind of situation? Thanks a lot if you have any clue.

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/What-are-the-likely-causes-of-org-apache-spark-shuffle-MetadataFetchFailedException-Missing-an-outpu-tp22646.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
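[Editor's note] The settings Peng lists map roughly onto the properties below on the Spark 1.x line (a hedged sketch; the names are from that era's configuration docs and several were renamed in later releases, so verify against your version). Also worth noting: MetadataFetchFailedException usually means the executor that held the map output has died, so checking the executor logs for OOM kills or long GC pauses is often more productive than tuning buffers:

```shell
# spark-defaults.conf -- illustrative values, not a verified fix
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  512m   # "max out all buffers"
spark.shuffle.memoryFraction     0.4    # "increase the shuffle memoryFraction to 0.4"
spark.reducer.maxSizeInFlight    128m   # "shuffle streaming buffer to 128m" (maxMbInFlight pre-1.4)
spark.shuffle.io.maxRetries      10     # retry individual fetches before failing the stage
spark.shuffle.io.retryWait       30s    # wait longer between fetch retries
```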
Help!!! Map or join on one large dataset, then suddenly remote Akka client disassociated
Hi folks,

Help me! I've hit a very weird problem and I really need some help! Here is my situation:

Case: Assign keys to two datasets (one 96GB with 2.7 billion records, one 1.5GB with 30k records) via mapPartitions first, then join them together on their keys.

Environment: Standalone Spark on Amazon EC2. Master*1: 13GB, 8 cores. Worker*16: each 13GB, 8 cores. (After hitting this problem, I switched to Worker*16, each 59GB, 8 cores.) Read and write on HDFS (same cluster).

Problem:

At the beginning, the mapPartitions looks fine. But when Spark does the join of the two datasets, the console says:

*ERROR TaskSchedulerImpl: Lost executor 4 on ip-172-31-27-174.us-west-2.compute.internal: remote Akka client disassociated*

Then I go back to that worker and check its log. There is something like "Master said remote Akka client disassociated and asked to kill executor ***", and then the worker killed this executor. (Sorry, I deleted that log and just remember the content.) There are no other errors before the Akka client disassociated, for both master and worker.

Then I tried a 62GB dataset with the 1.5GB dataset. My job worked smoothly. *HOWEVER, I found one thing: if I set spark.shuffle.memoryFraction to zero, the same error happens on this 62GB dataset.*

Then I switched my workers to Worker*16, each 59GB, 8 cores. The error happened even when Spark does the mapPartitions.

Some metrics I found: *when I do the mapPartitions or join with the 96GB data, its shuffle write is around 100GB. And when I cached the 96GB data, its size is around 530GB.* *Garbage collection time for the 96GB dataset when Spark does the map or join is around 12 seconds.*

My analysis: this problem might be caused by the large shuffle write, which causes high I/O on disk. If the shuffle write cannot finish within some timeout period, then the master thinks this executor is disassociated. But I don't know how to solve this problem.

Any help will be appreciated!!!

Thanks,
Jia
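[Editor's note] If the analysis above is right — the executor being declared lost because shuffle I/O outlasts a timeout — then the knobs below are the usual first candidates (an illustrative spark-defaults.conf fragment for 1.x-era standalone deployments; the property names are assumptions from memory, so verify against your release). "Remote Akka client disassociated" is also the classic symptom of an executor dying from OOM or a multi-minute GC pause, so raising executor memory and keeping spark.shuffle.memoryFraction above zero matter as much as the timeouts:

```shell
# spark-defaults.conf -- illustrative values only
spark.network.timeout                   300s  # unified network timeout (Spark 1.3+)
spark.core.connection.ack.wait.timeout  300   # older ack timeout, in seconds
spark.shuffle.memoryFraction            0.3   # zero forces every aggregation to spill to disk
spark.executor.memory                   48g   # leave headroom on a 59 GB worker
```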
Cannot change the memory of workers
Hi guys,

Currently I am running a Spark program on Amazon EC2. Each worker has around (less than, but close to) 2GB of memory. By default, each worker is allocated 976MB, as the table below from the Spark web UI shows. I know this value comes from (total memory minus 1GB), but I want more than 1GB for each of my workers.

Address    State   Cores        Memory
           ALIVE   1 (0 Used)   976.0 MB (0.0 B Used)

Based on the instructions on the Spark website, I set export SPARK_WORKER_MEMORY=1g in spark-env.sh, but it doesn't work. BTW, I can set SPARK_EXECUTOR_MEMORY=1g and that works. Can anyone help me? Is there a requirement that each worker must keep 1GB of memory for itself, aside from the memory for Spark?

Thanks,
Jia
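[Editor's note] If I remember the standalone-mode docs correctly, SPARK_WORKER_MEMORY caps what a worker daemon may hand out to executors and is only read when the daemon starts, so it must be set in conf/spark-env.sh on every worker and the workers restarted, while SPARK_EXECUTOR_MEMORY (or spark.executor.memory) is what each executor requests and must fit inside the worker's allowance. A sketch with hypothetical values for a ~2GB machine:

```shell
# conf/spark-env.sh on each worker (hypothetical values)
export SPARK_WORKER_MEMORY=1500m    # total memory this worker may grant to executors
export SPARK_EXECUTOR_MEMORY=1200m  # per-executor heap; must be <= SPARK_WORKER_MEMORY

# Restart the daemons so the worker re-reads the setting:
# sbin/stop-all.sh && sbin/start-all.sh
```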