Re: how to use spark.mesos.constraints

2016-07-26 Thread Jia Yu
Hi,

I am also trying to use spark.mesos.constraints, but it gives me the same
error: the job has not been accepted by any resources.

I suspect that I need to start some additional service, such as
./sbin/start-mesos-shuffle-service.sh. Am I correct?
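
For reference, here is a minimal sketch of the syntax as I understand it from
the docs. Note that the property name is plural (spark.mesos.constraints), and
the constraints match Mesos agent *attributes*, not resources; the attribute
names below are hypothetical:

  ./bin/spark-submit \
    --conf spark.mesos.coarse=true \
    --conf "spark.mesos.constraints=rack:us-east-1;os:centos7" \
    ...

If I read the docs right, a constraint like cpus:2 only matches when an agent
actually advertises an attribute named cpus, which may be why offers never
match.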

Thanks,
Jia

On Tue, Dec 1, 2015 at 5:14 PM, rarediel wrote:

> I am trying to add Mesos constraints to my spark-submit command in my
> Marathon file. I am also setting spark.mesos.coarse=true.
>
> Here is an example of a constraint I am trying to set.
>
>  --conf spark.mesos.constraint=cpus:2
>
> I want to use the constraints to control the number of executors that are
> created, so I can control the total memory of my Spark job.
>
> I've tried many variations of resource constraints, but no matter which
> resource, number, or range I use, I always get the error "Initial job has
> not accepted any resources; check your cluster UI...". My cluster has the
> available resources. Are there any examples I can look at where people use
> resource constraints?


Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-16 Thread Jia Yu
Hi Peng,

I got exactly the same error! My shuffle data is also very large. Have you
figured out a way to solve it?

Thanks,
Jia

On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng pc...@uow.edu.au wrote:

 I'm deploying a Spark data processing job on an EC2 cluster. The job is
 small for the cluster (16 cores with 120G RAM in total); the largest RDD has
 only 76k+ rows, but it is heavily skewed in the middle (thus requiring
 repartitioning), and each row has around 100k of data after serialization.
 The job always gets stuck in repartitioning; namely, it constantly hits the
 following errors and retries:

 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle

 org.apache.spark.shuffle.FetchFailedException: Error in opening
 FileSegmentManagedBuffer

 org.apache.spark.shuffle.FetchFailedException:
 java.io.FileNotFoundException: /tmp/spark-...
 I've tried to identify the problem, but both memory and disk consumption on
 the machines throwing these errors seem to be below 50%. I've also tried
 different configurations, including:

 let driver/executor memory use 60% of total memory;
 let Netty prioritize the JVM shuffle buffer;
 increase the shuffle streaming buffer to 128m;
 use KryoSerializer and max out all its buffers;
 increase the shuffle memoryFraction to 0.4.
 But none of them works. The small job always triggers the same series of
 errors and maxes out its retries (up to 1000 times). How can I troubleshoot
 this in such a situation?
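
(For concreteness, here is a sketch of what those settings might look like as
spark-submit flags. The property names are my Spark 1.x-era guesses
reconstructed from the list above, not the poster's exact settings, and the
values are only illustrative:)

  ./bin/spark-submit \
    --conf spark.executor.memory=8g \
    --conf spark.shuffle.io.preferDirectBufs=false \
    --conf spark.reducer.maxMbInFlight=128 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryoserializer.buffer.max.mb=512 \
    --conf spark.shuffle.memoryFraction=0.4 \
    ...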

 Thanks a lot if you have any clue.








Help!!! Map or join one large dataset, then suddenly remote Akka client disassociated

2015-06-15 Thread Jia Yu
Hi folks,

Help me! I met a very weird problem and I really need some help!! Here is my
situation:

Case: Assign keys to two datasets (one 96GB with 2.7 billion records, the
other 1.5GB with 30k records) via mapPartitions first, then join them together
on their keys.

Environment:

Standalone Spark on Amazon EC2
Master*1 13GB 8 cores
Worker*16  each one 13GB 8 cores


(After hitting this problem, I switched to
Worker*16  each one 59GB 8 cores)


Read and write on HDFS (same cluster)
--
Problem:

At the beginning:---

The mapPartitions step looks fine. But when Spark does the join of the two
datasets, the console says

*ERROR TaskSchedulerImpl: Lost executor 4 on
ip-172-31-27-174.us-west-2.compute.internal: remote Akka client
disassociated*

Then I went back to this worker and checked its log.

There was something like "Master said remote Akka client disassociated and
asked to kill executor ***", and then the worker killed this executor.

(Sorry, I deleted that log and just remember the content.)

There were no other errors before the Akka client disassociated (on either
the master or the worker).

Then ---

I tried a 62GB dataset with the 1.5GB dataset, and my job worked
smoothly. *HOWEVER, I found one thing: if I set spark.shuffle.memoryFraction
to zero, the same error happens on this 62GB dataset.*

Then ---

I switched my workers to Worker*16, each one 59GB 8 cores. The error then
happened even while Spark was doing the mapPartitions.

Some metrics I found ---

*When I do the mapPartitions or the join with the 96GB data, its shuffle
write is around 100GB. And when I cache the 96GB data, its cached size is
around 530GB.*

*Garbage collection time for the 96GB dataset when Spark does the map or join
is around 12 seconds.*

My analysis--

This problem might be caused by the large shuffle write, which puts heavy I/O
load on the disk. If the shuffle write cannot finish within some timeout
period, the master will decide this executor is disassociated.

But I don't know how to solve this problem.
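
One thing I have not tried yet is raising Spark's network/Akka timeouts so
that a slow shuffle write does not get the executor declared dead. A sketch
(these Spark 1.x properties do exist, but the values here are only guesses):

  ./bin/spark-submit \
    --conf spark.akka.timeout=300 \
    --conf spark.network.timeout=300 \
    --conf spark.core.connection.ack.wait.timeout=600 \
    ...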

---


Any help will be appreciated!!!

Thanks,
Jia


Cannot change the memory of workers

2015-04-07 Thread Jia Yu
Hi guys,

Currently I am running a Spark program on Amazon EC2. Each worker has around
(a little less than) 2GB of memory.

By default, each worker is allocated 976MB of memory, as the table below from
the Spark web UI shows. I know this value comes from (total memory minus
1GB), but I want more than 1GB in each of my workers.

Address   State   Cores        Memory
...       ALIVE   1 (0 Used)   976.0 MB (0.0 B Used)

Based on the instructions on the Spark website, I set export
SPARK_WORKER_MEMORY=1g in spark-env.sh, but it doesn't work. BTW, I can set
SPARK_EXECUTOR_MEMORY=1g and that works.
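
For reference, the change looks like the sketch below (1500m is a
hypothetical value above the 976MB default but below the machine's ~2GB). One
thing worth checking: spark-env.sh is only read at daemon startup, so the
workers have to be restarted before the new value shows up in the UI.

  # in conf/spark-env.sh on each worker node
  export SPARK_WORKER_MEMORY=1500m

  # restart the standalone cluster so the workers pick up the change
  ./sbin/stop-all.sh
  ./sbin/start-all.sh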

Can anyone help me? Is there a requirement that each worker must reserve 1GB
of memory for itself, aside from the memory for Spark?

Thanks,
Jia