Question about executor memory setting

2016-09-27 Thread Dogtail L
Hi all,

May I ask a question about the executor memory setting? I was running
PageRank with an input size of 2.8GB on one workstation for testing, and I
gave PageRank one executor.

In case 1, I set --executor-cores to 4 and --executor-memory to 1GB; the
stage (stage 2) completion time is 14 min, and the detailed stage info is
below:


In case 2, I set --executor-cores to 4 and --executor-memory to 6GB; the
stage (stage 2) completion time is 34 min, and the detailed stage info is
below:

I am totally confused: why is the stage completion time more than two times
longer when executor-memory is larger? From the web UI, I found that when
executor memory is 6GB, the shuffle spill (Disk) per task is smaller, which
means fewer I/O operations, yet the task completion time is longer. Could
anyone give me some hints? Great thanks!
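
For concreteness, here is a minimal Scala sketch of the two configurations
being compared, assuming they were set programmatically through SparkConf
rather than via spark-submit flags; the app names are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Case 1: one executor with 4 cores and 1 GB of memory
    // (equivalent to --executor-cores 4 --executor-memory 1g).
    val conf1 = new SparkConf()
      .setAppName("PageRank-1g")            // placeholder app name
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "1g")

    // Case 2: one executor with 4 cores and 6 GB of memory
    // (equivalent to --executor-cores 4 --executor-memory 6g).
    val conf2 = new SparkConf()
      .setAppName("PageRank-6g")            // placeholder app name
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "6g")

    // Only one SparkContext is active per JVM, so the two configurations
    // correspond to two separate runs of the same PageRank job.
    val sc = new SparkContext(conf1)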


How to decide the number of tasks in Spark?

2016-04-18 Thread Dogtail L
Hi,

When launching a job in Spark, I have great trouble deciding the number of
tasks. Some say it is better to create one task per HDFS block, i.e., make
sure each task processes 128MB of input data; others suggest that the number
of tasks should be twice the total cores available to the job. I have also
seen the suggestion to launch small tasks with Spark, i.e., make sure each
task lasts around 100ms.

I am quite confused about all these suggestions. Is there any general rule
for deciding the number of tasks in Spark? Great thanks!
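
For reference, here is a minimal Scala sketch of the knobs that determine
how many tasks a stage gets; the input path and the numbers are
placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("TaskCountSketch")          // placeholder app name
      // Default partition count used for shuffles and sc.parallelize():
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // One task per partition: textFile's second argument is a *minimum*
    // number of partitions; HDFS block boundaries still apply.
    val lines = sc.textFile("hdfs:///path/to/input", 8)

    // Or repartition explicitly, e.g. to twice the cores given to the job.
    val repartitioned = lines.repartition(2 * 4)

    println(s"tasks in the next stage = ${repartitioned.getNumPartitions}")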

Best


Question about Spark shuffle read size

2015-11-04 Thread Dogtail L
Hi all,

When I run WordCount using Spark, I find that when I set
"spark.default.parallelism" to different numbers, the Shuffle Write size
and Shuffle Read size change as well (I read these numbers from the history
server's web UI). Is it because the shuffle write size also includes some
metadata?

Also, my input file for WordCount is approximately 3 kB (stored in the local
filesystem), and I partitioned it into 10 pieces using the textFile
function. However, the web UI shows that WordCount's input data size is
19.5 kB, much larger than the actual file size. Why would that happen?
Great thanks!
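
To make the setup concrete, here is a minimal Scala sketch of the WordCount
job described above, assuming the usual textFile/flatMap/reduceByKey
formulation; the file path is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("WordCount")                  // placeholder app name
      .set("spark.default.parallelism", "10")   // one of the values varied
    val sc = new SparkContext(conf)

    // A ~3 kB local file, split into at least 10 partitions.
    val counts = sc.textFile("file:///path/to/input.txt", 10)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)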


Re: How to compile Spark with customized Hadoop?

2015-10-14 Thread Dogtail L
Hi,

When I publish my version of Hadoop, it is installed in
/HOME_DIRECTORY/.m2/repository/org/apache/hadoop, but when I compile Spark,
it fetches the Hadoop libraries from
https://repo1.maven.org/maven2/org/apache/hadoop. How can I make Spark fetch
the Hadoop libraries from my local M2 cache instead? Great thanks!

On Fri, Oct 9, 2015 at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> You can publish your version of Hadoop to your Maven cache with mvn
> publish (just give it a different version number, e.g. 2.7.0a) and then
> pass that as the Hadoop version to Spark's build (see
> http://spark.apache.org/docs/latest/building-spark.html).
>
> Matei
>
> On Oct 9, 2015, at 3:10 PM, Dogtail L <spark.ru...@gmail.com> wrote:
>
> Hi all,
>
> I have modified Hadoop source code, and I want to compile Spark with my
> modified Hadoop. Do you know how to do that? Great thanks!
>


How to compile Spark with customized Hadoop?

2015-10-09 Thread Dogtail L
Hi all,

I have modified Hadoop source code, and I want to compile Spark with my
modified Hadoop. Do you know how to do that? Great thanks!