Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Jia Zhan
Hi,

Have you tried building it without Hadoop?

$ build/mvn -DskipTests clean package

Can you check whether build/mvn started successfully, or whether it is using your own
mvn? Let us know your JDK version as well.
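
For reference, commands along these lines (run from the Spark source root; purely
illustrative) should show which Maven and JDK are actually being picked up:

$ build/mvn --version
$ java -version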

On Thu, Oct 29, 2015 at 11:34 PM, Raghuveer Chanda <
raghuveer.cha...@gmail.com> wrote:

> Hi,
>
> I am trying to build spark 1.5.1 for hadoop 2.5 but I get the following
> error.
>
>
> *build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests
> clean package*
>
>
> [INFO] Spark Project Parent POM ... SUCCESS [ 9.812 s]
> [INFO] Spark Project Launcher . SUCCESS [ 27.701 s]
> [INFO] Spark Project Networking ... SUCCESS [ 16.721 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 8.617 s]
> [INFO] Spark Project Unsafe ... SUCCESS [ 27.124 s]
> [INFO] Spark Project Core . FAILURE [09:08 min]
>
> Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) on project spark-core_2.10: Execution
> scala-test-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
> CompileFailed -> [Help 1]
>
>
>
> --
> Regards,
> Raghuveer Chanda
>
>


-- 
Jia Zhan


Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Igor,

It iteratively runs reduce((a,b) => a+b), which is the action there. I can
see clearly 4 stages (one saveAsTextFile() and three reduce()) in the web
UI. I don't know what's going on there that causes the non-intuitive caching
behavior.

Thanks for help!

On Sun, Oct 18, 2015 at 11:32 PM, Igor Berman <igor.ber...@gmail.com> wrote:

> Does ur iterations really submit job? I dont see any action there
> On Oct 17, 2015 00:03, "Jia Zhan" <zhanjia...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am running Spark locally in one node and trying to sweep the memory
>> size for performance tuning. The machine has 8 CPUs and 16G main memory,
>> the dataset in my local disk is about 10GB. I have several quick questions
>> and appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without using RDD.cache(),
>> will anything be cached in memory at all? My guess is that, without
>> RDD.cache(), only a small amount of data will be stored in the OS buffer
>> cache, and every iteration of the computation will still need to fetch most
>> of the data from disk. Is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote a
>> simple program, shown below, which basically consists of one saveAsTextFile()
>> and three reduce() actions/stages. I set "spark.driver.memory" to "15g" and
>> leave everything else at its default. Then I run three experiments.
>>
>>    val conf = new SparkConf().setAppName("wordCount")
>>
>>    val sc = new SparkContext(conf)
>>
>>    val input = sc.textFile("/InputFiles")
>>
>>    val words = input.flatMap(line => line.split(" "))
>>                     .map(word => (word, 1))
>>                     .reduceByKey(_+_)
>>                     .saveAsTextFile("/OutputFiles")
>>
>>    val ITERATIONS = 3
>>
>>    for (i <- 1 to ITERATIONS) {
>>      val totallength = input.filter(line => line.contains("the"))
>>                             .map(s => s.length)
>>                             .reduce((a,b) => a+b)
>>    }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>>
>> (II) The second run: I modified the code so that the input is cached:
>>      val input = sc.textFile("/InputFiles").cache()
>> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min + 2.0min)!
>> The storage page in the web UI shows that 48% of the dataset is cached,
>> which makes sense given the large Java object overhead and the fact that
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) The third run: the same program as the second one, but with
>> "spark.driver.memory" changed to "2g".
>> The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
>> And the UI shows only 6% of the data is cached.
>>
>> From these results we can see that the reduce stages finish in seconds.
>> How can that happen with only 6% of the data cached? Can anyone explain?
>>
>> I am new to Spark and would appreciate any help on this. Thanks!
>>
>> Jia
>>
>>
>>
>>


-- 
Jia Zhan


Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Sonal,

I tried changing spark.executor.memory but nothing changes. It seems that
when I run locally on one machine, the RDD is cached in driver memory instead
of executor memory. Here is a related post online:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-td22279.html

When I change spark.driver.memory, I can see the cached data change in the
web UI. Like I mentioned, when I set driver memory to 2G, it says 6% of the
RDD is cached. When I set it to 15G, it says 48% of the RDD is cached, but
the job runs much slower!
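
(As an aside, a minimal sketch of checking the cached fraction from inside the
application rather than from the web UI; getRDDStorageInfo is a DeveloperApi on
SparkContext, so treat this as illustrative only.)

   sc.getRDDStorageInfo.foreach { rdd =>
     // Prints, per persisted RDD, how many partitions are cached and where.
     println(s"RDD ${rdd.id} '${rdd.name}': " +
       s"${rdd.numCachedPartitions}/${rdd.numPartitions} partitions cached, " +
       s"${rdd.memSize} bytes in memory, ${rdd.diskSize} bytes on disk")
   }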

On Sun, Oct 18, 2015 at 10:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Hi Jia,
>
> RDDs are cached on the executor, not on the driver. I am assuming you are
> running locally and haven't changed spark.executor.memory?
>
> Sonal
> On Oct 19, 2015 1:58 AM, "Jia Zhan" <zhanjia...@gmail.com> wrote:
>
> Does anyone have any clue what's going on? Why would caching with 2g memory
> be much faster than with 15g memory?
>
> Thanks very much!
>
> On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am running Spark locally in one node and trying to sweep the memory
>> size for performance tuning. The machine has 8 CPUs and 16G main memory,
>> the dataset in my local disk is about 10GB. I have several quick questions
>> and appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without using RDD.cache(),
>> will anything be cached in memory at all? My guess is that, without
>> RDD.cache(), only a small amount of data will be stored in the OS buffer
>> cache, and every iteration of the computation will still need to fetch most
>> of the data from disk. Is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote a
>> simple program, shown below, which basically consists of one saveAsTextFile()
>> and three reduce() actions/stages. I set "spark.driver.memory" to "15g" and
>> leave everything else at its default. Then I run three experiments.
>>
>>    val conf = new SparkConf().setAppName("wordCount")
>>
>>    val sc = new SparkContext(conf)
>>
>>    val input = sc.textFile("/InputFiles")
>>
>>    val words = input.flatMap(line => line.split(" "))
>>                     .map(word => (word, 1))
>>                     .reduceByKey(_+_)
>>                     .saveAsTextFile("/OutputFiles")
>>
>>    val ITERATIONS = 3
>>
>>    for (i <- 1 to ITERATIONS) {
>>      val totallength = input.filter(line => line.contains("the"))
>>                             .map(s => s.length)
>>                             .reduce((a,b) => a+b)
>>    }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>>
>> (II) The second run: I modified the code so that the input is cached:
>>      val input = sc.textFile("/InputFiles").cache()
>> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min + 2.0min)!
>> The storage page in the web UI shows that 48% of the dataset is cached,
>> which makes sense given the large Java object overhead and the fact that
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) The third run: the same program as the second one, but with
>> "spark.driver.memory" changed to "2g".
>> The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
>> And the UI shows only 6% of the data is cached.
>>
>> From these results we can see that the reduce stages finish in seconds.
>> How can that happen with only 6% of the data cached? Can anyone explain?
>>
>> I am new to Spark and would appreciate any help on this. Thanks!
>>
>> Jia
>>
>>
>>
>>
>
>
> --
> Jia Zhan
>
>


-- 
Jia Zhan


Re: In-memory computing and cache() in Spark

2015-10-18 Thread Jia Zhan
Does anyone have any clue what's going on? Why would caching with 2g memory be
much faster than with 15g memory?

Thanks very much!

On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote:

> Hi all,
>
> I am running Spark locally in one node and trying to sweep the memory size
> for performance tuning. The machine has 8 CPUs and 16G main memory, the
> dataset in my local disk is about 10GB. I have several quick questions and
> appreciate any comments.
>
> 1. Spark performs in-memory computing, but without using RDD.cache(), will
> anything be cached in memory at all? My guess is that, without RDD.cache(),
> only a small amount of data will be stored in the OS buffer cache, and every
> iteration of the computation will still need to fetch most of the data from
> disk. Is that right?
>
> 2. To evaluate how caching helps with iterative computation, I wrote a
> simple program, shown below, which basically consists of one saveAsTextFile()
> and three reduce() actions/stages. I set "spark.driver.memory" to "15g" and
> leave everything else at its default. Then I run three experiments.
>
>    val conf = new SparkConf().setAppName("wordCount")
>
>    val sc = new SparkContext(conf)
>
>    val input = sc.textFile("/InputFiles")
>
>    val words = input.flatMap(line => line.split(" "))
>                     .map(word => (word, 1))
>                     .reduceByKey(_+_)
>                     .saveAsTextFile("/OutputFiles")
>
>    val ITERATIONS = 3
>
>    for (i <- 1 to ITERATIONS) {
>      val totallength = input.filter(line => line.contains("the"))
>                             .map(s => s.length)
>                             .reduce((a,b) => a+b)
>    }
>
> (I) The first run: no caching at all. The application finishes in ~12
> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>
> (II) The second run: I modified the code so that the input is cached:
>      val input = sc.textFile("/InputFiles").cache()
> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min + 2.0min)!
> The storage page in the web UI shows that 48% of the dataset is cached, which
> makes sense given the large Java object overhead and the fact that
> spark.storage.memoryFraction is 0.6 by default.
>
> (III) The third run: the same program as the second one, but with
> "spark.driver.memory" changed to "2g".
> The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
> And the UI shows only 6% of the data is cached.
>
> From these results we can see that the reduce stages finish in seconds.
> How can that happen with only 6% of the data cached? Can anyone explain?
>
> I am new to Spark and would appreciate any help on this. Thanks!
>
> Jia
>
>
>
>


-- 
Jia Zhan


In-memory computing and cache() in Spark

2015-10-16 Thread Jia Zhan
Hi all,

I am running Spark locally in one node and trying to sweep the memory size
for performance tuning. The machine has 8 CPUs and 16G main memory, the
dataset in my local disk is about 10GB. I have several quick questions and
appreciate any comments.

1. Spark performs in-memory computing, but without using RDD.cache(), will
anything be cached in memory at all? My guess is that, without RDD.cache(),
only a small amount of data will be stored in the OS buffer cache, and every
iteration of the computation will still need to fetch most of the data from
disk. Is that right?

2. To evaluate how caching helps with iterative computation, I wrote a
simple program, shown below, which basically consists of one saveAsTextFile()
and three reduce() actions/stages. I set "spark.driver.memory" to "15g" and
leave everything else at its default. Then I run three experiments.

   val conf = new SparkConf().setAppName("wordCount")

   val sc = new SparkContext(conf)

   val input = sc.textFile("/InputFiles")

   val words = input.flatMap(line => line.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey(_+_)
                    .saveAsTextFile("/OutputFiles")

   val ITERATIONS = 3

   for (i <- 1 to ITERATIONS) {
     val totallength = input.filter(line => line.contains("the"))
                            .map(s => s.length)
                            .reduce((a,b) => a+b)
   }
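
(For illustration only, and not part of the runs below: a minimal sketch of the
loop with the input persisted under a serialized storage level, which usually
lets more of the 10GB fit into the storage fraction. The storage level and the
variable name cachedInput are just assumptions for this sketch; runs (II) and
(III) below use plain cache().)

   import org.apache.spark.storage.StorageLevel

   // Serialized in-memory storage trades extra CPU for a smaller footprint.
   val cachedInput = sc.textFile("/InputFiles").persist(StorageLevel.MEMORY_ONLY_SER)

   for (i <- 1 to ITERATIONS) {
     // Same action as above, but iterations after the first read cached blocks
     // instead of re-reading the text files from disk.
     val totallength = cachedInput.filter(line => line.contains("the"))
                                  .map(s => s.length)
                                  .reduce((a,b) => a+b)
   }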

(I) The first run: no caching at all. The application finishes in ~12
minutes (2.6min + 3.3min + 3.2min + 3.3min).

(II) The second run: I modified the code so that the input is cached:
     val input = sc.textFile("/InputFiles").cache()
The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min + 2.0min)!
The storage page in the web UI shows that 48% of the dataset is cached, which
makes sense given the large Java object overhead and the fact that
spark.storage.memoryFraction is 0.6 by default.

(III) The third run: the same program as the second one, but with
"spark.driver.memory" changed to "2g".
The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
And the UI shows only 6% of the data is cached.

From these results we can see that the reduce stages finish in seconds.
How can that happen with only 6% of the data cached? Can anyone explain?

I am new to Spark and would appreciate any help on this. Thanks!

Jia


Can we gracefully kill stragglers in Spark SQL

2015-09-04 Thread Jia Zhan
Hello all,

I am new to Spark and have been working on a small project trying to tackle
the straggler problem. I ran some SQL queries (GROUP BY) on a small cluster
and observed that some tasks take several minutes while others finish in
seconds.
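
(A minimal sketch, under the assumption that logging is enough to start with,
of one way to flag which tasks are the stragglers from inside the application;
SparkListener and addSparkListener are DeveloperApi, so this is illustrative
rather than a recommended approach.)

   import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

   class StragglerLogger(thresholdMs: Long) extends SparkListener {
     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
       val info = taskEnd.taskInfo
       // Flag any finished task whose wall-clock duration exceeds the threshold.
       if (info != null && info.duration > thresholdMs) {
         println(s"Straggler: stage ${taskEnd.stageId}, task ${info.taskId}, " +
           s"${info.duration} ms on ${info.host}")
       }
     }
   }

   // Register on the SparkContext, e.g. flag tasks running longer than 60s.
   sc.addSparkListener(new StragglerLogger(60000))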

I know that Spark already has a speculation mode, but I still see this problem
with speculation turned on. Therefore, I modified the code to kill those
stragglers instead of re-executing them, trading accuracy for speed. As
expected, killing stragglers causes the system to hang because of the lost
tasks. Can anyone give some guidance on getting this to work? Is it possible
to terminate some tasks early without affecting the overall execution of the
job, at some cost in accuracy?
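
(For reference, and only as an assumption about typical tuning rather than a
fix for the hang: these are the speculation settings that control when Spark
re-launches slow tasks; the values shown are just examples.)

   import org.apache.spark.SparkConf

   val conf = new SparkConf()
     .set("spark.speculation", "true")
     // Fraction of tasks in a stage that must finish before speculating.
     .set("spark.speculation.quantile", "0.75")
     // How many times slower than the median a task must be to be re-launched.
     .set("spark.speculation.multiplier", "1.5")
   // Pass conf to new SparkContext(conf), or set the same keys via spark-submit.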

Appreciate your help!

-- 
Jia Zhan