Re: Spark 1.5.1 Build Failure
Hi,

Have you tried building it successfully without Hadoop?

  build/mvn -DskipTests clean package

Can you check whether build/mvn was started successfully, or whether it is using your own mvn? Let us know your JDK version as well.

On Thu, Oct 29, 2015 at 11:34 PM, Raghuveer Chanda <raghuveer.cha...@gmail.com> wrote:
> Hi,
>
> I am trying to build Spark 1.5.1 for Hadoop 2.5, but I get the following
> error:
>
>   build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package
>
> [INFO] Spark Project Parent POM ................... SUCCESS [  9.812 s]
> [INFO] Spark Project Launcher ..................... SUCCESS [ 27.701 s]
> [INFO] Spark Project Networking ................... SUCCESS [ 16.721 s]
> [INFO] Spark Project Shuffle Streaming Service .... SUCCESS [  8.617 s]
> [INFO] Spark Project Unsafe ....................... SUCCESS [ 27.124 s]
> [INFO] Spark Project Core ......................... FAILURE [09:08 min]
>
> Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) on project spark-core_2.10: Execution
> scala-test-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
> CompileFailed -> [Help 1]
>
> --
> Regards,
> Raghuveer Chanda

--
Jia Zhan
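For anyone following along, the two build invocations being compared above look like this as a sketch; the JAVA_HOME path and source directory are placeholders, and Spark 1.5.x needs JDK 7 or later:

```shell
# Sketch of the builds discussed above. JAVA_HOME is a placeholder;
# build/mvn fetches its own Maven unless a suitable local one is found first.
export JAVA_HOME=/path/to/jdk7        # placeholder: your JDK 7+ install
cd spark-1.5.1                        # placeholder: Spark source root

# First, check that a plain build (no custom Hadoop version) succeeds:
./build/mvn -DskipTests clean package

# Then retry with the CDH Hadoop version from the original report:
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 \
  -DskipTests clean package
```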
Re: In-memory computing and cache() in Spark
Hi Igor,

It iteratively conducts reduce((a,b) => a+b), which is the action there. I can clearly see 4 stages (one saveAsTextFile() and three reduce()) in the web UI. I don't know what's going on there that causes the non-intuitive caching behavior. Thanks for the help!

On Sun, Oct 18, 2015 at 11:32 PM, Igor Berman <igor.ber...@gmail.com> wrote:
> Does ur iterations really submit job? I dont see any action there
>
> On Oct 17, 2015 00:03, "Jia Zhan" <zhanjia...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am running Spark locally on one node and trying to sweep the memory
>> size for performance tuning. The machine has 8 CPUs and 16G of main
>> memory, and the dataset on my local disk is about 10GB. I have several
>> quick questions and appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without using RDD.cache(),
>> will anything be cached in memory at all? My guess is that, without
>> RDD.cache(), only a small amount of data will be stored in the OS buffer
>> cache, and every iteration of computation will still need to fetch most
>> data from disk. Is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote a
>> simple program, shown below, which consists of one saveAsTextFile() and
>> three reduce() actions/stages. I set "spark.driver.memory" to "15g" and
>> left everything else at its default. Then I ran three experiments.
>>
>>   val conf = new SparkConf().setAppName("wordCount")
>>   val sc = new SparkContext(conf)
>>   val input = sc.textFile("/InputFiles")
>>   val words = input.flatMap(line => line.split(" ")).map(word => (word, 1))
>>     .reduceByKey(_+_).saveAsTextFile("/OutputFiles")
>>
>>   val ITERATIONS = 3
>>   for (i <- 1 to ITERATIONS) {
>>     val totallength = input.filter(line => line.contains("the"))
>>       .map(s => s.length).reduce((a,b) => a+b)
>>   }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>>
>> (II) The second run: I modified the code so that the input is cached:
>>
>>   val input = sc.textFile("/InputFiles").cache()
>>
>> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min +
>> 2.0min). The storage page in the web UI shows 48% of the dataset is
>> cached, which makes sense given the large Java object overhead, and
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) However, in the third run, the same program as the second but with
>> "spark.driver.memory" changed to "2g", the application finishes in just
>> 3.6 minutes (3.0min + 9s + 9s + 9s)! And the UI shows 6% of the data is
>> cached.
>>
>> From the results we can see the reduce stages finish in seconds; how
>> could that happen with only 6% cached? Can anyone explain?
>>
>> I am new to Spark and would appreciate any help on this. Thanks!
>>
>> Jia

--
Jia Zhan
Re: In-memory computing and cache() in Spark
Hi Sonal,

I tried changing spark.executor.memory, but nothing changes. It seems that when I run locally on one machine, the RDD is cached in driver memory instead of executor memory. Here is a related post online:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-td22279.html

When I change spark.driver.memory, I can see the change in cached data in the web UI. As I mentioned, when I set driver memory to 2G, it says 6% of the RDD is cached. When set to 15G, it says 48% of the RDD is cached, but runs much slower!

On Sun, Oct 18, 2015 at 10:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
> Hi Jia,
>
> RDDs are cached on the executor, not on the driver. I am assuming you are
> running locally and haven't changed spark.executor.memory?
>
> Sonal
>
> On Oct 19, 2015 1:58 AM, "Jia Zhan" <zhanjia...@gmail.com> wrote:
>
> Anyone has any clue what's going on? Why would caching with 2g memory be
> much faster than with 15g memory?
>
> Thanks very much!
>
> On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am running Spark locally on one node and trying to sweep the memory
>> size for performance tuning. The machine has 8 CPUs and 16G of main
>> memory, and the dataset on my local disk is about 10GB. I have several
>> quick questions and appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without using RDD.cache(),
>> will anything be cached in memory at all? My guess is that, without
>> RDD.cache(), only a small amount of data will be stored in the OS buffer
>> cache, and every iteration of computation will still need to fetch most
>> data from disk. Is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote a
>> simple program, shown below, which consists of one saveAsTextFile() and
>> three reduce() actions/stages. I set "spark.driver.memory" to "15g" and
>> left everything else at its default. Then I ran three experiments.
>>
>>   val conf = new SparkConf().setAppName("wordCount")
>>   val sc = new SparkContext(conf)
>>   val input = sc.textFile("/InputFiles")
>>   val words = input.flatMap(line => line.split(" ")).map(word => (word, 1))
>>     .reduceByKey(_+_).saveAsTextFile("/OutputFiles")
>>
>>   val ITERATIONS = 3
>>   for (i <- 1 to ITERATIONS) {
>>     val totallength = input.filter(line => line.contains("the"))
>>       .map(s => s.length).reduce((a,b) => a+b)
>>   }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>>
>> (II) The second run: I modified the code so that the input is cached:
>>
>>   val input = sc.textFile("/InputFiles").cache()
>>
>> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min +
>> 2.0min). The storage page in the web UI shows 48% of the dataset is
>> cached, which makes sense given the large Java object overhead, and
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) However, in the third run, the same program as the second but with
>> "spark.driver.memory" changed to "2g", the application finishes in just
>> 3.6 minutes (3.0min + 9s + 9s + 9s)! And the UI shows 6% of the data is
>> cached.
>>
>> From the results we can see the reduce stages finish in seconds; how
>> could that happen with only 6% cached? Can anyone explain?
>>
>> I am new to Spark and would appreciate any help on this. Thanks!
>>
>> Jia
>
> --
> Jia Zhan

--
Jia Zhan
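This matches how local mode works: the executor runs inside the driver JVM, so the driver heap is what holds cached blocks, and it has to be sized before that JVM starts. A sketch of passing it at launch (the class name and jar path are placeholders):

```shell
# Sketch: in local mode the executor lives inside the driver JVM, so the
# heap that holds cached RDD blocks is set via --driver-memory at launch.
# Class name and jar path are placeholders.
./bin/spark-submit \
  --master "local[8]" \
  --driver-memory 2g \
  --class WordCount \
  target/wordcount.jar
# Note: setting spark.driver.memory in SparkConf inside the application is
# too late in this mode; the driver JVM has already started by then.
```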
Re: In-memory computing and cache() in Spark
Does anyone have any clue what's going on? Why would caching with 2g memory be much faster than with 15g memory?

Thanks very much!

On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote:
> Hi all,
>
> I am running Spark locally on one node and trying to sweep the memory
> size for performance tuning. The machine has 8 CPUs and 16G of main
> memory, and the dataset on my local disk is about 10GB. I have several
> quick questions and appreciate any comments.
>
> 1. Spark performs in-memory computing, but without using RDD.cache(),
> will anything be cached in memory at all? My guess is that, without
> RDD.cache(), only a small amount of data will be stored in the OS buffer
> cache, and every iteration of computation will still need to fetch most
> data from disk. Is that right?
>
> 2. To evaluate how caching helps with iterative computation, I wrote a
> simple program, shown below, which consists of one saveAsTextFile() and
> three reduce() actions/stages. I set "spark.driver.memory" to "15g" and
> left everything else at its default. Then I ran three experiments.
>
>   val conf = new SparkConf().setAppName("wordCount")
>   val sc = new SparkContext(conf)
>   val input = sc.textFile("/InputFiles")
>   val words = input.flatMap(line => line.split(" ")).map(word => (word, 1))
>     .reduceByKey(_+_).saveAsTextFile("/OutputFiles")
>
>   val ITERATIONS = 3
>   for (i <- 1 to ITERATIONS) {
>     val totallength = input.filter(line => line.contains("the"))
>       .map(s => s.length).reduce((a,b) => a+b)
>   }
>
> (I) The first run: no caching at all. The application finishes in ~12
> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>
> (II) The second run: I modified the code so that the input is cached:
>
>   val input = sc.textFile("/InputFiles").cache()
>
> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min +
> 2.0min). The storage page in the web UI shows 48% of the dataset is
> cached, which makes sense given the large Java object overhead, and
> spark.storage.memoryFraction is 0.6 by default.
>
> (III) However, in the third run, the same program as the second but with
> "spark.driver.memory" changed to "2g", the application finishes in just
> 3.6 minutes (3.0min + 9s + 9s + 9s)! And the UI shows 6% of the data is
> cached.
>
> From the results we can see the reduce stages finish in seconds; how
> could that happen with only 6% cached? Can anyone explain?
>
> I am new to Spark and would appreciate any help on this. Thanks!
>
> Jia

--
Jia Zhan
In-memory computing and cache() in Spark
Hi all,

I am running Spark locally on one node and trying to sweep the memory size for performance tuning. The machine has 8 CPUs and 16G of main memory, and the dataset on my local disk is about 10GB. I have several quick questions and appreciate any comments.

1. Spark performs in-memory computing, but without using RDD.cache(), will anything be cached in memory at all? My guess is that, without RDD.cache(), only a small amount of data will be stored in the OS buffer cache, and every iteration of computation will still need to fetch most data from disk. Is that right?

2. To evaluate how caching helps with iterative computation, I wrote a simple program, shown below, which consists of one saveAsTextFile() and three reduce() actions/stages. I set "spark.driver.memory" to "15g" and left everything else at its default. Then I ran three experiments.

  val conf = new SparkConf().setAppName("wordCount")
  val sc = new SparkContext(conf)
  val input = sc.textFile("/InputFiles")
  val words = input.flatMap(line => line.split(" ")).map(word => (word, 1))
    .reduceByKey(_+_).saveAsTextFile("/OutputFiles")

  val ITERATIONS = 3
  for (i <- 1 to ITERATIONS) {
    val totallength = input.filter(line => line.contains("the"))
      .map(s => s.length).reduce((a,b) => a+b)
  }

(I) The first run: no caching at all. The application finishes in ~12 minutes (2.6min + 3.3min + 3.2min + 3.3min).

(II) The second run: I modified the code so that the input is cached:

  val input = sc.textFile("/InputFiles").cache()

The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min + 2.0min). The storage page in the web UI shows 48% of the dataset is cached, which makes sense given the large Java object overhead, and spark.storage.memoryFraction is 0.6 by default.

(III) However, in the third run, the same program as the second but with "spark.driver.memory" changed to "2g", the application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)! And the UI shows 6% of the data is cached.

From the results we can see the reduce stages finish in seconds; how could that happen with only 6% cached? Can anyone explain?

I am new to Spark and would appreciate any help on this. Thanks!

Jia
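To make the dataflow above easy to experiment with outside Spark, here is a plain-Scala sketch of the same pipeline over an in-memory collection. No Spark dependency is needed; the two input lines are hypothetical stand-ins for the 10GB dataset:

```scala
// Plain-Scala sketch of the email's pipeline, with a tiny in-memory
// "dataset" standing in for /InputFiles. No Spark required.
object PipelineSketch {
  // Hypothetical stand-in for the 10GB input.
  val input: Seq[String] = Seq("the quick brown fox", "jumps over the lazy dog")

  // Analogue of the word-count stage (reduceByKey over (word, 1) pairs).
  val wordCounts: Map[String, Int] =
    input.flatMap(_.split(" ")).map(w => (w, 1))
      .groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // Analogue of one filter/map/reduce iteration from the loop.
  def totalLength: Int =
    input.filter(_.contains("the")).map(_.length).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    for (i <- 1 to 3) println(s"iteration $i: totalLength = $totalLength")
  }
}
```

Swapping the Seq for an RDD recovers the original program; the sketch just makes it easy to see that only reduce() (and saveAsTextFile()) trigger jobs, while the map and filter calls are lazy.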
Can we gracefully kill stragglers in Spark SQL
Hello all,

I am new to Spark and have been working on a small project trying to tackle the straggler problem. I ran some SQL queries (GROUP BY) on a small cluster and observed that some tasks take several minutes while others finish in seconds. I know that Spark already has speculative execution, but I still see this problem with speculation turned on. Therefore, I modified the code to kill those stragglers instead of re-executing them, trading accuracy for speed. As expected, killing stragglers causes the system to hang due to the lost tasks.

Can anyone give some guidance on getting this to work? Is it possible to terminate some tasks early without affecting the overall execution of the job, at some cost to accuracy?

Appreciate your help!

--
Jia Zhan
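Before killing tasks outright, it may be worth tuning how aggressively speculation fires. A sketch of the standard speculation settings passed at submit time; the values shown are illustrative (they happen to match the defaults), and the class and jar names are placeholders:

```shell
# Sketch: tuning speculative execution instead of killing stragglers.
# Values are illustrative; class and jar names are placeholders.
./bin/spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.quantile=0.75 \
  --conf spark.speculation.multiplier=1.5 \
  --class MyQuery \
  myquery.jar
```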