Re: Spark on Raspberry Pi?

2014-09-11 Thread Chanwit Kaewkasi
We've found that the Raspberry Pi is not enough for Hadoop/Spark, mainly
because of its memory consumption. What we've built instead is a cluster
of 22 Cubieboards, each with 1 GB of RAM.
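
(For readers who still want to try small ARM boards: below is a minimal,
untested sketch of the kind of conservative memory sizing this implies for
a worker with roughly 1 GB of RAM. The values and the input path are
illustrative assumptions, not a configuration we actually ran.)

// Sketch only: conservative settings for a ~1 GB-RAM ARM board.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object LowMemoryBoardSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("low-memory-board")
      // Leave headroom for the OS and the worker daemon on a 1 GB board.
      .set("spark.executor.memory", "512m")
      // Fewer concurrent tasks keeps peak memory usage down.
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // Simple smoke test: word count over a small file.
    val counts = sc.textFile("hdfs:///tmp/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}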

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Thu, Sep 11, 2014 at 8:04 PM, Sandeep Singh sand...@techaddict.me wrote:
 Has anyone tried using Raspberry Pi for Spark? How efficient would it be
 to use around 10 Pis as a local testing environment?









Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Chanwit Kaewkasi
Hi all,

This is what I found:

1. As Aaron suggested, an executor will be killed silently when the OS
runs out of memory. I've seen this enough times to conclude that it's
real. Adding swap and increasing the JVM heap solved the problem, but you
will then run into OS page-outs and full GCs.

2. Page-outs and full GCs did not affect my benchmark much while
processing data from HDFS, but Akka processes were randomly killed during
network-heavy stages (for example, sorting). I found that an Akka process
could not fetch results fast enough. Increasing the block manager timeout
helped a lot; I doubled the value several times, as the network of our
ARM cluster is quite slow.

3. We want to collect the time spent in every stage of our benchmark, so
we always re-run when some tasks fail. Failures happened a lot, but that's
understandable since Spark is built on top of Akka's let-it-crash
philosophy. To make the benchmark run cleanly (without a task failure), I
called .cache() before applying the transformation of the next stage, and
it helped a lot. (A rough sketch follows below.)
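
(A rough, untested sketch of the kind of settings and caching described in
points 1-3 above; the property name and values are assumptions based on
the Spark 0.9/1.0-era configuration docs, not our exact setup.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object ArmClusterTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("arm-benchmark")
      // Point 1: a larger executor heap makes the OS OOM killer less likely to fire.
      .set("spark.executor.memory", "768m")
      // Point 2: a much larger block manager timeout (in ms) for a slow network.
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///benchmark/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Point 3: cache the output of one stage before the next, network-heavy
    // stage, so a failed task recomputes from memory instead of refetching.
    counts.cache()
    counts.sortByKey().saveAsTextFile("hdfs:///benchmark/output")
    sc.stop()
  }
}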

Combining the above with other tuning, we have now boosted the performance
of our ARM cluster to 2.8 times faster than in our first report.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi chan...@gmail.com wrote:
 Maybe that explains mine too.
 Thank you very much, Aaron!!

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit


 On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com wrote:
 Spark should effectively turn Akka's failure detector off, because we
 historically had problems with GCs and other issues causing disassociations.
 The only thing that should cause these messages nowadays is if the TCP
 connection (which Akka sustains between Actor Systems on different machines)
 actually drops. TCP connections are pretty resilient, so one common cause of
 this is actual Executor failure -- recently, I have experienced a
 similar-sounding problem due to my machine's OOM killer terminating my
 Executors, such that they didn't produce any error output.


 On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

 Hi all,

 On an ARM cluster, I have been testing a wordcount program with JRE 7
 and everything is OK. But after switching to the embedded version of
 Java SE (Oracle's eJRE), the same program cannot complete all of its
 stages.

 It fails with many Akka disassociations.

 - I've been trying to increase Akka's timeout but I'm still stuck. I'm
 not sure what the right way to do this is. (I suspect that a
 stop-the-world GC pause is causing it.)

 - Another question: how can I properly turn on Akka's logging so I can
 see the root cause of this disassociation problem? (In case my guess
 about GC is wrong.)

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit




Re: Announcing Spark 1.0.0

2014-05-30 Thread Chanwit Kaewkasi
Congratulations !!

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Fri, May 30, 2014 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python APIs.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick
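
(As a quick illustration of the new SQL package mentioned above, here is a
minimal sketch assuming the 1.0-era SQLContext/SchemaRDD API; the case
class, table name, and data are made up for the example.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlIntroSketch {
  case class Word(text: String, freq: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-intro"))
    val sqlContext = new SQLContext(sc)
    // In 1.0 this implicit turns an RDD of case classes into a SchemaRDD.
    import sqlContext.createSchemaRDD

    val words = sc.parallelize(Seq(Word("spark", 3), Word("akka", 2)))
    words.registerAsTable("words") // 1.0-era method name

    sqlContext.sql("SELECT text, freq FROM words WHERE freq >= 3")
      .collect()
      .foreach(println)
    sc.stop()
  }
}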


Re: Akka disassociation on Java SE Embedded

2014-05-27 Thread Chanwit Kaewkasi
Maybe that explains mine too.
Thank you very much, Aaron!!

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com wrote:
 Spark should effectively turn Akka's failure detector off, because we
 historically had problems with GCs and other issues causing disassociations.
 The only thing that should cause these messages nowadays is if the TCP
 connection (which Akka sustains between Actor Systems on different machines)
 actually drops. TCP connections are pretty resilient, so one common cause of
 this is actual Executor failure -- recently, I have experienced a
 similar-sounding problem due to my machine's OOM killer terminating my
 Executors, such that they didn't produce any error output.


 On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

 Hi all,

 On an ARM cluster, I have been testing a wordcount program with JRE 7
 and everything is OK. But after switching to the embedded version of
 Java SE (Oracle's eJRE), the same program cannot complete all of its
 stages.

 It fails with many Akka disassociations.

 - I've been trying to increase Akka's timeout but I'm still stuck. I'm
 not sure what the right way to do this is. (I suspect that a
 stop-the-world GC pause is causing it.)

 - Another question: how can I properly turn on Akka's logging so I can
 see the root cause of this disassociation problem? (In case my guess
 about GC is wrong.)

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit




Akka disassociation on Java SE Embedded

2014-05-22 Thread Chanwit Kaewkasi
Hi all,

On an ARM cluster, I have been testing a wordcount program with JRE 7
and everything is OK. But after switching to the embedded version of
Java SE (Oracle's eJRE), the same program cannot complete all of its
stages.

It fails with many Akka disassociations.

- I've been trying to increase Akka's timeout but I'm still stuck. I'm
not sure what the right way to do this is (a sketch of the kind of
settings in question follows below). (I suspect that a stop-the-world GC
pause is causing it.)

- Another question: how can I properly turn on Akka's logging so I can
see the root cause of this disassociation problem? (In case my guess
about GC is wrong.)
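
(A sketch of the kind of settings in question, assuming the spark.akka.*
options documented around Spark 0.9/1.0; the names and values below are
assumptions to verify against your version's configuration page, not
something known to fix this problem.)

import org.apache.spark.{SparkConf, SparkContext}

object AkkaTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("akka-timeout-sketch")
      // Overall Akka communication timeout, in seconds.
      .set("spark.akka.timeout", "300")
      // If your version supports it, this logs Akka remote lifecycle events
      // (associations/disassociations), which helps find the root cause.
      .set("spark.akka.logLifecycleEvents", "true")
    val sc = new SparkContext(conf)
    // ... run the wordcount job as usual ...
    sc.stop()
  }
}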

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


Spark to utilize HDFS's mmap caching

2014-05-15 Thread Chanwit Kaewkasi
Hi all,

Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
sc.textFile() and other HDFS-related APIs?

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Chanwit Kaewkasi
Great to know that! Thank you, Matei.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 That API is something the HDFS administrator uses outside of any application 
 to tell HDFS to cache certain files or directories. But once you've done 
 that, any existing HDFS client accesses them directly from the cache.

 Matei
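
(To make that division of labor concrete, a small sketch: the cache pool
and directive are created by the HDFS administrator outside the Spark
application, e.g. with "hdfs cacheadmin -addPool" and "hdfs cacheadmin
-addDirective"; the Spark side needs no special API. The path below is
made up.)

import org.apache.spark.{SparkConf, SparkContext}

object HdfsCacheReaderSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-cache-reader"))

    // No Spark-side change: sc.textFile() goes through the standard HDFS
    // client, which serves blocks from the centralized cache if the
    // administrator has added a directive for this path.
    val lines = sc.textFile("hdfs:///datasets/cached/webcrawl.txt")
    println("lines read: " + lines.count())

    sc.stop()
  }
}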

 On May 12, 2014, at 11:10 AM, Marcelo Vanzin van...@cloudera.com wrote:

 Is that true? I believe that API Chanwit is talking about requires
 explicitly asking for files to be cached in HDFS.

 Spark automatically benefits from the kernel's page cache (i.e. if
 some block is in the kernel's page cache, it will be read more
 quickly). But the explicit HDFS cache is a different thing; Spark
 applications that want to use it would have to explicitly call the
 respective HDFS APIs.

 On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Yes, Spark goes through the standard HDFS client and will automatically 
 benefit from this.

 Matei

 On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

 Hi all,

 Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
 sc.textFile() and other HDFS-related APIs?

 http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit




 --
 Marcelo