For performance, Spark prefers OracleJDK or OpenJDK?
Hi, Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com
Re: For performance, Spark prefers OracleJDK or OpenJDK?
I would like to say that Oracle JDK may be the better choice. A lot of Hadoop distribution vendors use Oracle JDK instead of OpenJDK for enterprise deployments. On Mon, May 19, 2014 at 2:50 PM, Hao Wang wh.s...@gmail.com wrote: Hi, Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com -- Regards Gordon Wang
Re: sync master with slaves with bittorrent?
btw is there a command or script to update the slaves from the master? thanks Daniel On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote: If the codebase for Spark's broadcast is pretty self-contained, you could consider creating a small bootstrap sent out via the doubling rsync strategy that Mosharaf outlined above (called Tree D=2 in the paper) that then pulled the larger distribution. Mosharaf, do you have a sense of whether the gains from using Cornet vs Tree D=2 with rsync outweigh the overhead of using a 2-phase broadcast mechanism? Andrew On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.com wrote: One issue with using Spark itself is that this rsync is required to get Spark to work... Also note that a similar strategy is used for *updating* the spark cluster on ec2, where the diff aspect is much more important, as you might only make a small change on the driver node (recompile or reconfigure) and can get a fast sync. On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury mosharafka...@gmail.com wrote: What Twitter calls murder, unless it has changed since then, is just a BitTornado wrapper. In 2011, we did some comparison on the performance of murder and the TorrentBroadcast we have right now for Spark's own broadcast (Section 7.1 in http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf). Spark's implementation was 4.5X faster than murder. The only issue with using TorrentBroadcast to deploy code/VM is writing a wrapper around it to read from disk, but it shouldn't be too complicated. If someone picks it up, I can give some pointers on how to proceed (I've thought about doing it myself forever, but never ended up actually taking the time; right now I don't have enough free cycles either). Otherwise, murder/BitTornado would be better than the current strategy we have. 
A third option would be to use rsync; but instead of rsync-ing to every slave from the master, one can simply rsync from the master first to one slave; then use the two sources (master and the first slave) to rsync to two more; then four, and so on. Might be a simpler solution without many changes. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.com wrote: My first thought would be to use libtorrent for this setup, and it turns out that both Twitter and Facebook do code deploys with a bittorrent setup. Twitter even released their code as open source: https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/ On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.com wrote: I am not an expert in this space either. I thought the initial rsync during launch is really just a straight copy that did not need the tree diff. So it seemed like having the slaves do the copying among each other would be better than having the master copy to everyone directly. That made me think of bittorrent, though there may well be other systems that do this. From the launches I did today it seems that it is taking around 1 minute per slave to launch a cluster, which can be a problem for clusters with 10s or 100s of slaves, particularly since on ec2 that time has to be paid for. On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com wrote: Out of curiosity, do you have a library in mind that would make it easy to set up a bittorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard ec2 launch scripts to reduce the chance of something breaking. 
On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching a rather large cluster on ec2. It seems like the launch is taking forever on: Setting up spark RSYNC'ing /root/spark to slaves... ... It seems that bittorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there are a lot of not very powerful slaves. Just a thought ... cheers Daniel
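Mosharaf's doubling rsync idea above (every node that already has the data seeds one new node per round, starting from the master alone) can be sketched in a few lines; the function name is invented for illustration:

```python
def doubling_rounds(num_slaves):
    """Rounds until all slaves have the data when every holder
    (master included) rsyncs to exactly one new node per round."""
    have = 1    # only the master has the data initially
    rounds = 0
    while have < num_slaves + 1:
        have *= 2   # each holder seeds one new machine
        rounds += 1
    return rounds

# A 100-slave cluster needs 7 rounds instead of 100 sequential copies.
```

The holder count doubles each round, so the round count grows logarithmically in the cluster size rather than linearly.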
persist @ disk-only failing
Hi all, When I set the persistence level to DISK_ONLY, Spark still tries to use memory and caches. Any reason? Do I need to override some parameter elsewhere? Thanks!
Re: Packaging a spark job using maven
Hi Eugen, Thanks for your help. I'm not familiar with the shade plugin and I was wondering: does it replace the assembly plugin? Also, do I have to specify all the artifacts and sub-artifacts in the artifactSet? Or can I just use a *:* wildcard and let the Maven scopes do their work? I get a lot of overlap warnings when I do so. Thanks for your help. Regards, Laurent -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-tp5615p6024.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Packaging a spark job using maven
2014-05-19 10:35 GMT+02:00 Laurent T laurent.thou...@ldmobile.net: Hi Eugen, Thanks for your help. I'm not familiar with the shade plugin and I was wondering: does it replace the assembly plugin? Nope, it doesn't replace it. It allows you to make fat jars and other nice things such as relocating classes to some other package. I am using it in combination with assembly and jdeb to build deployable archives (zip and debian). I find that building fat jars with the shade plugin is more powerful and easier than with assembly. Also, do I have to specify all the artifacts and sub-artifacts in the artifactSet? Or can I just use a *:* wildcard and let the Maven scopes do their work? I have a lot of overlap warnings when I do so. Indeed, you don't have to say exactly what must be included; I do so in order to end up with a small archive that we can quickly deploy. Have a look at the docs, which have some examples: http://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html In short, remove the includes and instead write excludes (spark, hadoop, etc.). The overlap warnings are due to the same classes being present in different jars; you can exclude those jars to remove the warnings. http://stackoverflow.com/questions/19987080/maven-shade-plugin-uber-jar-and-overlapping-classes http://stackoverflow.com/questions/11824633/maven-shade-plugin-warning-we-have-a-duplicate-how-to-fix Eugen
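As a rough illustration of the excludes-not-includes advice above, a minimal shade-plugin fragment might look like the following; the exact artifact patterns are assumptions and should be adapted to your own dependency tree:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <excludes>
        <!-- leave out what the cluster already provides -->
        <exclude>org.apache.spark:*</exclude>
        <exclude>org.apache.hadoop:*</exclude>
      </excludes>
    </artifactSet>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```

Everything not excluded ends up in the fat jar, which is why overlapping classes in different jars trigger the duplicate warnings.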
Re: sync master with slaves with bittorrent?
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler dmah...@gmail.com wrote: I agree that for updating, rsync is probably preferable, and it seems that for that purpose it would also parallelize well, since most of the time is spent computing checksums, so the process is not constrained by the total I/O capacity of the master. However it is a problem for the initial replication from the master to the slaves. If you are running on EC2, the dollar overhead of a sequential launch is quadratic in the number of slaves: if you launch a 100-machine cluster you will wait 100 minutes, but you will pay for 10,000 machine-minutes, or about 167 machine-hours, before anything useful starts to happen. Launch time does *not* increase linearly with the number of slaves as I thought I was seeing. It would still be nice to have a faster launch though. cheers Daniel
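Daniel's back-of-the-envelope cost figure can be checked with a quick sketch (function name invented; it assumes every instance is billed from cluster launch and slaves are synced one per minute):

```python
def sequential_launch_cost(num_slaves, minutes_per_slave=1.0):
    """Wall-clock minutes and billed machine-minutes when slaves are
    rsync'ed one at a time and all instances run for the whole launch."""
    wall = num_slaves * minutes_per_slave
    machine_minutes = num_slaves * wall  # every slave idles for the full launch
    return wall, machine_minutes

wall, billed = sequential_launch_cost(100)
# 100 minutes of wall clock; 10,000 machine-minutes (about 167 hours) billed
```

Wall-clock time is linear in the slave count under this assumption, but the billed machine-time is quadratic, which is what makes sequential rsync expensive at scale.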
Re: File present but file not found exception
why does it need to be a local file? why not do some filter ops on the HDFS file and save to HDFS, from where you can create the RDD? you can read a small file in on the driver program and use sc.parallelize to turn it into an RDD On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: I found that if a file is present on all the nodes at the given path in the local FS, then reading is possible. But is there a way to read if the file is present only on certain nodes?? [There should be a way!!] *NEED: Wanted to do some filter ops on an HDFS file, create a local file of the result, create an RDD out of it, and operate* Is there any way out?? Thanks in advance! On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi Everyone, I think all are pretty busy; the response time in this group has slightly increased. But anyway, this is a pretty silly problem that I could not get over. I have a file in my local FS, but when I try to create an RDD out of it, the task fails with a file-not-found exception in the log files. var file = sc.textFile("file:///home/sparkcluster/spark/input.txt"); file.top(1); input.txt exists in the above folder but still Spark couldn't find it. Do some parameters need to be set?? Any help is really appreciated. Thanks!!
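The pattern suggested above (read a small file on the driver, then hand the list to sc.parallelize) works because parallelize simply slices a driver-local list into partitions for the executors. A rough pure-Python sketch of that slicing, with a hypothetical helper name:

```python
def chunk_for_partitions(lines, num_partitions):
    """Split a driver-local list into num_partitions slices; roughly
    what sc.parallelize does before shipping the slices to executors."""
    n = len(lines)
    return [lines[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

parts = chunk_for_partitions(["a", "b", "c", "d", "e"], 2)
# every line lands in exactly one partition
```

This is why the driver-side read avoids the original problem: the file only has to exist on the driver machine, not on every worker.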
specifying worker nodes when using the repl?
Hi I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the Spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER): /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \ org.apache.spark.deploy.yarn.Client \ --jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar \ --class org.apache.spark.examples.SparkPi \ --args yarn-standalone \ --args 10 \ --num-workers 100 There does not appear to be an (obvious?) way to get more than 2 nodes involved from the repl. I am running the REPL like this: #!/bin/sh . /etc/spark/conf.cloudera.spark/spark-env.sh export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar export SPARK_WORKER_MEMORY=512m export MASTER=yarn-client exec $SPARK_HOME/bin/spark-shell Now if I comment out the line with `export SPARK_JAR=…’ and run this again, I get an error like this: 14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment variable! Usage: org.apache.spark.deploy.yarn.Client [options] Options: --jar JAR_PATH Path to your application's JAR file (required in yarn-cluster mode) --class CLASS_NAME Name of your application's main class (required) --args ARGS Arguments to be passed to your application's main class. Multiple invocations are possible, each will be passed in order. --num-workers NUM Number of workers to start (Default: 2) […] But none of those options are exposed at the `spark-shell’ level. Thanks in advance for your guidance. Eric
Re: Yarn configuration file doesn't work when run with yarn-client mode
I am encountering the same thing. Basic yarn apps work as does the SparkPi example, but my custom application gives this result. I am using compute-classpath to create the proper classpath for my application, same with SparkPi - was there a resolution to this issue? Thanks, Arun On Wed, Feb 12, 2014 at 1:28 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all When I run my application with yarn-client mode, it seems that the system didn’t load my configuration file correctly, because the local app master always tries to register with RM via a default IP 14/02/12 05:00:23 INFO SparkContext: Added JAR target/scala-2.10/rec_system_2.10-1.0.jar at http://172.31.37.160:51750/jars/rec_system_2.10-1.0.jar with timestamp 1392181223818 14/02/12 05:00:24 INFO RMProxy: Connecting to ResourceManager at / 0.0.0.0:8032 14/02/12 05:00:25 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/02/12 05:00:26 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/02/12 05:00:27 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) However, if I run in a standalone mode, everything works fine (YARN_CONF_DIR, SPARK_APP, SPARK_YARN_APP_JAR are all set correctly) is it a bug? Best, -- Nan Zhu
Re: persist @ disk-only failing
This is the patch for it: https://github.com/apache/spark/pull/50/. It might be possible to backport it to 0.8. Matei On May 19, 2014, at 2:04 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Matei, I am using 0.8.1 !! But is there a way without moving to 0.9.1 to bypass cache ? On Mon, May 19, 2014 at 1:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: What version is this with? We used to build each partition first before writing it out, but this was fixed a while back (0.9.1, but it may also be in 0.9.0). Matei On May 19, 2014, at 12:41 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi all, When i gave the persist level as DISK_ONLY, still Spark tries to use memory and caches. Any reason ? Do i need to override some parameter elsewhere ? Thanks !
Re: specifying worker nodes when using the repl?
Hi Eric, Have you tried setting the SPARK_WORKER_INSTANCES env variable before running spark-shell? http://spark.apache.org/docs/0.9.0/running-on-yarn.html -Sandy
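A minimal launcher combining Sandy's tip with Eric's existing script might look like this (paths and the worker count are taken from Eric's mail; SPARK_WORKER_INSTANCES is documented for Spark 0.9 on YARN, and defaults to 2, which is the behavior Eric was seeing):

```shell
#!/bin/sh
. /etc/spark/conf.cloudera.spark/spark-env.sh
export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=100   # number of YARN workers (default: 2)
export MASTER=yarn-client
exec $SPARK_HOME/bin/spark-shell
```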
Spark Streaming and Shark | Streaming Taking All CPUs
Hi All I am new to Spark. I was trying to use Spark Streaming and Shark at the same time. I was receiving messages from Kafka and pushing them to HDFS after minor processing. It was working fine, but it was taking all the CPUs, and at the same time in another terminal I tried to access Shark, but it kept on waiting until I stopped the listener. On the web console it showed that all 6 CPUs were taken by the Spark Streaming listener and Shark had zero CPUs. (I have a 3-node test cluster.) Please suggest. Thanks regards -- Anish Sneh http://in.linkedin.com/in/anishsneh
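One hedged suggestion, assuming Spark 0.9 on a standalone cluster (where by default the first application grabs every available core and later applications wait): cap the streaming app's cores so Shark can be scheduled alongside it, e.g. via spark.cores.max before starting the streaming job:

```shell
# sketch: limit the streaming app to 4 of the 6 cores (value is an example)
export SPARK_JAVA_OPTS="-Dspark.cores.max=4"
```

This leaves the remaining cores free for Shark; also remember a streaming app needs at least one core per receiver plus cores for processing.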
Re: unsubscribe
Hey Andrew, Since we're seeing so many of these e-mails, I think it's worth pointing out that it's not really obvious to find unsubscription information for the lists. The community link on the Spark site (http://spark.apache.org/community.html) does not have instructions for unsubscribing; it links to a different archive than the one you posted, which doesn't show that info either. The only place right now I can see that info (without going to the generic Apache link you posted) is by looking at the e-mail's source, where there is a header with the unsubscribe address. So maybe the web site should also list the unsubscribe address, or link to the Apache archive instead of Nabble? I know many people might not like it, but maybe the list messages should have a footer with this administrative info (even if it's just a link to the archive page)? On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote: If you'd like to get off this mailing list, please send an email to user-unsubscr...@spark.apache.org, not the regular user@spark.apache.org list. How to use the Apache mailing list infrastructure is documented here: https://www.apache.org/foundation/mailinglists.html And the Spark User list specifically can be found here: http://mail-archives.apache.org/mod_mbox/spark-user/ -- Marcelo
spark ec2 commandline tool error VPC security groups may not be used for a non-VPC launch
Hi, I'm attempting to run spark-ec2 launch on AWS. My AWS instances would be in our EC2 VPC (which seems to be causing a problem). The two security groups MyClusterName-master and MyClusterName-slaves have already been set up with the same ports open as the security group that spark-ec2 tries to create. (My company has security rules where I don't have permission to create security groups, so they have to be created by someone else ahead of time.) I'm getting the error VPC security groups may not be used for a non-VPC launch when I try to run spark-ec2 launch. Is there something I need to do to make spark-ec2 launch the master and slave instances within the VPC? Here's the command line and the error that I get... command line (I've changed the cluster name to something generic): $SPARK_HOME/ec2/spark-ec2 --key-pair=MyKeyPair '--identity-file=~/.ssh/id_mysshkey' --slaves=2 --instance-type=m3.large --region=us-east-1 --zone=us-east-1a --ami=myami --spark-version=0.9.1 launch MyClusterName error: ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC security groups may not be used for a non-VPC launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response> Setting up security groups... Searching for existing cluster MyClusterName... Launching instances... 
Traceback (most recent call last): File "./spark_ec2.py", line 806, in <module> main() File "./spark_ec2.py", line 799, in main real_main() File "./spark_ec2.py", line 682, in real_main conn, opts, cluster_name) File "./spark_ec2.py", line 344, in launch_cluster block_device_map = block_map) File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/image.py", line 255, in run File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 678, in run_instances File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 925, in get_object boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC security groups may not be used for a non-VPC launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response> Thanks for your help!! Matt
Re: unsubscribe
Agree that the links to the archives should probably point to the Apache archives rather than Nabble's, so the unsubscribe documentation is clearer. Also, an (unsubscribe) link right next to subscribe with the email already generated could help too. I'd be one of those highly against a footer on every email. Who can edit the community page? http://spark.apache.org/community.html It's not in the git repo.
Re: sync master with slaves with bittorrent?
On the ec2 machines, you can update the slaves from the master using something like ~/spark-ec2/copy-dir ~/spark. Spark's TorrentBroadcast relies on the Block Manager to distribute blocks, making it relatively hard to extract. On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com wrote: btw is there a command or script to update the slaves from the master? thanks Daniel
Re: sync master with slaves with bittorrent?
Good catch. In that case, using BitTornado/murder would be better. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Mon, May 19, 2014 at 11:17 AM, Aaron Davidson ilike...@gmail.com wrote: On the ec2 machines, you can update the slaves from the master using something like ~/spark-ec2/copy-dir ~/spark. Spark's TorrentBroadcast relies on the Block Manager to distribute blocks, making it relatively hard to extract. On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com wrote: btw is there a command or script to update the slaves from the master? thanks Daniel On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote: If the codebase for Spark's broadcast is pretty self-contained, you could consider creating a small bootstrap sent out via the doubling rsync strategy that Mosharaf outlined above (called Tree D=2 in the paper) that then pulled the larger Mosharaf, do you have a sense of whether the gains from using Cornet vs Tree D=2 with rsync outweighs the overhead of using a 2-phase broadcast mechanism? Andrew On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.comwrote: One issue with using Spark itself is that this rsync is required to get Spark to work... Also note that a similar strategy is used for *updating* the spark cluster on ec2, where the diff aspect is much more important, as you might only make a small change on the driver node (recompile or reconfigure) and can get a fast sync. On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury mosharafka...@gmail.com wrote: What twitter calls murder, unless it has changed since then, is just a BitTornado wrapper. In 2011, We did some comparison on the performance of murder and the TorrentBroadcast we have right now for Spark's own broadcast (Section 7.1 in http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf). Spark's implementation was 4.5X faster than murder. 
The only issue with using TorrentBroadcast to deploy code/VM is writing a wrapper around it to read from disk, but it shouldn't be too complicated. If someone picks it up, I can give some pointers on how to proceed (I've thought about doing it myself forever, but never ended up actually taking the time; right now I don't have enough free cycles either) Otherwise, murder/BitTornado would be better than the current strategy we have. A third option would be to use rsync; but instead of rsync-ing to every slave from the master, one can simply rsync from the master first to one slave; then use the two sources (master and the first slave) to rsync to two more; then four and so on. Might be a simpler solution without many changes. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.comwrote: My first thought would be to use libtorrent for this setup, and it turns out that both Twitter and Facebook do code deploys with a bittorrent setup. Twitter even released their code as open source: https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/ On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.comwrote: I am not an expert in this space either. I thought the initial rsync during launch is really just a straight copy that did not need the tree diff. So it seemed like having the slaves do the copying among it each other would be better than having the master copy to everyone directly. That made me think of bittorrent, though there may well be other systems that do this. From the launches I did today it seems that it is taking around 1 minute per slave to launch a cluster, which can be a problem for clusters with 10s or 100s of slaves, particularly since on ec2 that time has to be paid for. 
On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com wrote: Out of curiosity, do you have a library in mind that would make it easy to set up a bittorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard ec2 launch scripts to reduce the chance of something breaking. On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching a rather large cluster on ec2. It seems like the launch is taking forever on Setting up spark RSYNC'ing /root/spark to slaves... ... It seems that bittorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there are a lot of not very powerful slaves. Just a thought ... cheers Daniel
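[Editor's note] The "doubling rsync" strategy Mosharaf describes (Tree D=2 in the Orchestra paper) can be sketched without any cluster at all. Below is a minimal, hypothetical Python sketch (host names invented) that computes the schedule: each round, every host that already has the data syncs to one host that does not, so coverage doubles per round.

```python
# Sketch of the "doubling rsync" (Tree D=2) schedule: each round pairs every
# up-to-date host with one waiting host, so the up-to-date set doubles.
def doubling_rsync_schedule(master, slaves):
    """Return a list of rounds; each round is a list of (source, dest) pairs."""
    have = [master]        # hosts that already hold the data
    need = list(slaves)    # hosts still waiting for it
    rounds = []
    while need:
        pairs = list(zip(have, need))  # one rsync per up-to-date host
        rounds.append(pairs)
        have = have + [dest for _, dest in pairs]
        need = need[len(pairs):]
    return rounds
```

With one master and seven slaves this finishes in three rounds (1 -> 2 -> 4 -> 8 hosts) instead of seven sequential copies from the master, which is the gain being discussed above.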
Re: How to run the SVM and LogisticRegression
Thanks Xiangrui, But I did not find the directory: examples/src/main/scala/org/apache/spark/examples/mllib. Could you give me more detail or show me one example? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6049.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Which component(s) of Spark do not support IPv6?
Besides Hadoop, are there any other components of Spark that do not support IPv6? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Which-component-s-of-Spark-do-not-support-IPv6-tp6050.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: specifying worker nodes when using the repl?
Sandy, thank you so much — that was indeed my omission! Eric On May 19, 2014, at 10:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Eric, Have you tried setting the SPARK_WORKER_INSTANCES env variable before running spark-shell? http://spark.apache.org/docs/0.9.0/running-on-yarn.html -Sandy On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.net wrote: Hi I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER) /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \ org.apache.spark.deploy.yarn.Client \ --jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar \ --class org.apache.spark.examples.SparkPi \ --args yarn-standalone \ --args 10 \ --num-workers 100 There does not appear to be an (obvious?) way to get more than 2 nodes involved from the repl. I am running the REPL like this: #!/bin/sh . /etc/spark/conf.cloudera.spark/spark-env.sh export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar export SPARK_WORKER_MEMORY=512m export MASTER=yarn-client exec $SPARK_HOME/bin/spark-shell Now if I comment out the line with `export SPARK_JAR=…’ and run this again, I get an error like this: 14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment variable! Usage: org.apache.spark.deploy.yarn.Client [options] Options: --jar JAR_PATH Path to your application's JAR file (required in yarn-cluster mode) --class CLASS_NAME Name of your application's main class (required) --args ARGS Arguments to be passed to your application's main class. Multiple invocations are possible, each will be passed in order. --num-workers NUM Number of workers to start (Default: 2) […] But none of those options are exposed at the `spark-shell’ level. Thanks in advance for your guidance. Eric
Re: How to compile the examples directory?
If you’d like to work on just this code for your own changes, it might be best to copy it to a separate project. Look at http://spark.apache.org/docs/latest/quick-start.html for how to set up a standalone job. Matei On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote: Hi, I am running some examples of Spark on a cluster. Because I need to modify some source code, I have to re-compile the whole Spark using `sbt/sbt assembly`, which takes a long time. I have tried `mvn package` under the examples directory; it failed because of a dependency problem. Any way to avoid compiling the whole Spark project? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com
combinebykey throw classcastexception
I am using CDH5 on a three-machine cluster. I map data from HBase as (String, V) pairs, then call combineByKey like this: .combineByKey[C]( (v: V) => new C(v), // this line throws java.lang.ClassCastException: C cannot be cast to V (c: C, v: V) => C, (c1: C, c2: C) => C) I am very confused by this; there isn't any C-to-V casting at all. What's wrong? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/combinebykey-throw-classcastexception-tp6059.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
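[Editor's note] Without the real C and V types the cast itself can't be diagnosed here, but the type contract of the three functions is worth making concrete: createCombiner is V -> C, mergeValue is (C, V) -> C, and mergeCombiners is (C, C) -> C. A plain-Python sketch of those semantics (no Spark involved; V = int and C = list are illustrative choices):

```python
# Plain-Python simulation of combineByKey's contract. A ClassCastException
# like the one above typically means one of the three functions received or
# returned a value of the wrong type for its slot.
def combine_by_key(pairs, create_combiner, merge_value, merge_combiners):
    # Simulate two partitions, then merge their per-key combiners.
    mid = len(pairs) // 2
    parts = []
    for chunk in (pairs[:mid], pairs[mid:]):
        acc = {}
        for k, v in chunk:
            # mergeValue gets a combiner C built earlier, never a raw V
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        parts.append(acc)
    merged = parts[0]
    for k, c in parts[1].items():
        merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

result = combine_by_key(
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    create_combiner=lambda v: [v],           # V -> C
    merge_value=lambda c, v: c + [v],        # (C, V) -> C
    merge_combiners=lambda c1, c2: c1 + c2,  # (C, C) -> C
)
```

The point of the simulation: the second function must accept a combiner, not a value, which is exactly the boundary where a mistyped lambda would produce a cast error in Scala.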
Re: advice on maintaining a production spark cluster?
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using akka cluster (if that's what spark is using in stand-alone mode) -- are there configuration options we could be setting to make the cluster more robust? We have a custom script which monitors the number of workers (through the web interface) and restarts the cluster when necessary, as well as resolving other issues we face (like spark shells left open permanently claiming resources), and it works, but it's nowhere close to a great solution. What are other folks doing? Is this something that other folks observe as well? I suspect that the loss of workers is tied to jobs that run out of memory on the client side or our use of very large broadcast variables, but I don't have an isolated test case. I'm open to general answers here: for example, perhaps we should simply be using mesos or yarn instead of stand-alone mode. --j
Re: How to run the SVM and LogisticRegression
Thanks Xiangrui, Sorry, I am new to Spark. Could you give me more detail about master or branch-1.0? I do not know what master or branch-1.0 is. Thanks again. On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User List] ml-node+s1001560n6064...@n3.nabble.com wrote: Checkout the master or branch-1.0. Then the examples should be there. -Xiangrui On Mon, May 19, 2014 at 11:36 AM, yxzhao [hidden email] wrote: Thanks Xiangrui, But I did not find the directory: examples/src/main/scala/org/apache/spark/examples/mllib. Could you give me more detail or show me one example? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6065.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to run the SVM and LogisticRegression
Hi yxzhao, Those are branches in the source code git repository. You can get to them with git checkout branch-1.0 once you've cloned the git repository. Cheers, Andrew On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote: Thanks Xiangrui, Sorry I am new for Spark, could you give me more detail about master or branch-1.0 I do not know what master or branch-1.0 is. Thanks again. -- View this message in context: Re: How to run the SVM and LogisticRegression http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6065.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark stalling during shuffle (maybe a memory issue)
Has anyone observed Spark worker threads stalling during a shuffle phase with the following message (one per worker host) being echoed to the terminal on the driver thread? INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 0 to [worker host]... At this point Spark-related activity on the hadoop cluster completely halts: there's no network activity, disk IO or CPU activity, and individual tasks are not completing; the job just sits in this state. At this point we just kill the job; a restart of the Spark service is required. Using identical jobs we were able to bypass this halt point by increasing available heap memory to the workers, but it's odd we don't get an out-of-memory error or any error at all. Upping the memory available isn't a very satisfying answer to what may be going on :) We're running Spark 0.9.0 on CDH5.0 in stand-alone mode. Thanks for any help or ideas you may have! Cheers, Jonathan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Reading from .bz2 files with Spark
Hi Xiangrui, many thanks to you and Sandy for fixing this issue! On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, I submitted a patch and verified it solves the problem. You can download the patch from https://issues.apache.org/jira/browse/HADOOP-10614 . Best, Xiangrui On Fri, May 16, 2014 at 6:48 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, This is the JIRA I created: https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully someone wants to work on it. Best, Xiangrui On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, I could reproduce the bug with Hadoop 2.2.0. Some older versions of Hadoop do not support splittable compression, so you ended up with sequential reads. It is easy to reproduce the bug with the following setup: 1) Workers are configured with multiple cores. 2) BZip2 files are big enough or minPartitions is large enough when you load the file via sc.textFile(), so that one worker has more than one task. Best, Xiangrui On Fri, May 16, 2014 at 4:06 PM, Andrew Ash and...@andrewash.com wrote: Hi Xiangrui, // FYI I'm getting your emails late due to the Apache mailing list outage I'm using CDH4.4.0, which I think uses the MapReduce v2 API. The .jars are named like this: hadoop-hdfs-2.0.0-cdh4.4.0.jar I'm also glad you were able to reproduce! Please paste a link to the Hadoop bug you file so I can follow along. Thanks! Andrew On Tue, May 13, 2014 at 9:08 AM, Xiangrui Meng men...@gmail.com wrote: Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes the problem you described, but it does contain several fixes to bzip2 format. -Xiangrui On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote: Hi all, Is anyone reading and writing to .bz2 files stored in HDFS from Spark with success?
I'm finding the following results on a recent commit (756c96 from 24hr ago) and CDH 4.4.0: Works: val r = sc.textFile("/user/aa/myfile.bz2").count Doesn't work: val r = sc.textFile("/user/aa/myfile.bz2").map((s: String) => s + "|").count Specifically, I'm getting an exception coming out of the bzip2 libraries (see below stacktraces), which is unusual because I'm able to read from that file without an issue using the same libraries via Pig. It was originally created from Pig as well. Digging a little deeper I found this line in the .bz2 decompressor's javadoc for CBZip2InputStream: Instances of this class are not threadsafe. [source] My current working theory is that Spark has a much higher level of parallelism than Pig/Hadoop does and thus I get these wild IndexOutOfBounds exceptions much more frequently (as in can't finish a run over a little 2M row file) vs hardly at all in other libraries. The only other reference I could find to the issue was in presto-users, but the recommendation to leave .bz2 for .lzo doesn't help if I actually do want the higher compression levels of .bz2. Would love to hear if I have some kind of configuration issue or if there's a bug in .bz2 that's fixed in later versions of CDH, or generally any other thoughts on the issue. Thanks!
Andrew Below are examples of some exceptions I'm getting: 14/05/07 15:09:49 WARN scheduler.TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 65535 at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.hbCreateDecodeTables(CBZip2InputStream.java:663) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.createHuffmanDecodingTables(CBZip2InputStream.java:790) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:762) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:798) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397) at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426) at java.io.InputStream.read(InputStream.java:101) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203) at
life if an executor
from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
Re: How to run the SVM and LogisticRegression
Thanks Andrew, Yes, I have downloaded the master code. But, actually I just want to know how to run the classification algorithms SVM and LogisticRegression implemented under /spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification . Thanks. On Mon, May 19, 2014 at 10:37 PM, Andrew Ash [via Apache Spark User List] ml-node+s1001560n6066...@n3.nabble.com wrote: Hi yxzhao, Those are branches in the source code git repository. You can get to them with git checkout branch-1.0 once you've cloned the git repository. Cheers, Andrew -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6070.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Problem when sorting big file
Is your RDD of Strings? If so, you should make sure to use the Kryo serializer instead of the default Java one. It stores strings as UTF8 rather than Java's default UTF16 representation, which can save you half the memory usage in the right situation. Try setting the persistence level on the RDD to MEMORY_AND_DISK_SER and possibly also lowering spark.storage.memoryFraction from 0.6 to 0.4 or so. Andrew On Thu, May 15, 2014 at 2:55 PM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote: Hi there: I have this dataset (about 12G) which I need to sort by key. I used the sortByKey method but when I try to save the file to disk (HDFS in this case) it seems that some tasks run out of time because they have too much data to save and it can't fit in memory. I say this because before the TimeOut exception at the worker there is an OOM exception from an specific task. My question is: is this a common problem at Spark? has anyone been through this issue? The cause of the problem seems to be an unbalanced distribution of data between tasks. I will appreciate any help. Thanks Gustavo
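[Editor's note] Andrew's halving claim is easy to verify locally. The sketch below uses pure Python (no Spark) to compare the byte counts of an ASCII-heavy string under UTF-8 (what Kryo uses for serialized strings) versus UTF-16 (the JVM's in-memory String representation in this era); the sample string is invented.

```python
# ASCII text costs 1 byte/char in UTF-8 but 2 bytes/char in UTF-16,
# which is roughly the memory halving mentioned for Kryo-serialized strings.
s = "some-sort-key-000123"
utf8_bytes = len(s.encode("utf-8"))
utf16_bytes = len(s.encode("utf-16-le"))  # -le variant avoids the 2-byte BOM
ratio = utf16_bytes / utf8_bytes
```

For non-ASCII-heavy data (e.g. CJK text) the ratio shrinks or reverses, so the saving depends on the data, as the reply's "in the right situation" hedge suggests.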
facebook data mining with Spark
Is there any way to get facebook data into Spark and filter the content of it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/facebook-data-mining-with-Spark-tp6072.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Setting queue for spark job on yarn
Hi, How does one submit a spark job to yarn and specify a queue? The code that successfully submits to yarn is: val conf = new SparkConf() val sc = new SparkContext("yarn-client", "Simple App", conf) Where do I need to specify the queue? Thanks in advance for any help on this... Thanks, Ron
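[Editor's note] The thread leaves this unanswered here, so the following is a hedged sketch rather than a confirmed answer: the era's running-on-yarn docs list an environment variable for the allocation queue, and later releases expose it as the `spark.yarn.queue` property; check the docs for your exact Spark version. The queue name `analytics` below is invented.

```python
# Hedged sketch: naming a YARN queue for a Spark app. Verify the setting
# names against the running-on-yarn docs for your Spark version.
import os

# Env var listed in the 0.9-era YARN docs, read before the context starts:
os.environ["SPARK_YARN_QUEUE"] = "analytics"
# In later releases the equivalent is a Spark conf property, e.g.:
#   conf = SparkConf().set("spark.yarn.queue", "analytics")
queue = os.environ["SPARK_YARN_QUEUE"]
```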
Re: life if an executor
They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long to they live for? as long as the worker/slave? or are they tied to the sparkcontext and life/die with it? thx
Re: filling missing values in a sequence
Thanks Sean. Yes, your solution works :-) I did oversimplify my real problem, which has other parameters that go along with the sequence. On Fri, May 16, 2014 at 3:03 AM, Sean Owen so...@cloudera.com wrote: Not sure if this is feasible, but this literally does what I think you are describing: sc.parallelize(rdd1.first to rdd1.last) On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...) I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11) One way to do this is to slide and zip rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...)) x = rdd1.first rdd2 = rdd1 filter (_ != x) rdd3 = rdd2 zip rdd1 rdd4 = rdd3 flatMap { case (x, y) => /* generate missing elements between x and y */ } Another method which I think is more efficient is to use mapPartitions() on rdd1 to be able to iterate on elements of rdd1 in each partition. However, that leaves the boundaries of the partitions to be unfilled. Is there a way within the function passed to mapPartitions, to read the first element in the next partition? The latter approach also appears to work for a general sliding window calculation on the RDD. The former technique requires a lot of sliding and zipping and I believe it is not efficient. If only I could read the next partition...I have tried passing a pointer to rdd1 to the function passed to mapPartitions but the rdd1 pointer turns out to be NULL, I guess because Spark cannot deal with a mapper calling another mapper (since it happens on a worker not the driver) Mohit.
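[Editor's note] The slide-and-zip idea in the thread can be checked locally without Spark; a minimal Python sketch over a plain list (the partition-boundary problem the thread worries about disappears here because the whole sequence is in one place):

```python
# Fill missing integers in a sorted sequence by pairing each element with
# its successor ("slide and zip") and emitting the range between them.
def fill_gaps(xs):
    out = []
    for a, b in zip(xs, xs[1:]):  # successor pairs: (1,2), (2,3), (3,5), ...
        out.extend(range(a, b))   # emits a, a+1, ..., b-1
    out.append(xs[-1])            # the pairing drops the final element
    return out
```

For example, `fill_gaps([1, 2, 3, 5, 8, 11])` yields 1 through 11, matching the desired output in the thread.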
Re: life if an executor
I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice however if the RDD cache can be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one. On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long to they live for? as long as the worker/slave? or are they tied to the sparkcontext and life/die with it? thx
Re: life if an executor
That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote: I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice however if the RDD cache can be dissociated from the JVM heap so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a few one. On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long to they live for? as long as the worker/slave? or are they tied to the sparkcontext and life/die with it? thx