For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-19 Thread Hao Wang
Hi,

Oracle JDK and OpenJDK, which one is better or preferred for Spark?


Regards,
Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com


Re: For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-19 Thread Gordon Wang
I would say that Oracle JDK may be the better choice. A lot of
Hadoop distribution vendors use the Oracle JDK rather than OpenJDK for
enterprise deployments.
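
Whichever JDK is chosen, it can be worth verifying which JVM the executors are
actually running, as opposed to what happens to be installed on the driver. A
small sketch that can be run from spark-shell (the element and partition counts
are arbitrary, just enough to touch every executor):

  // Report the distinct JVM vendor/version strings seen on the executors.
  sc.parallelize(1 to 1000, 100)
    .map(_ => System.getProperty("java.vendor") + " " + System.getProperty("java.version"))
    .distinct()
    .collect()
    .foreach(println)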


On Mon, May 19, 2014 at 2:50 PM, Hao Wang wh.s...@gmail.com wrote:

 Hi,

 Oracle JDK and OpenJDK, which one is better or preferred for Spark?


 Regards,
 Wang Hao(王灏)

 CloudTeam | School of Software Engineering
 Shanghai Jiao Tong University
 Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
 Email:wh.s...@gmail.com




-- 
Regards
Gordon Wang


Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
btw is there a command or script to update the slaves from the master?

thanks
Daniel


On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote:

 If the codebase for Spark's broadcast is pretty self-contained, you could
 consider creating a small bootstrap sent out via the doubling rsync
 strategy that Mosharaf outlined above (called Tree D=2 in the paper) that
 then pulled the larger

 Mosharaf, do you have a sense of whether the gains from using Cornet vs
 Tree D=2 with rsync outweigh the overhead of using a 2-phase broadcast
 mechanism?

 Andrew


 On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.comwrote:

 One issue with using Spark itself is that this rsync is required to get
 Spark to work...

 Also note that a similar strategy is used for *updating* the spark
 cluster on ec2, where the diff aspect is much more important, as you
 might only make a small change on the driver node (recompile or
 reconfigure) and can get a fast sync.


 On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury 
 mosharafka...@gmail.com wrote:

 What Twitter calls murder, unless it has changed since then, is just a
 BitTornado wrapper. In 2011, we did some comparisons of the performance of
 murder and the TorrentBroadcast we have right now for Spark's own broadcast
 (Section 7.1 in
 http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
 Spark's implementation was 4.5X faster than murder.

 The only issue with using TorrentBroadcast to deploy code/VM is writing
 a wrapper around it to read from disk, but it shouldn't be too complicated.
 If someone picks it up, I can give some pointers on how to proceed (I've
 thought about doing it myself forever, but never ended up actually taking
 the time; right now I don't have enough free cycles either)

 Otherwise, murder/BitTornado would be better than the current strategy
 we have.
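
For illustration, a rough sketch of that disk wrapper idea, assuming the archive
fits in driver memory and that one task per slave is enough to materialize it
everywhere (as Aaron notes above, this only helps once Spark itself is already
running; the paths and slave count are made up):

  import java.nio.file.{Files, Paths}
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("broadcast-deploy")
    .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
  val sc = new SparkContext(conf)

  val numSlaves = 100                                                 // hypothetical cluster size
  val payload = Files.readAllBytes(Paths.get("/root/spark.tar.gz"))   // hypothetical archive on the driver
  val bc = sc.broadcast(payload)

  // One partition per slave; each task writes the broadcast bytes to its local disk.
  // (Nothing here guarantees the tasks land on distinct hosts; this is only a sketch.)
  sc.parallelize(1 to numSlaves, numSlaves).foreach { _ =>
    Files.write(Paths.get("/tmp/spark.tar.gz"), bc.value)
  }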

 A third option would be to use rsync; but instead of rsync-ing to every
 slave from the master, one can simply rsync from the master first to one
 slave; then use the two sources (master and the first slave) to rsync to
 two more; then four and so on. Might be a simpler solution without many
 changes.
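
A very rough sketch of that doubling fan-out, driven from the master over ssh
(host names are placeholders; it assumes passwordless ssh between the nodes and
rsync installed everywhere):

  import scala.sys.process._

  val master = "master"                                  // placeholder host names
  val slaves = Vector("slave1", "slave2", "slave3", "slave4", "slave5")

  var sources = Vector(master)                           // hosts that already have /root/spark
  var pending = slaves                                   // hosts still waiting for it
  while (pending.nonEmpty) {
    val batch = pending.take(sources.size)               // each synced host feeds one new host
    batch.zip(sources).par.foreach { case (dst, src) =>
      Seq("ssh", src, s"rsync -az /root/spark/ root@$dst:/root/spark/").!
    }
    pending = pending.drop(batch.size)
    sources ++= batch                                    // the set of senders doubles every round
  }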

 --
 Mosharaf Chowdhury
 http://www.mosharaf.com/


 On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.comwrote:

 My first thought would be to use libtorrent for this setup, and it
 turns out that both Twitter and Facebook do code deploys with a bittorrent
 setup.  Twitter even released their code as open source:


 https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent


 http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/


 On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.comwrote:

 I am not an expert in this space either. I thought the initial rsync
 during launch is really just a straight copy that did not need the tree
 diff. So it seemed like having the slaves do the copying among each
 other would be better than having the master copy to everyone directly.
 That made me think of bittorrent, though there may well be other systems
 that do this.
 From the launches I did today it seems that it is taking around 1
 minute per slave to launch a cluster, which can be a problem for clusters
 with 10s or 100s of slaves, particularly since on ec2  that time has to be
 paid for.


 On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson 
 ilike...@gmail.comwrote:

 Out of curiosity, do you have a library in mind that would make it
 easy to setup a bit torrent network and distribute files in an rsync 
 (i.e.,
 apply a diff to a tree, ideally) fashion? I'm not familiar with this 
 space,
 but we do want to minimize the complexity of our standard ec2 launch
 scripts to reduce the chance of something breaking.


 On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.comwrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there are a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel










persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Hi all,

When I give the persist level as DISK_ONLY, Spark still tries to use memory
and caches.
Any reason?
Do I need to override some parameter elsewhere?

Thanks !
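
For reference, the persist call itself would look like the sketch below (the
path is hypothetical). As Matei explains in his reply further down, 0.8.x still
built each partition in memory before writing it out, which is likely what is
being observed here.

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile("hdfs:///some/input").persist(StorageLevel.DISK_ONLY)  // hypothetical path
  rdd.count()  // the first action materializes the RDD; blocks should go to disk, not the in-memory cache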


Re: Packaging a spark job using maven

2014-05-19 Thread Laurent T
Hi Eugen,

Thanks for your help. I'm not familiar with the shade plugin and I was
wondering: does it replace the assembly plugin? Also, do I have to specify
all the artifacts and sub-artifacts in the artifactSet, or can I just use a
*:* wildcard and let the Maven scopes do their work? I get a lot of
overlap warnings when I do so.

Thanks for your help.
Regards,
Laurent



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-tp5615p6024.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Packaging a spark job using maven

2014-05-19 Thread Eugen Cepoi
2014-05-19 10:35 GMT+02:00 Laurent T laurent.thou...@ldmobile.net:

 Hi Eugen,

 Thanks for your help. I'm not familiar with the shaded plugin and i was
 wondering: does it replace the assembly plugin ?


Nope, it doesn't replace it. It allows you to make fat jars and other nice
things such as relocating classes to some other package.

I am using it in combination with assembly and jdeb to build deployable
archives (zip and Debian). I find that building fat jars with the shade plugin
is more powerful and easier than with assembly.


 Also, do i have to specify
 all the artifacts and sub artifacts in the artifactSet ? Or can i just use
 a
 *:* wildcard and let the maven scopes do their work ? I have a lot of
 overlap warnings when i do so.


Indeed, you don't have to spell out exactly what must be included; I do so in
order to end up with a small archive that we can deploy quickly. Have a
look at the doc, which has some examples:
http://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html

In short, you remove the includes and instead write the excludes (spark,
hadoop, etc.). The overlap is due to the same classes being present in different
jars. You can exclude those jars to remove the warnings.

http://stackoverflow.com/questions/19987080/maven-shade-plugin-uber-jar-and-overlapping-classes
http://stackoverflow.com/questions/11824633/maven-shade-plugin-warning-we-have-a-duplicate-how-to-fix

Eugen




 Thanks for your help.
 Regards,
 Laurent



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-tp5615p6024.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler dmah...@gmail.com wrote:

 I agree that for updating, rsync is probably preferable, and it seems like
 for that purpose it would also parallelize well, since most of the time is
 spent computing checksums, so the process is not constrained by the total
 i/o capacity of the master. However, it is a problem for the initial
 replication of the master to the slaves. If you are running on ec2 then the
 dollar overhead of launching is quadratic in the number of slaves. If you
 launch a 100-machine cluster you will wait 100 minutes, but you will pay
 for 10,000 machine minutes, or about 167 hours,
 before anything useful starts to happen.


Launch time does *not* increase linearly with the number of slaves, as I thought I
was seeing.
It would still be nice to have a faster launch though.

cheers
Daniel



 On Mon, May 19, 2014 at 1:32 AM, Aaron Davidson ilike...@gmail.comwrote:

 One issue with using Spark itself is that this rsync is required to get
 Spark to work...

  Also note that a similar strategy is used for *updating* the spark
 cluster on ec2, where the diff aspect is much more important, as you
 might only make a small change on the driver node (recompile or
 reconfigure) and can get a fast sync.


 On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury 
 mosharafka...@gmail.com wrote:

 What twitter calls murder, unless it has changed since then, is just a
 BitTornado wrapper. In 2011, We did some comparison on the performance of
 murder and the TorrentBroadcast we have right now for Spark's own broadcast
 (Section 7.1 in
 http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
 Spark's implementation was 4.5X faster than murder.

 The only issue with using TorrentBroadcast to deploy code/VM is writing
 a wrapper around it to read from disk, but it shouldn't be too complicated.
 If someone picks it up, I can give some pointers on how to proceed (I've
 thought about doing it myself forever, but never ended up actually taking
 the time; right now I don't have enough free cycles either)

 Otherwise, murder/BitTornado would be better than the current strategy
 we have.

 A third option would be to use rsync; but instead of rsync-ing to every
 slave from the master, one can simply rsync from the master first to one
 slave; then use the two sources (master and the first slave) to rsync to
 two more; then four and so on. Might be a simpler solution without many
 changes.

 --
 Mosharaf Chowdhury
 http://www.mosharaf.com/


 On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.comwrote:

 My first thought would be to use libtorrent for this setup, and it
 turns out that both Twitter and Facebook do code deploys with a bittorrent
 setup.  Twitter even released their code as open source:


 https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent


 http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/


 On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.comwrote:

 I am not an expert in this space either. I thought the initial rsync
 during launch is really just a straight copy that did not need the tree
 diff. So it seemed like having the slaves do the copying among it each
 other would be better than having the master copy to everyone directly.
 That made me think of bittorrent, though there may well be other systems
 that do this.
 From the launches I did today it seems that it is taking around 1
 minute per slave to launch a cluster, which can be a problem for clusters
 with 10s or 100s of slaves, particularly since on ec2  that time has to be
 paid for.


 On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson 
 ilike...@gmail.comwrote:

 Out of curiosity, do you have a library in mind that would make it
 easy to setup a bit torrent network and distribute files in an rsync 
 (i.e.,
 apply a diff to a tree, ideally) fashion? I'm not familiar with this 
 space,
 but we do want to minimize the complexity of our standard ec2 launch
 scripts to reduce the chance of something breaking.


 On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.comwrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel










Re: File present but file not found exception

2014-05-19 Thread Koert Kuipers
Why does it need to be a local file? Why not do some filter ops on the HDFS file
and save to HDFS, from where you can create an RDD?

You can read a small file in on the driver program and use sc.parallelize to
turn it into an RDD.
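
A minimal sketch of that second suggestion, using the path from the original
message and assuming the file is small enough to read on the driver:

  import scala.io.Source

  val lines = Source.fromFile("/home/sparkcluster/spark/input.txt").getLines().toList  // driver-side read
  val file = sc.parallelize(lines)  // distribute the lines as an RDD
  file.top(1)
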
On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:

 I found that if a file is present on all the nodes at the given path in
 the local FS, then reading is possible.

 But is there a way to read it if the file is present only on certain nodes?
 [There should be a way!]

 NEED: I want to do some filter ops on an HDFS file, create a local file of
 the result, create an RDD out of it, and operate on that.

 Is there any way out ??

 Thanks in advance !




 On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Hi Everyone,

 I think everyone is pretty busy; the response time in this group has
 increased slightly.

 But anyway, this is a pretty silly problem that I could not get over.

 I have a file in my local FS, but when I try to create an RDD out of it,
 tasks fail and a file-not-found exception is thrown in the log files.

 var file = sc.textFile("file:///home/sparkcluster/spark/input.txt")
 file.top(1)

 input.txt exists in the above folder but Spark still couldn't find it. Do some
 parameters need to be set?

 Any help is really appreciated. Thanks !!





specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Hi

I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to
get the Spark REPL to use more than 2 nodes in an interactive session.

So, this works, but is non-interactive (using yarn-client as MASTER)

/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \
  org.apache.spark.deploy.yarn.Client \
  --jar 
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar
 \
  --class org.apache.spark.examples.SparkPi \
  --args yarn-standalone \
  --args 10 \
  --num-workers 100

There does not appear to be an (obvious?) way to get more than 2 nodes involved 
from the repl.

I am running the REPL like this:

#!/bin/sh

. /etc/spark/conf.cloudera.spark/spark-env.sh

export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar

export SPARK_WORKER_MEMORY=512m

export MASTER=yarn-client

exec $SPARK_HOME/bin/spark-shell

Now if I comment out the line with `export SPARK_JAR=…’ and run this again, I 
get an error like this:

14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment 
variable!
Usage: org.apache.spark.deploy.yarn.Client [options] 
Options:
  --jar JAR_PATH Path to your application's JAR file (required in 
yarn-cluster mode)
  --class CLASS_NAME Name of your application's main class (required)
  --args ARGSArguments to be passed to your application's main 
class.
 Mutliple invocations are possible, each will be 
passed in order.
  --num-workers NUM  Number of workers to start (Default: 2)
  […]

But none of those options are exposed at the `spark-shell’ level.

Thanks in advance for your guidance.

Eric

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-19 Thread Arun Ahuja
I am encountering the same thing.  Basic yarn apps work as does the SparkPi
example, but my custom application gives this result.  I am using
compute-classpath to create the proper classpath for my application, same
with SparkPi - was there a resolution to this issue?

Thanks,
Arun


On Wed, Feb 12, 2014 at 1:28 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

  Hi, all

 When I run my application with  yarn-client mode, it seems that the system
 didn’t load my configuration file correctly, because the local app master
 always tries to register with RM via a default IP

 14/02/12 05:00:23 INFO SparkContext: Added JAR
 target/scala-2.10/rec_system_2.10-1.0.jar at
 http://172.31.37.160:51750/jars/rec_system_2.10-1.0.jar with timestamp
 1392181223818

 14/02/12 05:00:24 INFO RMProxy: Connecting to ResourceManager at /
 0.0.0.0:8032

 14/02/12 05:00:25 INFO Client: Retrying connect to server:
 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 14/02/12 05:00:26 INFO Client: Retrying connect to server:
 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 14/02/12 05:00:27 INFO Client: Retrying connect to server:
 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)


 However, if I run in a standalone mode, everything works fine
  (YARN_CONF_DIR, SPARK_APP, SPARK_YARN_APP_JAR are all set correctly)

 is it a bug?

 Best,

 --
 Nan Zhu





Re: persist @ disk-only failing

2014-05-19 Thread Matei Zaharia
This is the patch for it: https://github.com/apache/spark/pull/50/. It might be 
possible to backport it to 0.8.

Matei

On May 19, 2014, at 2:04 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:

 Matei, I am using 0.8.1!
 
 But is there a way to bypass the cache without moving to 0.9.1?
 
 
 On Mon, May 19, 2014 at 1:31 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 What version is this with? We used to build each partition first before 
 writing it out, but this was fixed a while back (0.9.1, but it may also be in 
 0.9.0).
 
 Matei
 
 On May 19, 2014, at 12:41 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
 
  Hi all,
 
  When i gave the persist level as DISK_ONLY, still Spark tries to use memory 
  and caches.
  Any reason ?
  Do i need to override some parameter elsewhere ?
 
  Thanks !
 
 



Re: specifying worker nodes when using the repl?

2014-05-19 Thread Sandy Ryza
Hi Eric,

Have you tried setting the SPARK_WORKER_INSTANCES env variable before
running spark-shell?
http://spark.apache.org/docs/0.9.0/running-on-yarn.html

-Sandy


On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.netwrote:

 Hi

 I am working with a Cloudera 5 cluster with 192 nodes and can’t work out
 how to get the spark repo to use more than 2 nodes in an interactive
 session.

 So, this works, but is non-interactive (using yarn-client as MASTER)

 /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class
 \
   org.apache.spark.deploy.yarn.Client \
   --jar
 /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar
 \
   --class org.apache.spark.examples.SparkPi \
   --args yarn-standalone \
   --args 10 \
   *--num-workers 100*

 There does not appear to be an (obvious?) way to get more than 2 nodes
 involved from the repl.

 I am running the REPL like this:

 #!/bin/sh

 . /etc/spark/conf.cloudera.spark/spark-env.sh

 export SPARK_JAR=
 hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar

 export SPARK_WORKER_MEMORY=512m

 export MASTER=yarn-client

 exec $SPARK_HOME/bin/spark-shell

 Now if I comment out the line with `export SPARK_JAR=…’ and run this
 again, I get an error like this:

 14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment
 variable!
 Usage: org.apache.spark.deploy.yarn.Client [options]
 Options:
   --jar JAR_PATH Path to your application's JAR file (required
 in yarn-cluster mode)
   --class CLASS_NAME Name of your application's main class
 (required)
   --args ARGSArguments to be passed to your application's
 main class.
  Mutliple invocations are possible, each will
 be passed in order.
   --num-workers NUM  Number of workers to start (Default: 2)
   […]

 But none of those options are exposed at the `spark-shell’ level.

 Thanks in advance for your guidance.

 Eric



Spark Streaming and Shark | Streaming Taking All CPUs

2014-05-19 Thread anishs...@yahoo.co.in
Hi All

I am new to Spark. I was trying to use Spark Streaming and Shark at the same
time.

I was receiving messages from Kafka and pushing them to HDFS after minor
processing.

It was working fine, but it was taking all the CPUs, and when I tried to access
Shark from another terminal it kept waiting until I stopped the listener.

On the web console it showed that all 6 CPUs were taken by the Spark Streaming
listener and Shark had zero CPUs.

(I have a 3-node test cluster.)

Please suggest.

Thanks & regards
--
Anish Sneh
http://in.linkedin.com/in/anishsneh
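
A knob commonly used for exactly this situation is spark.cores.max, which caps
how many cores a single application takes on a standalone (or coarse-grained
Mesos) cluster, leaving the rest free for Shark. A hedged sketch, assuming a
0.9-style StreamingContext and an arbitrary cap of 4 of the 6 cores:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("kafka-to-hdfs")      // hypothetical app name
    .set("spark.cores.max", "4")      // leave cores free for Shark queries
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... create the Kafka input stream and the HDFS output as before, then:
  ssc.start()
  ssc.awaitTermination()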



Re: unsubscribe

2014-05-19 Thread Marcelo Vanzin
Hey Andrew,

Since we're seeing so many of these e-mails, I think it's worth
pointing out that it's not really obvious to find unsubscription
information for the lists.

The community link on the Spark site
(http://spark.apache.org/community.html) does not have instructions
for unsubscribing; it links to a different archive than the one you
posted, which doesn't show that info either.

The only place right now I can see that info (without going to the
generic Apache link you posted) is by looking at the e-mail's source,
where there is a header with the unsubscribe address.

So maybe the web site should also list the unsubscribe address, or
link to the Apache archive instead of Nabble?

I know many people might not like it, but maybe the list messages
should have a footer with this administrative info (even if it's just
a link to the archive page)?

On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote:
 If you'd like to get off this mailing list, please send an email to
 user-unsubscr...@spark.apache.org, not the regular user@spark.apache.org
 list.

 How to use the Apache mailing list infrastructure is documented here:
 https://www.apache.org/foundation/mailinglists.html
 And the Spark User list specifically can be found here:
 http://mail-archives.apache.org/mod_mbox/spark-user/



-- 
Marcelo


spark ec2 commandline tool error VPC security groups may not be used for a non-VPC launch

2014-05-19 Thread Matt Work Coarr
Hi, I'm attempting to run spark-ec2 launch on AWS.  My AWS instances
would be in our EC2 VPC (which seems to be causing a problem).

The two security groups MyClusterName-master and MyClusterName-slaves have
already been setup with the same ports open as the security group that
spark-ec2 tries to create.  (My company has security rules where I don't
have permissions to create security groups, so they have to be created by
someone else ahead of time.)

I'm getting the error VPC security groups may not be used for a non-VPC
launch when I try to run spark-ec2 launch.

Is there something I need to do to make spark-ec2 launch the master and
slave instances within the VPC?

Here's the command-line and the error that I get...

command-line (I've changed the clustername to something generic):

$SPARK_HOME/ec2/spark-ec2 --key-pair=MyKeyPair
'--identity-file=~/.ssh/id_mysshkey' --slaves=2 --instance-type=m3.large
--region=us-east-1 --zone=us-east-1a --ami=myami --spark-version=0.9.1
launch MyClusterName


error:

ERROR:boto:400 Bad Request

ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC security groups may not be used for a non-VPC launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response>

Setting up security groups...

Searching for existing cluster MyClusterName...

Launching instances...

Traceback (most recent call last):
  File "./spark_ec2.py", line 806, in <module>
    main()
  File "./spark_ec2.py", line 799, in main
    real_main()
  File "./spark_ec2.py", line 682, in real_main
    conn, opts, cluster_name)
  File "./spark_ec2.py", line 344, in launch_cluster
    block_device_map = block_map)
  File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/image.py", line 255, in run
  File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 678, in run_instances
  File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 925, in get_object
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request

<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC security groups may not be used for a non-VPC launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response>

Thanks for your help!!

Matt


Re: unsubscribe

2014-05-19 Thread Andrew Ash
Agree that the links to the archives should probably point to the Apache
archives rather than Nabble's, so the unsubscribe documentation is clearer.
 Also, an (unsubscribe) link right next to subscribe with the email already
generated could help too.

I'd be one of those highly against a footer on every email.

Who can edit the community page?  http://spark.apache.org/community.html

It's not in the git repo.


On Mon, May 19, 2014 at 10:52 AM, Marcelo Vanzin van...@cloudera.comwrote:

 Hey Andrew,

 Since we're seeing so many of these e-mails, I think it's worth
 pointing out that it's not really obvious to find unsubscription
 information for the lists.

 The community link on the Spark site
 (http://spark.apache.org/community.html) does not have instructions
 for unsubscribing; it links to a different archive than the one you
 posted, which doesn't show that info either.

 The only place right now I can see that info (without going to the
 generic Apache link you posted) is by looking at the e-mai'ls source,
 where there is a header with the unsubscribe address.

 So maybe the web site should also list the unsubscribe address, or
 link to the Apache archive instead of Nabble?

 I know many people might not like it, but maybe the list messages
 should have a footer with this administrative info (even if it's just
 a link to the archive page)?

 On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote:
  If you'd like to get off this mailing list, please send an email to
  user-unsubscr...@spark.apache.org, not the regular user@spark.apache.org
  list.
 
  How to use the Apache mailing list infrastructure is documented here:
  https://www.apache.org/foundation/mailinglists.html
  And the Spark User list specifically can be found here:
  http://mail-archives.apache.org/mod_mbox/spark-user/



 --
 Marcelo



Re: sync master with slaves with bittorrent?

2014-05-19 Thread Aaron Davidson
On the ec2 machines, you can update the slaves from the master using
something like ~/spark-ec2/copy-dir ~/spark.

Spark's TorrentBroadcast relies on the Block Manager to distribute blocks,
making it relatively hard to extract.


On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com wrote:

 btw is there a command or script to update the slaves from the master?

 thanks
 Daniel


 On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote:

 If the codebase for Spark's broadcast is pretty self-contained, you could
 consider creating a small bootstrap sent out via the doubling rsync
 strategy that Mosharaf outlined above (called Tree D=2 in the paper) that
 then pulled the larger

 Mosharaf, do you have a sense of whether the gains from using Cornet vs
 Tree D=2 with rsync outweighs the overhead of using a 2-phase broadcast
 mechanism?

 Andrew


 On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.comwrote:

 One issue with using Spark itself is that this rsync is required to get
 Spark to work...

 Also note that a similar strategy is used for *updating* the spark
 cluster on ec2, where the diff aspect is much more important, as you
 might only make a small change on the driver node (recompile or
 reconfigure) and can get a fast sync.


 On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury 
 mosharafka...@gmail.com wrote:

 What twitter calls murder, unless it has changed since then, is just a
 BitTornado wrapper. In 2011, We did some comparison on the performance of
 murder and the TorrentBroadcast we have right now for Spark's own broadcast
 (Section 7.1 in
 http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
 Spark's implementation was 4.5X faster than murder.

 The only issue with using TorrentBroadcast to deploy code/VM is writing
 a wrapper around it to read from disk, but it shouldn't be too complicated.
 If someone picks it up, I can give some pointers on how to proceed (I've
 thought about doing it myself forever, but never ended up actually taking
 the time; right now I don't have enough free cycles either)

 Otherwise, murder/BitTornado would be better than the current strategy
 we have.

 A third option would be to use rsync; but instead of rsync-ing to every
 slave from the master, one can simply rsync from the master first to one
 slave; then use the two sources (master and the first slave) to rsync to
 two more; then four and so on. Might be a simpler solution without many
 changes.

 --
 Mosharaf Chowdhury
 http://www.mosharaf.com/


 On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.comwrote:

 My first thought would be to use libtorrent for this setup, and it
 turns out that both Twitter and Facebook do code deploys with a bittorrent
 setup.  Twitter even released their code as open source:


 https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent


 http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/


 On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.comwrote:

 I am not an expert in this space either. I thought the initial rsync
 during launch is really just a straight copy that did not need the tree
 diff. So it seemed like having the slaves do the copying among it each
 other would be better than having the master copy to everyone directly.
 That made me think of bittorrent, though there may well be other systems
 that do this.
 From the launches I did today it seems that it is taking around 1
 minute per slave to launch a cluster, which can be a problem for clusters
 with 10s or 100s of slaves, particularly since on ec2  that time has to 
 be
 paid for.


 On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson 
 ilike...@gmail.comwrote:

 Out of curiosity, do you have a library in mind that would make it
 easy to setup a bit torrent network and distribute files in an rsync 
 (i.e.,
 apply a diff to a tree, ideally) fashion? I'm not familiar with this 
 space,
 but we do want to minimize the complexity of our standard ec2 launch
 scripts to reduce the chance of something breaking.


 On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.comwrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel











Re: sync master with slaves with bittorrent?

2014-05-19 Thread Mosharaf Chowdhury
Good catch. In that case, using BitTornado/murder would be better.


--
Mosharaf Chowdhury
http://www.mosharaf.com/


On Mon, May 19, 2014 at 11:17 AM, Aaron Davidson ilike...@gmail.com wrote:

 On the ec2 machines, you can update the slaves from the master using
 something like ~/spark-ec2/copy-dir ~/spark.

 Spark's TorrentBroadcast relies on the Block Manager to distribute blocks,
 making it relatively hard to extract.


 On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com wrote:

 btw is there a command or script to update the slaves from the master?

 thanks
 Daniel


 On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote:

 If the codebase for Spark's broadcast is pretty self-contained, you
 could consider creating a small bootstrap sent out via the doubling rsync
 strategy that Mosharaf outlined above (called Tree D=2 in the paper) that
 then pulled the larger

 Mosharaf, do you have a sense of whether the gains from using Cornet vs
 Tree D=2 with rsync outweighs the overhead of using a 2-phase broadcast
 mechanism?

 Andrew


 On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.comwrote:

 One issue with using Spark itself is that this rsync is required to get
 Spark to work...

 Also note that a similar strategy is used for *updating* the spark
 cluster on ec2, where the diff aspect is much more important, as you
 might only make a small change on the driver node (recompile or
 reconfigure) and can get a fast sync.


 On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury 
 mosharafka...@gmail.com wrote:

 What twitter calls murder, unless it has changed since then, is just a
 BitTornado wrapper. In 2011, We did some comparison on the performance of
 murder and the TorrentBroadcast we have right now for Spark's own 
 broadcast
 (Section 7.1 in
 http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
 Spark's implementation was 4.5X faster than murder.

 The only issue with using TorrentBroadcast to deploy code/VM is
 writing a wrapper around it to read from disk, but it shouldn't be too
 complicated. If someone picks it up, I can give some pointers on how to
 proceed (I've thought about doing it myself forever, but never ended up
 actually taking the time; right now I don't have enough free cycles 
 either)

 Otherwise, murder/BitTornado would be better than the current strategy
 we have.

 A third option would be to use rsync; but instead of rsync-ing to
 every slave from the master, one can simply rsync from the master first to
 one slave; then use the two sources (master and the first slave) to rsync
 to two more; then four and so on. Might be a simpler solution without many
 changes.

 --
 Mosharaf Chowdhury
 http://www.mosharaf.com/


 On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.comwrote:

 My first thought would be to use libtorrent for this setup, and it
 turns out that both Twitter and Facebook do code deploys with a 
 bittorrent
 setup.  Twitter even released their code as open source:


 https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent


 http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/


 On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.comwrote:

 I am not an expert in this space either. I thought the initial rsync
 during launch is really just a straight copy that did not need the tree
 diff. So it seemed like having the slaves do the copying among it each
 other would be better than having the master copy to everyone directly.
 That made me think of bittorrent, though there may well be other systems
 that do this.
 From the launches I did today it seems that it is taking around 1
 minute per slave to launch a cluster, which can be a problem for 
 clusters
 with 10s or 100s of slaves, particularly since on ec2  that time has to 
 be
 paid for.


 On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com
  wrote:

 Out of curiosity, do you have a library in mind that would make it
 easy to setup a bit torrent network and distribute files in an rsync 
 (i.e.,
 apply a diff to a tree, ideally) fashion? I'm not familiar with this 
 space,
 but we do want to minimize the complexity of our standard ec2 launch
 scripts to reduce the chance of something breaking.


 On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler 
 dmah...@gmail.comwrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel












Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Xiangrui,

But I did not find the directory:
examples/src/main/scala/org/apache/spark/examples/mllib.
Could you give me more detail or show me one example? Thanks a lot.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6049.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Which component(s) of Spark do not support IPv6?

2014-05-19 Thread queniesun
Besides Hadoop, are there any other components of Spark that do not support
IPv6? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Which-component-s-of-Spark-do-not-support-IPv6-tp6050.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Sandy, thank you so much — that was indeed my omission!
Eric

On May 19, 2014, at 10:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Eric,
 
 Have you tried setting the SPARK_WORKER_INSTANCES env variable before running 
 spark-shell?
 http://spark.apache.org/docs/0.9.0/running-on-yarn.html
 
 -Sandy
 
 
 On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.net wrote:
 Hi
 
 I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how 
 to get the spark repo to use more than 2 nodes in an interactive session.
 
 So, this works, but is non-interactive (using yarn-client as MASTER)
 
 /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \
   org.apache.spark.deploy.yarn.Client \
   --jar 
 /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar
  \
   --class org.apache.spark.examples.SparkPi \
   --args yarn-standalone \
   --args 10 \
   --num-workers 100
 
 There does not appear to be an (obvious?) way to get more than 2 nodes 
 involved from the repl.
 
 I am running the REPL like this:
 
 #!/bin/sh
 
 . /etc/spark/conf.cloudera.spark/spark-env.sh
 
 export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar
 
 export SPARK_WORKER_MEMORY=512m
 
 export MASTER=yarn-client
 
 exec $SPARK_HOME/bin/spark-shell
 
 Now if I comment out the line with `export SPARK_JAR=…’ and run this again, I 
 get an error like this:
 
 14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment 
 variable!
 Usage: org.apache.spark.deploy.yarn.Client [options] 
 Options:
   --jar JAR_PATH Path to your application's JAR file (required in 
 yarn-cluster mode)
   --class CLASS_NAME Name of your application's main class (required)
   --args ARGSArguments to be passed to your application's 
 main class.
  Mutliple invocations are possible, each will be 
 passed in order.
   --num-workers NUM  Number of workers to start (Default: 2)
   […]
 
 But none of those options are exposed at the `spark-shell’ level.
 
 Thanks in advance for your guidance.
 
 Eric
 



Re: How to compile the examples directory?

2014-05-19 Thread Matei Zaharia
If you’d like to work on just this code for your own changes, it might be best 
to copy it to a separate project. Look at 
http://spark.apache.org/docs/latest/quick-start.html for how to set up a 
standalone job.

Matei

On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote:

 Hi,
 
 I am running some examples of Spark on a cluster. Because I need to modify 
 some source code, I have to re-compile the whole Spark using `sbt/sbt 
 assembly`, which takes a long time.
 
 I have tried `mvn package` under the example directory, it failed because of 
 some dependencies problem. Any way to avoid to compile the whole Spark 
 project?
 
 
 Regards,
 Wang Hao(王灏)
 
 CloudTeam | School of Software Engineering
 Shanghai Jiao Tong University
 Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
 Email:wh.s...@gmail.com



combinebykey throw classcastexception

2014-05-19 Thread xiemeilong
I am using CDH5 on a three-machine cluster. I map data from HBase as (String,
V) pairs, then call combineByKey like this:

.combineByKey[C](
  (v: V) => new C(v),   // this line throws java.lang.ClassCastException: C cannot be cast to V
  (c: C, v: V) => c,
  (c1: C, c2: C) => c1)

I am very confused by this; there isn't any C-to-V casting at all. What's
wrong?
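
For comparison, a self-contained combineByKey call with concrete types (here
building a List[Int] per key); everything in it is illustrative rather than a
fix for the exception above:

  import org.apache.spark.SparkContext._  // pair RDD functions (implicit in spark-shell)

  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  val grouped = pairs.combineByKey[List[Int]](
    (v: Int) => List(v),                          // createCombiner: V => C
    (c: List[Int], v: Int) => v :: c,             // mergeValue: (C, V) => C
    (c1: List[Int], c2: List[Int]) => c1 ::: c2)  // mergeCombiners: (C, C) => C
  grouped.collect().foreach(println)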



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/combinebykey-throw-classcastexception-tp6059.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: advice on maintaining a production spark cluster?

2014-05-19 Thread Matei Zaharia
Which version is this with? I haven’t seen standalone masters lose workers. Is 
there other stuff on the machines that’s killing them, or what errors do you 
see?

Matei

On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:

 Hey folks,
 
 I'm wondering what strategies other folks are using for maintaining and 
 monitoring the stability of stand-alone spark clusters.
 
 Our master very regularly loses workers, and they (as expected) never rejoin 
 the cluster.  This is the same behavior I've seen
 using akka cluster (if that's what spark is using in stand-alone mode) -- are 
 there configuration options we could be setting
 to make the cluster more robust?
 
 We have a custom script which monitors the number of workers (through the web 
 interface) and restarts the cluster when
 necessary, as well as resolving other issues we face (like spark shells left 
 open permanently claiming resources), and it
 works, but it's nowhere close to a great solution.
 
 What are other folks doing?  Is this something that other folks observe as 
 well?  I suspect that the loss of workers is tied to 
 jobs that run out of memory on the client side or our use of very large 
 broadcast variables, but I don't have an isolated test case.
 I'm open to general answers here: for example, perhaps we should simply be 
 using mesos or yarn instead of stand-alone mode.
 
 --j
 



Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Xiangrui,

 Sorry, I am new to Spark. Could you give me more detail about
"master" or "branch-1.0"? I do not know what master or branch-1.0 is.
Thanks again.


On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User
List] ml-node+s1001560n6064...@n3.nabble.com wrote:
 Checkout the master or branch-1.0. Then the examples should be there.
 -Xiangrui

 On Mon, May 19, 2014 at 11:36 AM, yxzhao [hidden email] wrote:

 Thanks Xiangrui,

 But I did not find the directory:
 examples/src/main/scala/org/apache/spark/examples/mllib.
 Could you give me more detail or show me one example? Thanks a lot.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6049.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


 
 If you reply to this email, your message will be added to the discussion
 below:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6064.html
 To unsubscribe from How to run the SVM and LogisticRegression, click here.
 NAML




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6065.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread Andrew Ash
Hi yxzhao,

Those are branches in the source code git repository.  You can get to them
with git checkout branch-1.0 once you've cloned the git repository.

Cheers,
Andrew


On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote:

 Thanks Xiangrui,

  Sorry I am new for Spark, could you give me more detail about
 master or branch-1.0  I do not know what master or branch-1.0 is.
 Thanks again.


 On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User
 List] [hidden email] http://user/SendEmail.jtp?type=nodenode=6065i=0
 wrote:

  Checkout the master or branch-1.0. Then the examples should be there.
  -Xiangrui
 
  On Mon, May 19, 2014 at 11:36 AM, yxzhao [hidden email] wrote:
 
  Thanks Xiangrui,
 
  But I did not find the directory:
  examples/src/main/scala/org/apache/spark/examples/mllib.
  Could you give me more detail or show me one example? Thanks a lot.
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6049.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
 
  
  If you reply to this email, your message will be added to the discussion
  below:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6064.html
  To unsubscribe from How to run the SVM and LogisticRegression, click
 here.
  NAML

 --
 View this message in context: Re: How to run the SVM and
 LogisticRegressionhttp://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6065.html

 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at Nabble.com.



Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread jonathan.keebler
Has anyone observed Spark worker threads stalling during a shuffle phase with
the following message (one per worker host) being echoed to the terminal on
the driver thread?

INFO spark.MapOutputTrackerActor: Asked to send map output locations for
shuffle 0 to [worker host]...


At this point Spark-related activity on the hadoop cluster completely halts:
there's no network activity, disk IO, or CPU activity, and individual
tasks are not completing; the job just sits in this state.  At this point
we just kill the job, and a restart of the Spark server service is required.

Using identical jobs we were able to bypass this halt point by increasing
the heap memory available to the workers, but it's odd that we don't get an
out-of-memory error or any error at all.  Upping the available memory isn't
a very satisfying answer to what may be going on :)

We're running Spark 0.9.0 on CDH5.0 in stand-alone mode.

Thanks for any help or ideas you may have!

Cheers,
Jonathan




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Reading from .bz2 files with Spark

2014-05-19 Thread Andrew Ash
Hi Xiangrui, many thanks to you and Sandy for fixing this issue!


On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Andrew,

 I submitted a patch and verified it solves the problem. You can
 download the patch from
 https://issues.apache.org/jira/browse/HADOOP-10614 .

 Best,
 Xiangrui

 On Fri, May 16, 2014 at 6:48 PM, Xiangrui Meng men...@gmail.com wrote:
  Hi Andrew,
 
  This is the JIRA I created:
  https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully
  someone wants to work on it.
 
  Best,
  Xiangrui
 
  On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng men...@gmail.com wrote:
  Hi Andre,
 
  I could reproduce the bug with Hadoop 2.2.0. Some older versions of
  Hadoop do not support splittable compression, so you ended up with
  sequential reads. It is easy to reproduce the bug with the following
  setup:
 
  1) Workers are configured with multiple cores.
  2) BZip2 files are big enough or minPartitions is large enough when
  you load the file via sc.textFile(), so that one worker has more than
  one task.
 
  Best,
  Xiangrui
 
  On Fri, May 16, 2014 at 4:06 PM, Andrew Ash and...@andrewash.com
 wrote:
  Hi Xiangrui,
 
  // FYI I'm getting your emails late due to the Apache mailing list
 outage
 
  I'm using CDH4.4.0, which I think uses the MapReduce v2 API.  The
 .jars are
  named like this: hadoop-hdfs-2.0.0-cdh4.4.0.jar
 
  I'm also glad you were able to reproduce!  Please paste a link to the
 Hadoop
  bug you file so I can follow along.
 
  Thanks!
  Andrew
 
 
  On Tue, May 13, 2014 at 9:08 AM, Xiangrui Meng men...@gmail.com
 wrote:
 
  Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes
  the problem you described, but it does contain several fixes to bzip2
  format. -Xiangrui
 
  On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com
 wrote:
   Hi all,
  
   Is anyone reading and writing to .bz2 files stored in HDFS from
 Spark
   with
   success?
  
  
   I'm finding the following results on a recent commit (756c96 from
 24hr
   ago)
   and CDH 4.4.0:
  
    Works: val r = sc.textFile("/user/aa/myfile.bz2").count
    Doesn't work: val r =
  sc.textFile("/user/aa/myfile.bz2").map((s: String) =>
    s + "|  ").count
  
   Specifically, I'm getting an exception coming out of the bzip2
 libraries
   (see below stacktraces), which is unusual because I'm able to read
 from
   that
   file without an issue using the same libraries via Pig.  It was
   originally
   created from Pig as well.
  
   Digging a little deeper I found this line in the .bz2 decompressor's
   javadoc
   for CBZip2InputStream:
  
   Instances of this class are not threadsafe. [source]
  
  
   My current working theory is that Spark has a much higher level of
   parallelism than Pig/Hadoop does and thus I get these wild
   IndexOutOfBounds
   exceptions much more frequently (as in can't finish a run over a
 little
   2M
   row file) vs hardly at all in other libraries.
  
   The only other reference I could find to the issue was in
 presto-users,
   but
   the recommendation to leave .bz2 for .lzo doesn't help if I
 actually do
   want
   the higher compression levels of .bz2.
  
  
   Would love to hear if I have some kind of configuration issue or if
   there's
   a bug in .bz2 that's fixed in later versions of CDH, or generally
 any
   other
   thoughts on the issue.
  
  
   Thanks!
   Andrew
  
  
  
   Below are examples of some exceptions I'm getting:
  
   14/05/07 15:09:49 WARN scheduler.TaskSetManager: Loss was due to
   java.lang.ArrayIndexOutOfBoundsException
   java.lang.ArrayIndexOutOfBoundsException: 65535
   at
  
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.hbCreateDecodeTables(CBZip2InputStream.java:663)
   at
  
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.createHuffmanDecodingTables(CBZip2InputStream.java:790)
   at
  
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:762)
   at
  
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:798)
   at
  
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502)
   at
  
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
   at
  
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397)
   at
  
  
 org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426)
   at java.io.InputStream.read(InputStream.java:101)
   at
  
 org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
   at
   org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
   at
  
  
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203)
   at
  
 

life if an executor

2014-05-19 Thread Koert Kuipers
From looking at the source code I see executors run in their own JVM
subprocesses.

How long do they live for? As long as the worker/slave? Or are they tied to
the SparkContext and live/die with it?

thx


Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
 Thanks Andrew,

 Yes, I have downloaded the master code.
 But, actually I just want to know how to run the
classification algorithms SVM and LogisticRegression implemented under
/spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
. Thanks.
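
For what it's worth, a minimal sketch of calling those two classes directly
(0.9-era API, where LabeledPoint features are an Array[Double]; the toy data and
iteration count are made up):

  import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
  import org.apache.spark.mllib.regression.LabeledPoint

  // Hand-built toy dataset, just to show the call pattern.
  val data = sc.parallelize(Seq(
    LabeledPoint(1.0, Array(1.0, 0.5)),
    LabeledPoint(0.0, Array(-1.0, -0.5))))

  val svmModel = SVMWithSGD.train(data, 100)                 // 100 iterations
  val lrModel  = LogisticRegressionWithSGD.train(data, 100)

  val labelAndPred = data.map(p => (p.label, lrModel.predict(p.features)))
  labelAndPred.collect().foreach(println)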

On Mon, May 19, 2014 at 10:37 PM, Andrew Ash [via Apache Spark User
List] ml-node+s1001560n6066...@n3.nabble.com wrote:
 Hi yxzhao,

 Those are branches in the source code git repository.  You can get to them
 with git checkout branch-1.0 once you've cloned the git repository.

 Cheers,
 Andrew


 On Mon, May 19, 2014 at 8:30 PM, yxzhao [hidden email] wrote:

 Thanks Xiangrui,

  Sorry I am new for Spark, could you give me more detail about
 master or branch-1.0  I do not know what master or branch-1.0 is.
 Thanks again.


 On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User
 List] [hidden email] wrote:

  Checkout the master or branch-1.0. Then the examples should be there.
  -Xiangrui
 
  On Mon, May 19, 2014 at 11:36 AM, yxzhao [hidden email] wrote:

 
  Thanks Xiangrui,
 
  But I did not find the directory:
  examples/src/main/scala/org/apache/spark/examples/mllib.
  Could you give me more detail or show me one example? Thanks a lot.
 
 
 
  --
  View this message in context:
 
  http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6049.html
  Sent from the Apache Spark User List mailing list archive at
  Nabble.com.
 
 
  
  If you reply to this email, your message will be added to the discussion
  below:
 
  http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6064.html
  To unsubscribe from How to run the SVM and LogisticRegression, click
  here.
  NAML

 
 View this message in context: Re: How to run the SVM and
 LogisticRegression

 Sent from the Apache Spark User List mailing list archive at Nabble.com.




 
 If you reply to this email, your message will be added to the discussion
 below:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6066.html
 To unsubscribe from How to run the SVM and LogisticRegression, click here.
 NAML




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6070.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Problem when sorting big file

2014-05-19 Thread Andrew Ash
Is your RDD of Strings?  If so, you should make sure to use the Kryo
serializer instead of the default Java one.  It stores strings as UTF8
rather than Java's default UTF16 representation, which can save you half
the memory usage in the right situation.

Try setting the persistence level on the RDD to MEMORY_AND_DISK_SER and
possibly also lowering spark.storage.memoryFraction from 0.6 to 0.4 or so.
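
A sketch putting those suggestions together (property names as of the 0.9/1.0
era; the paths and the key extraction are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._                   // pair RDD functions such as sortByKey
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("big-sort")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.storage.memoryFraction", "0.4")            // down from the 0.6 default
  val sc = new SparkContext(conf)

  val pairs = sc.textFile("hdfs:///path/to/input")         // hypothetical path
    .map(line => (line.split("\t")(0), line))              // hypothetical key extraction
    .persist(StorageLevel.MEMORY_AND_DISK_SER)

  pairs.sortByKey().saveAsTextFile("hdfs:///path/to/output")  // hypothetical path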

Andrew


On Thu, May 15, 2014 at 2:55 PM, Gustavo Enrique Salazar Torres 
gsala...@ime.usp.br wrote:

 Hi there:

 I have this dataset (about 12G) which I need to sort by key.
 I used the sortByKey method, but when I try to save the file to disk (HDFS
 in this case) it seems that some tasks run out of time because they have
 too much data to save and it can't fit in memory.
 I say this because before the TimeOut exception at the worker there is an
 OOM exception from a specific task.
 My question is: is this a common problem with Spark? Has anyone been through
 this issue?
 The cause of the problem seems to be an unbalanced distribution of data
 between tasks.

 I will appreciate any help.

 Thanks
 Gustavo



facebook data mining with Spark

2014-05-19 Thread Joe L
Is there any way to get facebook data into Spark and filter the content of
it?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/facebook-data-mining-with-Spark-tp6072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Setting queue for spark job on yarn

2014-05-19 Thread Ron Gonzalez
Hi,
  How does one submit a spark job to yarn and specify a queue?
  The code that successfully submits to yarn is:

   val conf = new SparkConf()
   val sc = new SparkContext("yarn-client", "Simple App", conf)

   Where do I need to specify the queue?

  Thanks in advance for any help on this...

Thanks,
Ron
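
The queue is normally picked up from configuration rather than the SparkContext
constructor. In later Spark releases it is the spark.yarn.queue property; on
0.9-era builds the YARN client exposed the queue through its own options and
environment variables instead, so check the running-on-yarn docs for your
version. A hedged sketch using the property form:

  val conf = new SparkConf()
    .set("spark.yarn.queue", "myQueue")  // property name from later releases; "myQueue" is hypothetical
  val sc = new SparkContext("yarn-client", "Simple App", conf)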

Re: life if an executor

2014-05-19 Thread Matei Zaharia
They’re tied to the SparkContext (application) that launched them.

Matei

On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm 
 subprocesses.
 
 how long to they live for? as long as the worker/slave? or are they tied to 
 the sparkcontext and life/die with it?
 
 thx



Re: filling missing values in a sequence

2014-05-19 Thread Mohit Jaggi
Thanks Sean. Yes, your solution works :-) I did oversimplify my real
problem, which has other parameters that go along with the sequence.


On Fri, May 16, 2014 at 3:03 AM, Sean Owen so...@cloudera.com wrote:

 Not sure if this is feasible, but this literally does what I think you
 are describing:

 sc.parallelize(rdd1.first to rdd1.last)
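
Made concrete, that one-liner might look like the sketch below (RDD has no .last
method, so the upper bound is taken with a reduce; assumes the sequence is
sorted Ints):

  val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))
  val lo = rdd1.first()                  // smallest element, since the RDD is sorted
  val hi = rdd1.reduce(math.max)         // largest element
  val filled = sc.parallelize(lo to hi)  // 1, 2, 3, ..., 11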

 On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi mohitja...@gmail.com wrote:
  Hi,
  I am trying to find a way to fill in missing values in an RDD. The RDD
 is a
  sorted sequence.
  For example, (1, 2, 3, 5, 8, 11, ...)
  I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11)
 
  One way to do this is to slide and zip
  rdd1 =  sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
  x = rdd1.first
  rdd2 = rdd1 filter (_ != x)
  rdd3 = rdd2 zip rdd1
  rdd4 = rdd3 flatMap { (x, y) => generate missing elements between x and
 y }
 
  Another method which I think is more efficient is to use
 mapPartitions() on
  rdd1 to be able to iterate on elements of rdd1 in each partition.
 However,
  that leaves the boundaries of the partitions to be unfilled. Is there a
  way within the function passed to mapPartitions, to read the first
 element
  in the next partition?
 
  The latter approach also appears to work for a general sliding window
  calculation on the RDD. The former technique requires a lot of sliding
 and
  zipping and I believe it is not efficient. If only I could read the next
  partition...I have tried passing a pointer to rdd1 to the function
 passed to
  mapPartitions but the rdd1 pointer turns out to be NULL, I guess because
  Spark cannot deal with a mapper calling another mapper (since it happens
 on
  a worker not the driver)
 
  Mohit.



Re: life if an executor

2014-05-19 Thread Mohit Jaggi
I guess it needs to be this way to benefit from caching of RDDs in
memory. It would be nice, however, if the RDD cache could be dissociated from
the JVM heap, so that in cases where garbage collection is difficult to
tune, one could choose to discard the JVM and run the next operation in a
fresh one.


On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.comwrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long to they live for? as long as the worker/slave? or are they tied
 to the sparkcontext and life/die with it?

 thx





Re: life if an executor

2014-05-19 Thread Tathagata Das
That's one of the main motivations for using Tachyon ;)
http://tachyon-project.org/

It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache
any RDD in Tachyon just by specifying the appropriate StorageLevel.
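
For example (a sketch; StorageLevel.OFF_HEAP is the name used in 1.0-era
releases, and spark.tachyonStore.url must point at a running Tachyon master, so
double-check both against the docs for your version):

  import org.apache.spark.storage.StorageLevel

  // Assumes spark.tachyonStore.url is set, e.g. to tachyon://master:19998
  val cached = sc.textFile("hdfs:///some/input").persist(StorageLevel.OFF_HEAP)  // hypothetical path
  cached.count()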

TD




On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 I guess it needs to be this way to benefit from caching of RDDs in
 memory. It would be nice however if the RDD cache can be dissociated from
 the JVM heap so that in cases where garbage collection is difficult to
 tune, one could choose to discard the JVM and run the next operation in a
 few one.


 On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long to they live for? as long as the worker/slave? or are they tied
 to the sparkcontext and life/die with it?

 thx