For performance, Spark prefers OracleJDK or OpenJDK?
Hi, Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com
Re: For performance, Spark prefers OracleJDK or OpenJDK?
I would like to say that Oracle JDK may be the better choice. A lot of Hadoop distribution vendors use Oracle JDK instead of OpenJDK for enterprise deployments. On Mon, May 19, 2014 at 2:50 PM, Hao Wang wh.s...@gmail.com wrote: Hi, Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com -- Regards Gordon Wang
Re: sync master with slaves with bittorrent?
btw is there a command or script to update the slaves from the master? thanks Daniel On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote: If the codebase for Spark's broadcast is pretty self-contained, you could consider creating a small bootstrap sent out via the doubling rsync strategy that Mosharaf outlined above (called Tree D=2 in the paper) that then pulled the larger distribution. Mosharaf, do you have a sense of whether the gains from using Cornet vs Tree D=2 with rsync outweigh the overhead of using a 2-phase broadcast mechanism? Andrew On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.com wrote: One issue with using Spark itself is that this rsync is required to get Spark to work... Also note that a similar strategy is used for *updating* the spark cluster on ec2, where the diff aspect is much more important, as you might only make a small change on the driver node (recompile or reconfigure) and can get a fast sync. On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury mosharafka...@gmail.com wrote: What Twitter calls murder, unless it has changed since then, is just a BitTornado wrapper. In 2011, we did some comparison on the performance of murder and the TorrentBroadcast we have right now for Spark's own broadcast (Section 7.1 in http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf). Spark's implementation was 4.5X faster than murder. The only issue with using TorrentBroadcast to deploy code/VM is writing a wrapper around it to read from disk, but it shouldn't be too complicated. If someone picks it up, I can give some pointers on how to proceed (I've thought about doing it myself forever, but never ended up actually taking the time; right now I don't have enough free cycles either). Otherwise, murder/BitTornado would be better than the current strategy we have. 
A third option would be to use rsync; but instead of rsync-ing to every slave from the master, one can simply rsync from the master first to one slave; then use the two sources (master and the first slave) to rsync to two more; then four, and so on. Might be a simpler solution without many changes. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.com wrote: My first thought would be to use libtorrent for this setup, and it turns out that both Twitter and Facebook do code deploys with a bittorrent setup. Twitter even released their code as open source: https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/ On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.com wrote: I am not an expert in this space either. I thought the initial rsync during launch is really just a straight copy that did not need the tree diff. So it seemed like having the slaves do the copying among each other would be better than having the master copy to everyone directly. That made me think of bittorrent, though there may well be other systems that do this. From the launches I did today it seems that it is taking around 1 minute per slave to launch a cluster, which can be a problem for clusters with 10s or 100s of slaves, particularly since on ec2 that time has to be paid for. On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com wrote: Out of curiosity, do you have a library in mind that would make it easy to set up a bittorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard ec2 launch scripts to reduce the chance of something breaking. 
On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching a rather large cluster on ec2. It seems like the launch is taking forever on: Setting up spark RSYNC'ing /root/spark to slaves... ... It seems that bittorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there are a lot of not very powerful slaves. Just a thought ... cheers Daniel
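Mosharaf's doubling rsync idea above (every node that already has the data seeds one new node per round, starting from the master alone) can be sketched in a few lines; the function name is invented for illustration:

```python
def doubling_rounds(num_slaves):
    """Rounds until all slaves have the data when every holder
    (master included) rsyncs to exactly one new node per round."""
    have = 1    # only the master has the data initially
    rounds = 0
    while have < num_slaves + 1:
        have *= 2   # each holder seeds one new machine
        rounds += 1
    return rounds

# A 100-slave cluster needs 7 rounds instead of 100 sequential copies.
```

The holder count doubles each round, so the round count grows logarithmically in the cluster size rather than linearly.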
persist @ disk-only failing
Hi all, When I set the persistence level to DISK_ONLY, Spark still tries to use memory and caches. Any reason? Do I need to override some parameter elsewhere? Thanks!
Re: Packaging a spark job using maven
Hi Eugen, Thanks for your help. I'm not familiar with the shade plugin and I was wondering: does it replace the assembly plugin? Also, do I have to specify all the artifacts and sub-artifacts in the artifactSet? Or can I just use a *:* wildcard and let the Maven scopes do their work? I get a lot of overlap warnings when I do so. Thanks for your help. Regards, Laurent -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-tp5615p6024.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Packaging a spark job using maven
2014-05-19 10:35 GMT+02:00 Laurent T laurent.thou...@ldmobile.net: Hi Eugen, Thanks for your help. I'm not familiar with the shade plugin and I was wondering: does it replace the assembly plugin? Nope, it doesn't replace it. It allows you to make fat jars and other nice things such as relocating classes to some other package. I am using it in combination with assembly and jdeb to build deployable archives (zip and debian). I find that building fat jars with the shade plugin is more powerful and easier than with assembly. Also, do I have to specify all the artifacts and sub-artifacts in the artifactSet? Or can I just use a *:* wildcard and let the Maven scopes do their work? I have a lot of overlap warnings when I do so. Indeed, you don't have to say exactly what must be included; I do so in order to end up with a small archive that we can quickly deploy. Have a look at the docs, which have some examples: http://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html In short, remove the includes and instead write excludes (spark, hadoop, etc.). The overlap warnings are due to the same classes being present in different jars; you can exclude those jars to remove the warnings. http://stackoverflow.com/questions/19987080/maven-shade-plugin-uber-jar-and-overlapping-classes http://stackoverflow.com/questions/11824633/maven-shade-plugin-warning-we-have-a-duplicate-how-to-fix Eugen
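As a rough illustration of the excludes-not-includes advice above, a minimal shade-plugin fragment might look like the following; the exact artifact patterns are assumptions and should be adapted to your own dependency tree:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <excludes>
        <!-- leave out what the cluster already provides -->
        <exclude>org.apache.spark:*</exclude>
        <exclude>org.apache.hadoop:*</exclude>
      </excludes>
    </artifactSet>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```

Everything not excluded ends up in the fat jar, which is why overlapping classes in different jars trigger the duplicate warnings.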
Re: sync master with slaves with bittorrent?
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler dmah...@gmail.com wrote: I agree that for updating, rsync is probably preferable, and it seems that for that purpose it would also parallelize well, since most of the time is spent computing checksums, so the process is not constrained by the total I/O capacity of the master. However it is a problem for the initial replication from the master to the slaves. If you are running on EC2, the dollar overhead of a sequential launch is quadratic in the number of slaves: if you launch a 100-machine cluster you will wait 100 minutes, but you will pay for 10,000 machine-minutes, or about 167 machine-hours, before anything useful starts to happen. Launch time does *not* increase linearly with the number of slaves as I thought I was seeing. It would still be nice to have a faster launch though. cheers Daniel
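Daniel's back-of-the-envelope cost figure can be checked with a quick sketch (function name invented; it assumes every instance is billed from cluster launch and slaves are synced one per minute):

```python
def sequential_launch_cost(num_slaves, minutes_per_slave=1.0):
    """Wall-clock minutes and billed machine-minutes when slaves are
    rsync'ed one at a time and all instances run for the whole launch."""
    wall = num_slaves * minutes_per_slave
    machine_minutes = num_slaves * wall  # every slave idles for the full launch
    return wall, machine_minutes

wall, billed = sequential_launch_cost(100)
# 100 minutes of wall clock; 10,000 machine-minutes (about 167 hours) billed
```

Wall-clock time is linear in the slave count under this assumption, but the billed machine-time is quadratic, which is what makes sequential rsync expensive at scale.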
Re: File present but file not found exception
why does it need to be a local file? why not do some filter ops on the HDFS file and save to HDFS, from where you can create the RDD? you can read a small file in on the driver program and use sc.parallelize to turn it into an RDD On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: I found that if a file is present on all the nodes at the given path in the local FS, then reading is possible. But is there a way to read if the file is present only on certain nodes?? [There should be a way!!] *NEED: Wanted to do some filter ops on an HDFS file, create a local file of the result, create an RDD out of it, and operate* Is there any way out?? Thanks in advance! On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi Everyone, I think all are pretty busy; the response time in this group has slightly increased. But anyway, this is a pretty silly problem that I could not get over. I have a file in my local FS, but when I try to create an RDD out of it, the task fails with a file-not-found exception in the log files. var file = sc.textFile("file:///home/sparkcluster/spark/input.txt"); file.top(1); input.txt exists in the above folder but still Spark couldn't find it. Do some parameters need to be set?? Any help is really appreciated. Thanks!!
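The pattern suggested above (read a small file on the driver, then hand the list to sc.parallelize) works because parallelize simply slices a driver-local list into partitions for the executors. A rough pure-Python sketch of that slicing, with a hypothetical helper name:

```python
def chunk_for_partitions(lines, num_partitions):
    """Split a driver-local list into num_partitions slices; roughly
    what sc.parallelize does before shipping the slices to executors."""
    n = len(lines)
    return [lines[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

parts = chunk_for_partitions(["a", "b", "c", "d", "e"], 2)
# every line lands in exactly one partition
```

This is why the driver-side read avoids the original problem: the file only has to exist on the driver machine, not on every worker.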
specifying worker nodes when using the repl?
Hi I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the Spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER): /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \ org.apache.spark.deploy.yarn.Client \ --jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar \ --class org.apache.spark.examples.SparkPi \ --args yarn-standalone \ --args 10 \ --num-workers 100 There does not appear to be an (obvious?) way to get more than 2 nodes involved from the repl. I am running the REPL like this: #!/bin/sh . /etc/spark/conf.cloudera.spark/spark-env.sh export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar export SPARK_WORKER_MEMORY=512m export MASTER=yarn-client exec $SPARK_HOME/bin/spark-shell Now if I comment out the line with `export SPARK_JAR=…’ and run this again, I get an error like this: 14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment variable! Usage: org.apache.spark.deploy.yarn.Client [options] Options: --jar JAR_PATH Path to your application's JAR file (required in yarn-cluster mode) --class CLASS_NAME Name of your application's main class (required) --args ARGS Arguments to be passed to your application's main class. Multiple invocations are possible, each will be passed in order. --num-workers NUM Number of workers to start (Default: 2) […] But none of those options are exposed at the `spark-shell’ level. Thanks in advance for your guidance. Eric
Re: Yarn configuration file doesn't work when run with yarn-client mode
I am encountering the same thing. Basic yarn apps work as does the SparkPi example, but my custom application gives this result. I am using compute-classpath to create the proper classpath for my application, same with SparkPi - was there a resolution to this issue? Thanks, Arun On Wed, Feb 12, 2014 at 1:28 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all When I run my application with yarn-client mode, it seems that the system didn’t load my configuration file correctly, because the local app master always tries to register with RM via a default IP 14/02/12 05:00:23 INFO SparkContext: Added JAR target/scala-2.10/rec_system_2.10-1.0.jar at http://172.31.37.160:51750/jars/rec_system_2.10-1.0.jar with timestamp 1392181223818 14/02/12 05:00:24 INFO RMProxy: Connecting to ResourceManager at / 0.0.0.0:8032 14/02/12 05:00:25 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/02/12 05:00:26 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/02/12 05:00:27 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) However, if I run in a standalone mode, everything works fine (YARN_CONF_DIR, SPARK_APP, SPARK_YARN_APP_JAR are all set correctly) is it a bug? Best, -- Nan Zhu
Re: persist @ disk-only failing
This is the patch for it: https://github.com/apache/spark/pull/50/. It might be possible to backport it to 0.8. Matei On May 19, 2014, at 2:04 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Matei, I am using 0.8.1 !! But is there a way without moving to 0.9.1 to bypass cache ? On Mon, May 19, 2014 at 1:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: What version is this with? We used to build each partition first before writing it out, but this was fixed a while back (0.9.1, but it may also be in 0.9.0). Matei On May 19, 2014, at 12:41 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi all, When i gave the persist level as DISK_ONLY, still Spark tries to use memory and caches. Any reason ? Do i need to override some parameter elsewhere ? Thanks !
Re: specifying worker nodes when using the repl?
Hi Eric, Have you tried setting the SPARK_WORKER_INSTANCES env variable before running spark-shell? http://spark.apache.org/docs/0.9.0/running-on-yarn.html -Sandy
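A minimal launcher combining Sandy's tip with Eric's existing script might look like this (paths and the worker count are taken from Eric's mail; SPARK_WORKER_INSTANCES is documented for Spark 0.9 on YARN, and defaults to 2, which is the behavior Eric was seeing):

```shell
#!/bin/sh
. /etc/spark/conf.cloudera.spark/spark-env.sh
export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=100   # number of YARN workers (default: 2)
export MASTER=yarn-client
exec $SPARK_HOME/bin/spark-shell
```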
Spark Streaming and Shark | Streaming Taking All CPUs
Hi All I am new to Spark. I was trying to use Spark Streaming and Shark at the same time. I was receiving messages from Kafka and pushing them to HDFS after minor processing. It was working fine, but it was taking all the CPUs, and at the same time in another terminal I tried to access Shark, but it kept on waiting until I stopped the listener. On the web console it showed that all 6 CPUs were taken by the Spark Streaming listener and Shark had zero CPUs. (I have a 3-node test cluster.) Please suggest. Thanks regards -- Anish Sneh http://in.linkedin.com/in/anishsneh
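One hedged suggestion, assuming Spark 0.9 on a standalone cluster (where by default the first application grabs every available core and later applications wait): cap the streaming app's cores so Shark can be scheduled alongside it, e.g. via spark.cores.max before starting the streaming job:

```shell
# sketch: limit the streaming app to 4 of the 6 cores (value is an example)
export SPARK_JAVA_OPTS="-Dspark.cores.max=4"
```

This leaves the remaining cores free for Shark; also remember a streaming app needs at least one core per receiver plus cores for processing.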
Re: unsubscribe
Hey Andrew, Since we're seeing so many of these e-mails, I think it's worth pointing out that it's not really obvious to find unsubscription information for the lists. The community link on the Spark site (http://spark.apache.org/community.html) does not have instructions for unsubscribing; it links to a different archive than the one you posted, which doesn't show that info either. The only place right now I can see that info (without going to the generic Apache link you posted) is by looking at the e-mail's source, where there is a header with the unsubscribe address. So maybe the web site should also list the unsubscribe address, or link to the Apache archive instead of Nabble? I know many people might not like it, but maybe the list messages should have a footer with this administrative info (even if it's just a link to the archive page)? On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote: If you'd like to get off this mailing list, please send an email to user-unsubscr...@spark.apache.org, not the regular user@spark.apache.org list. How to use the Apache mailing list infrastructure is documented here: https://www.apache.org/foundation/mailinglists.html And the Spark User list specifically can be found here: http://mail-archives.apache.org/mod_mbox/spark-user/ -- Marcelo
spark ec2 commandline tool error VPC security groups may not be used for a non-VPC launch
Hi, I'm attempting to run spark-ec2 launch on AWS. My AWS instances would be in our EC2 VPC (which seems to be causing a problem). The two security groups MyClusterName-master and MyClusterName-slaves have already been set up with the same ports open as the security group that spark-ec2 tries to create. (My company has security rules where I don't have permission to create security groups, so they have to be created by someone else ahead of time.) I'm getting the error VPC security groups may not be used for a non-VPC launch when I try to run spark-ec2 launch. Is there something I need to do to make spark-ec2 launch the master and slave instances within the VPC? Here's the command line and the error that I get... command line (I've changed the cluster name to something generic): $SPARK_HOME/ec2/spark-ec2 --key-pair=MyKeyPair '--identity-file=~/.ssh/id_mysshkey' --slaves=2 --instance-type=m3.large --region=us-east-1 --zone=us-east-1a --ami=myami --spark-version=0.9.1 launch MyClusterName error: ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC security groups may not be used for a non-VPC launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response> Setting up security groups... Searching for existing cluster MyClusterName... Launching instances... 
Traceback (most recent call last): File "./spark_ec2.py", line 806, in <module> main() File "./spark_ec2.py", line 799, in main real_main() File "./spark_ec2.py", line 682, in real_main conn, opts, cluster_name) File "./spark_ec2.py", line 344, in launch_cluster block_device_map = block_map) File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/image.py", line 255, in run File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 678, in run_instances File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 925, in get_object boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC security groups may not be used for a non-VPC launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response> Thanks for your help!! Matt
Re: unsubscribe
Agree that the links to the archives should probably point to the Apache archives rather than Nabble's, so the unsubscribe documentation is clearer. Also, an (unsubscribe) link right next to subscribe with the email already generated could help too. I'd be one of those highly against a footer on every email. Who can edit the community page? http://spark.apache.org/community.html It's not in the git repo.
Re: sync master with slaves with bittorrent?
On the ec2 machines, you can update the slaves from the master using something like ~/spark-ec2/copy-dir ~/spark. Spark's TorrentBroadcast relies on the Block Manager to distribute blocks, making it relatively hard to extract. On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com wrote: btw is there a command or script to update the slaves from the master? thanks Daniel
Re: sync master with slaves with bittorrent?
Good catch. In that case, using BitTornado/murder would be better. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Mon, May 19, 2014 at 11:17 AM, Aaron Davidson ilike...@gmail.com wrote: On the ec2 machines, you can update the slaves from the master using something like ~/spark-ec2/copy-dir ~/spark. Spark's TorrentBroadcast relies on the Block Manager to distribute blocks, making it relatively hard to extract. On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com wrote: btw is there a command or script to update the slaves from the master? thanks Daniel On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote: If the codebase for Spark's broadcast is pretty self-contained, you could consider creating a small bootstrap sent out via the doubling rsync strategy that Mosharaf outlined above (called Tree D=2 in the paper) that then pulled the larger Mosharaf, do you have a sense of whether the gains from using Cornet vs Tree D=2 with rsync outweighs the overhead of using a 2-phase broadcast mechanism? Andrew On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.comwrote: One issue with using Spark itself is that this rsync is required to get Spark to work... Also note that a similar strategy is used for *updating* the spark cluster on ec2, where the diff aspect is much more important, as you might only make a small change on the driver node (recompile or reconfigure) and can get a fast sync. On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury mosharafka...@gmail.com wrote: What twitter calls murder, unless it has changed since then, is just a BitTornado wrapper. In 2011, We did some comparison on the performance of murder and the TorrentBroadcast we have right now for Spark's own broadcast (Section 7.1 in http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf). Spark's implementation was 4.5X faster than murder. 
The only issue with using TorrentBroadcast to deploy code/VM is writing a wrapper around it to read from disk, but it shouldn't be too complicated. If someone picks it up, I can give some pointers on how to proceed (I've thought about doing it myself forever, but never ended up actually taking the time; right now I don't have enough free cycles either) Otherwise, murder/BitTornado would be better than the current strategy we have. A third option would be to use rsync; but instead of rsync-ing to every slave from the master, one can simply rsync from the master first to one slave; then use the two sources (master and the first slave) to rsync to two more; then four and so on. Might be a simpler solution without many changes. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Sun, May 18, 2014 at 11:07 PM, Andrew Ash and...@andrewash.comwrote: My first thought would be to use libtorrent for this setup, and it turns out that both Twitter and Facebook do code deploys with a bittorrent setup. Twitter even released their code as open source: https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/ On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.comwrote: I am not an expert in this space either. I thought the initial rsync during launch is really just a straight copy that did not need the tree diff. So it seemed like having the slaves do the copying among it each other would be better than having the master copy to everyone directly. That made me think of bittorrent, though there may well be other systems that do this. From the launches I did today it seems that it is taking around 1 minute per slave to launch a cluster, which can be a problem for clusters with 10s or 100s of slaves, particularly since on ec2 that time has to be paid for. 
On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com wrote: Out of curiosity, do you have a library in mind that would make it easy to set up a bittorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard ec2 launch scripts to reduce the chance of something breaking. On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching a rather large cluster on ec2. It seems like the launch is taking forever on Setting up spark RSYNC'ing /root/spark to slaves... ... It seems that bittorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there are a lot of not very powerful slaves. Just a thought ... cheers Daniel
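[Editor's note] The "doubling rsync" strategy Mosharaf describes (Tree D=2 in the Orchestra paper) can be sketched without any cluster at all. Below is a minimal, hypothetical Python sketch (host names invented) that computes the schedule: each round, every host that already has the data syncs to one host that does not, so coverage doubles per round.

```python
# Sketch of the "doubling rsync" (Tree D=2) schedule: each round pairs every
# up-to-date host with one waiting host, so the up-to-date set doubles.
def doubling_rsync_schedule(master, slaves):
    """Return a list of rounds; each round is a list of (source, dest) pairs."""
    have = [master]        # hosts that already hold the data
    need = list(slaves)    # hosts still waiting for it
    rounds = []
    while need:
        pairs = list(zip(have, need))  # one rsync per up-to-date host
        rounds.append(pairs)
        have = have + [dest for _, dest in pairs]
        need = need[len(pairs):]
    return rounds
```

With one master and seven slaves this finishes in three rounds (1 -> 2 -> 4 -> 8 hosts) instead of seven sequential copies from the master, which is the gain being discussed above.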
Re: How to run the SVM and LogisticRegression
Thanks Xiangrui, But I did not find the directory: examples/src/main/scala/org/apache/spark/examples/mllib. Could you give me more detail or show me one example? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6049.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Which component(s) of Spark do not support IPv6?
Besides Hadoop, are there any other components of Spark that do not support IPv6? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Which-component-s-of-Spark-do-not-support-IPv6-tp6050.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: specifying worker nodes when using the repl?
Sandy, thank you so much — that was indeed my omission! Eric On May 19, 2014, at 10:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Eric, Have you tried setting the SPARK_WORKER_INSTANCES env variable before running spark-shell? http://spark.apache.org/docs/0.9.0/running-on-yarn.html -Sandy On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.net wrote: Hi I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER) /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \ org.apache.spark.deploy.yarn.Client \ --jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar \ --class org.apache.spark.examples.SparkPi \ --args yarn-standalone \ --args 10 \ --num-workers 100 There does not appear to be an (obvious?) way to get more than 2 nodes involved from the repl. I am running the REPL like this: #!/bin/sh . /etc/spark/conf.cloudera.spark/spark-env.sh export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar export SPARK_WORKER_MEMORY=512m export MASTER=yarn-client exec $SPARK_HOME/bin/spark-shell Now if I comment out the line with `export SPARK_JAR=…’ and run this again, I get an error like this: 14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment variable! Usage: org.apache.spark.deploy.yarn.Client [options] Options: --jar JAR_PATH Path to your application's JAR file (required in yarn-cluster mode) --class CLASS_NAME Name of your application's main class (required) --args ARGS Arguments to be passed to your application's main class. Multiple invocations are possible, each will be passed in order. --num-workers NUM Number of workers to start (Default: 2) […] But none of those options are exposed at the `spark-shell’ level. Thanks in advance for your guidance. Eric
Re: How to compile the examples directory?
If you’d like to work on just this code for your own changes, it might be best to copy it to a separate project. Look at http://spark.apache.org/docs/latest/quick-start.html for how to set up a standalone job. Matei On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote: Hi, I am running some examples of Spark on a cluster. Because I need to modify some source code, I have to re-compile the whole Spark using `sbt/sbt assembly`, which takes a long time. I have tried `mvn package` under the examples directory; it failed because of a dependency problem. Any way to avoid compiling the whole Spark project? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com
combinebykey throw classcastexception
I am using CDH5 on a three-machine cluster. I map data from HBase as (String, V) pairs, then call combineByKey like this: .combineByKey[C]( (v: V) => new C(v), // this line throws java.lang.ClassCastException: C cannot be cast to V (c: C, v: V) => C, (c1: C, c2: C) => C) I am very confused by this; there isn't any C-to-V casting at all. What's wrong? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/combinebykey-throw-classcastexception-tp6059.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
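[Editor's note] Without the real C and V types the cast itself can't be diagnosed here, but the type contract of the three functions is worth making concrete: createCombiner is V -> C, mergeValue is (C, V) -> C, and mergeCombiners is (C, C) -> C. A plain-Python sketch of those semantics (no Spark involved; V = int and C = list are illustrative choices):

```python
# Plain-Python simulation of combineByKey's contract. A ClassCastException
# like the one above typically means one of the three functions received or
# returned a value of the wrong type for its slot.
def combine_by_key(pairs, create_combiner, merge_value, merge_combiners):
    # Simulate two partitions, then merge their per-key combiners.
    mid = len(pairs) // 2
    parts = []
    for chunk in (pairs[:mid], pairs[mid:]):
        acc = {}
        for k, v in chunk:
            # mergeValue gets a combiner C built earlier, never a raw V
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        parts.append(acc)
    merged = parts[0]
    for k, c in parts[1].items():
        merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

result = combine_by_key(
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    create_combiner=lambda v: [v],           # V -> C
    merge_value=lambda c, v: c + [v],        # (C, V) -> C
    merge_combiners=lambda c1, c2: c1 + c2,  # (C, C) -> C
)
```

The point of the simulation: the second function must accept a combiner, not a value, which is exactly the boundary where a mistyped lambda would produce a cast error in Scala.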
Re: advice on maintaining a production spark cluster?
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using akka cluster (if that's what spark is using in stand-alone mode) -- are there configuration options we could be setting to make the cluster more robust? We have a custom script which monitors the number of workers (through the web interface) and restarts the cluster when necessary, as well as resolving other issues we face (like spark shells left open permanently claiming resources), and it works, but it's nowhere close to a great solution. What are other folks doing? Is this something that other folks observe as well? I suspect that the loss of workers is tied to jobs that run out of memory on the client side or our use of very large broadcast variables, but I don't have an isolated test case. I'm open to general answers here: for example, perhaps we should simply be using mesos or yarn instead of stand-alone mode. --j
Re: How to run the SVM and LogisticRegression
Thanks Xiangrui, Sorry, I am new to Spark. Could you give me more detail about master or branch-1.0? I do not know what master or branch-1.0 is. Thanks again. On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User List] ml-node+s1001560n6064...@n3.nabble.com wrote: Checkout the master or branch-1.0. Then the examples should be there. -Xiangrui On Mon, May 19, 2014 at 11:36 AM, yxzhao [hidden email] wrote: Thanks Xiangrui, But I did not find the directory: examples/src/main/scala/org/apache/spark/examples/mllib. Could you give me more detail or show me one example? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6065.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to run the SVM and LogisticRegression
Hi yxzhao, Those are branches in the source code git repository. You can get to them with git checkout branch-1.0 once you've cloned the git repository. Cheers, Andrew On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote: Thanks Xiangrui, Sorry I am new for Spark, could you give me more detail about master or branch-1.0 I do not know what master or branch-1.0 is. Thanks again. -- View this message in context: Re: How to run the SVM and LogisticRegression http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6065.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark stalling during shuffle (maybe a memory issue)
Has anyone observed Spark worker threads stalling during a shuffle phase with the following message (one per worker host) being echoed to the terminal on the driver thread? INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 0 to [worker host]... At this point Spark-related activity on the hadoop cluster completely halts: there's no network activity, disk IO or CPU activity, and individual tasks are not completing; the job just sits in this state. At this point we just kill the job; a restart of the Spark service is required. Using identical jobs we were able to bypass this halt point by increasing available heap memory to the workers, but it's odd we don't get an out-of-memory error or any error at all. Upping the memory available isn't a very satisfying answer to what may be going on :) We're running Spark 0.9.0 on CDH5.0 in stand-alone mode. Thanks for any help or ideas you may have! Cheers, Jonathan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Reading from .bz2 files with Spark
Hi Xiangrui, many thanks to you and Sandy for fixing this issue! On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, I submitted a patch and verified it solves the problem. You can download the patch from https://issues.apache.org/jira/browse/HADOOP-10614 . Best, Xiangrui On Fri, May 16, 2014 at 6:48 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, This is the JIRA I created: https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully someone wants to work on it. Best, Xiangrui On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, I could reproduce the bug with Hadoop 2.2.0. Some older versions of Hadoop do not support splittable compression, so you ended up with sequential reads. It is easy to reproduce the bug with the following setup: 1) Workers are configured with multiple cores. 2) BZip2 files are big enough or minPartitions is large enough when you load the file via sc.textFile(), so that one worker has more than one task. Best, Xiangrui On Fri, May 16, 2014 at 4:06 PM, Andrew Ash and...@andrewash.com wrote: Hi Xiangrui, // FYI I'm getting your emails late due to the Apache mailing list outage I'm using CDH4.4.0, which I think uses the MapReduce v2 API. The .jars are named like this: hadoop-hdfs-2.0.0-cdh4.4.0.jar I'm also glad you were able to reproduce! Please paste a link to the Hadoop bug you file so I can follow along. Thanks! Andrew On Tue, May 13, 2014 at 9:08 AM, Xiangrui Meng men...@gmail.com wrote: Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes the problem you described, but it does contain several fixes to bzip2 format. -Xiangrui On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote: Hi all, Is anyone reading and writing to .bz2 files stored in HDFS from Spark with success?
I'm finding the following results on a recent commit (756c96 from 24hr ago) and CDH 4.4.0: Works: val r = sc.textFile("/user/aa/myfile.bz2").count Doesn't work: val r = sc.textFile("/user/aa/myfile.bz2").map((s: String) => s + "|").count Specifically, I'm getting an exception coming out of the bzip2 libraries (see below stacktraces), which is unusual because I'm able to read from that file without an issue using the same libraries via Pig. It was originally created from Pig as well. Digging a little deeper I found this line in the .bz2 decompressor's javadoc for CBZip2InputStream: Instances of this class are not threadsafe. [source] My current working theory is that Spark has a much higher level of parallelism than Pig/Hadoop does and thus I get these wild IndexOutOfBounds exceptions much more frequently (as in can't finish a run over a little 2M row file) vs hardly at all in other libraries. The only other reference I could find to the issue was in presto-users, but the recommendation to leave .bz2 for .lzo doesn't help if I actually do want the higher compression levels of .bz2. Would love to hear if I have some kind of configuration issue or if there's a bug in .bz2 that's fixed in later versions of CDH, or generally any other thoughts on the issue. Thanks!
Andrew Below are examples of some exceptions I'm getting: 14/05/07 15:09:49 WARN scheduler.TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 65535 at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.hbCreateDecodeTables(CBZip2InputStream.java:663) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.createHuffmanDecodingTables(CBZip2InputStream.java:790) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:762) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:798) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397) at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426) at java.io.InputStream.read(InputStream.java:101) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203) at
life if an executor
from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
Re: How to run the SVM and LogisticRegression
Thanks Andrew, Yes, I have downloaded the master code. But, actually I just want to know how to run the classification algorithms SVM and LogisticRegression implemented under /spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification . Thanks. On Mon, May 19, 2014 at 10:37 PM, Andrew Ash [via Apache Spark User List] ml-node+s1001560n6066...@n3.nabble.com wrote: Hi yxzhao, Those are branches in the source code git repository. You can get to them with git checkout branch-1.0 once you've cloned the git repository. Cheers, Andrew -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6070.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Problem when sorting big file
Is your RDD of Strings? If so, you should make sure to use the Kryo serializer instead of the default Java one. It stores strings as UTF8 rather than Java's default UTF16 representation, which can save you half the memory usage in the right situation. Try setting the persistence level on the RDD to MEMORY_AND_DISK_SER and possibly also lowering spark.storage.memoryFraction from 0.6 to 0.4 or so. Andrew On Thu, May 15, 2014 at 2:55 PM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote: Hi there: I have this dataset (about 12G) which I need to sort by key. I used the sortByKey method but when I try to save the file to disk (HDFS in this case) it seems that some tasks run out of time because they have too much data to save and it can't fit in memory. I say this because before the TimeOut exception at the worker there is an OOM exception from an specific task. My question is: is this a common problem at Spark? has anyone been through this issue? The cause of the problem seems to be an unbalanced distribution of data between tasks. I will appreciate any help. Thanks Gustavo
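[Editor's note] Andrew's halving claim is easy to verify locally. The sketch below uses pure Python (no Spark) to compare the byte counts of an ASCII-heavy string under UTF-8 (what Kryo uses for serialized strings) versus UTF-16 (the JVM's in-memory String representation in this era); the sample string is invented.

```python
# ASCII text costs 1 byte/char in UTF-8 but 2 bytes/char in UTF-16,
# which is roughly the memory halving mentioned for Kryo-serialized strings.
s = "some-sort-key-000123"
utf8_bytes = len(s.encode("utf-8"))
utf16_bytes = len(s.encode("utf-16-le"))  # -le variant avoids the 2-byte BOM
ratio = utf16_bytes / utf8_bytes
```

For non-ASCII-heavy data (e.g. CJK text) the ratio shrinks or reverses, so the saving depends on the data, as the reply's "in the right situation" hedge suggests.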
facebook data mining with Spark
Is there any way to get facebook data into Spark and filter the content of it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/facebook-data-mining-with-Spark-tp6072.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Setting queue for spark job on yarn
Hi, How does one submit a spark job to yarn and specify a queue? The code that successfully submits to yarn is: val conf = new SparkConf() val sc = new SparkContext("yarn-client", "Simple App", conf) Where do I need to specify the queue? Thanks in advance for any help on this... Thanks, Ron
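[Editor's note] The thread leaves this unanswered here, so the following is a hedged sketch rather than a confirmed answer: the era's running-on-yarn docs list an environment variable for the allocation queue, and later releases expose it as the `spark.yarn.queue` property; check the docs for your exact Spark version. The queue name `analytics` below is invented.

```python
# Hedged sketch: naming a YARN queue for a Spark app. Verify the setting
# names against the running-on-yarn docs for your Spark version.
import os

# Env var listed in the 0.9-era YARN docs, read before the context starts:
os.environ["SPARK_YARN_QUEUE"] = "analytics"
# In later releases the equivalent is a Spark conf property, e.g.:
#   conf = SparkConf().set("spark.yarn.queue", "analytics")
queue = os.environ["SPARK_YARN_QUEUE"]
```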
Re: life if an executor
They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long to they live for? as long as the worker/slave? or are they tied to the sparkcontext and life/die with it? thx
Re: filling missing values in a sequence
Thanks Sean. Yes, your solution works :-) I did oversimplify my real problem, which has other parameters that go along with the sequence. On Fri, May 16, 2014 at 3:03 AM, Sean Owen so...@cloudera.com wrote: Not sure if this is feasible, but this literally does what I think you are describing: sc.parallelize(rdd1.first to rdd1.last) On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...) I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11) One way to do this is to slide and zip rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...)) x = rdd1.first rdd2 = rdd1 filter (_ != x) rdd3 = rdd2 zip rdd1 rdd4 = rdd3 flatMap { case (x, y) => /* generate missing elements between x and y */ } Another method which I think is more efficient is to use mapPartitions() on rdd1 to be able to iterate on elements of rdd1 in each partition. However, that leaves the boundaries of the partitions to be unfilled. Is there a way within the function passed to mapPartitions, to read the first element in the next partition? The latter approach also appears to work for a general sliding window calculation on the RDD. The former technique requires a lot of sliding and zipping and I believe it is not efficient. If only I could read the next partition...I have tried passing a pointer to rdd1 to the function passed to mapPartitions but the rdd1 pointer turns out to be NULL, I guess because Spark cannot deal with a mapper calling another mapper (since it happens on a worker not the driver) Mohit.
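[Editor's note] The slide-and-zip idea in the thread can be checked locally without Spark; a minimal Python sketch over a plain list (the partition-boundary problem the thread worries about disappears here because the whole sequence is in one place):

```python
# Fill missing integers in a sorted sequence by pairing each element with
# its successor ("slide and zip") and emitting the range between them.
def fill_gaps(xs):
    out = []
    for a, b in zip(xs, xs[1:]):  # successor pairs: (1,2), (2,3), (3,5), ...
        out.extend(range(a, b))   # emits a, a+1, ..., b-1
    out.append(xs[-1])            # the pairing drops the final element
    return out
```

For example, `fill_gaps([1, 2, 3, 5, 8, 11])` yields 1 through 11, matching the desired output in the thread.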
Re: life if an executor
I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice however if the RDD cache can be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one. On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long to they live for? as long as the worker/slave? or are they tied to the sparkcontext and life/die with it? thx
Re: life if an executor
That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote: I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice however if the RDD cache can be dissociated from the JVM heap so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a few one. On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long to they live for? as long as the worker/slave? or are they tied to the sparkcontext and life/die with it? thx