Re: Spark 1.4.0 - Using SparkR on EC2 Instance
That’s correct. We were setting up a Spark EC2 cluster from the command line, then installing RStudio Server, logging into it through the web interface, and attempting to initialize the cluster from within RStudio. We have made some progress on this outside of the thread - I will see what I can compile and share as a potential walkthrough.

On Jul 8, 2015, at 9:25 PM, BenPorter [via Apache Spark User List] wrote:

RedOakMark - just to make sure I understand what you did: you ran the EC2 script on a local machine to spin up the cluster, but then did not try to run anything in R/RStudio from your local machine. Instead you installed RStudio on the driver and ran it as a local cluster from that driver. Is that correct? Otherwise, you make no reference to the master/EC2 server in this code, so I have to assume that means you were running this directly from the master.

Thanks,
Ben
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Using-SparkR-on-EC2-Instance-tp23506p23742.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
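For readers following along, the command-line launch Mark describes uses the spark-ec2 script that ships with Spark 1.4. A minimal sketch, assuming an existing EC2 key pair; the key-pair name, identity file, region, and cluster name below are all placeholders:

```shell
# Launch a small cluster (1 master + 2 slaves) from a Spark 1.4.0 checkout
./ec2/spark-ec2 --key-pair=my-keypair \
                --identity-file=~/.ssh/my-keypair.pem \
                --region=eu-west-1 \
                --slaves=2 \
                launch sparkr-cluster

# SSH into the master (e.g. to install RStudio Server there)
./ec2/spark-ec2 --key-pair=my-keypair \
                --identity-file=~/.ssh/my-keypair.pem \
                --region=eu-west-1 \
                login sparkr-cluster
```

The `launch` and `login` actions are the documented spark-ec2 entry points; `--slaves` controls cluster size.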
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
The API exported in the 1.4 release is different from the one used in the 2014 demo. Please see the latest documentation at http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or Chris's demo from Spark Summit at https://spark-summit.org/2015/events/a-data-frame-abstraction-layer-for-sparkr/

Thanks
Shivaram

On Tue, Jun 30, 2015 at 7:40 AM, Nicholas Sharkey wrote:

Good morning Shivaram, I believe I have our setup close, but I'm getting an error on the last step of the word count example from the Spark Summit slides (https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf). Off the top of your head, can you think of where this error (below, and attached) is coming from? I can get into the details of how I set up this machine if needed, but wanted to keep the initial question short. Thanks.

Begin Code

library(SparkR)
# sc <- sparkR.init("local[2]")
sc <- sparkR.init("ec2-54-171-173-195.eu-west-1.compute.amazonaws.com:[2]")
lines <- textFile(sc, "mytextfile.txt")  # hi hi all all all one one one one
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordcount <- lapply(words,
                    function(word) {
                      list(word, 1)
                    })
counts <- reduceByKey(wordcount, "+", numPartitions = 2)
# Error in (function (classes, fdef, mtable) :
#   unable to find an inherited method for function 'reduceByKey' for
#   signature 'PipelinedRDD, character, numeric'

End Code

On Fri, Jun 26, 2015 at 7:04 PM, Shivaram Venkataraman wrote:

My workflow was to install RStudio on a cluster launched using the Spark EC2 scripts. However, I did a bunch of tweaking after that (like copying the Spark installation over, etc.). When I get some time I'll try to write the steps down in the JIRA.

Thanks
Shivaram

On Fri, Jun 26, 2015 at 10:21 AM, m...@redoakstrategic.com wrote:

So you created an EC2 instance with RStudio installed first, then installed Spark under that same username?
That makes sense, I just want to verify your workflow. Thank you again for your willingness to help!

On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman wrote:

I was using RStudio on the master node of the same cluster in the demo. However, I had installed Spark under the user `rstudio` (i.e. /home/rstudio), and that makes the permissions work correctly. You will need to copy the config files from /root/spark/conf after installing Spark, though, and it might need some more manual tweaks.

Thanks
Shivaram

On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson wrote:

Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? I noticed that it appeared from your browser that you were using EC2 at that time, so I was just curious. It appears that might be one of the possible workarounds - fire up a separate EC2 instance with RStudio Server that initializes the Spark context against a separate Spark cluster.

On Jun 26, 2015, at 11:46 AM, Shivaram Venkataraman wrote:

We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this.

Thanks
Shivaram

On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark wrote:

Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. [...]
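The reduceByKey failure quoted above is a symptom of the API change Shivaram mentions: the SparkR package shipped with Spark 1.4 no longer exports the RDD functions (textFile, flatMap, reduceByKey) used in the 2014 Summit demo. A rough sketch of the documented 1.4 entry points instead; the master URL is a placeholder, and `faithful` is simply R's built-in sample dataset:

```r
library(SparkR)  # assumes the Spark 1.4 SparkR package is on the library path

# Initialize the SparkContext and, on top of it, the SQLContext
sc <- sparkR.init(master = "local[2]", appName = "SparkR-1.4-example")
sqlContext <- sparkRSQL.init(sc)

# In 1.4 the public API is DataFrame-based, not RDD-based
df <- createDataFrame(sqlContext, faithful)
head(filter(df, df$waiting < 50))

sparkR.stop()
```

The DataFrame route avoids the unexported RDD methods entirely, which is why the 2014 slide deck no longer works as written.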
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
Are you using the SparkR from the latest Spark 1.4 release? The function was not available in the older AMPLab version.

Shivaram

On Tue, Jun 30, 2015 at 1:43 PM, Nicholas Sharkey wrote:

Any idea why I can't get the sparkRSQL.init function to work? The other parts of SparkR seem to be working fine. And yes, the SparkR library is loaded. Thanks.

sc <- sparkR.init(master = "ec2-52-18-1-4.eu-west-1.compute.amazonaws.com")
...
sqlContext <- sparkRSQL.init(sc)
Error: could not find function "sparkRSQL.init"

On Tue, Jun 30, 2015 at 10:56 AM, Shivaram Venkataraman wrote:

The API exported in the 1.4 release is different from the one used in the 2014 demo. Please see the latest documentation at http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or Chris's demo from Spark Summit at https://spark-summit.org/2015/events/a-data-frame-abstraction-layer-for-sparkr/ [...]
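Shivaram's question - which SparkR build is actually being loaded - can be checked from the R console before initializing anything. A sketch, assuming Spark was installed under /home/rstudio/spark (the path is a placeholder for your install location):

```r
# Point library() at the SparkR package shipped inside the Spark 1.4 install,
# so an older AMPLab sparkR package elsewhere on .libPaths() is not picked up
library(SparkR, lib.loc = "/home/rstudio/spark/R/lib")

# sparkRSQL.init is only exported by the SparkR bundled with Spark 1.4+;
# if this prints FALSE, an older build is being loaded
print("sparkRSQL.init" %in% getNamespaceExports("SparkR"))
```

An explicit `lib.loc` is the simplest way to rule out a stale package shadowing the 1.4 one.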
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
For anyone monitoring the thread, I was able to successfully install and run a small Spark cluster and model using this method:

First, make sure that the username used to log in to RStudio Server is the one that was used to install Spark on the EC2 instance. Thanks to Shivaram for his help here.

Log in to RStudio and ensure that these references are used - set the library location to the folder where Spark is installed. In my case, /home/rstudio/spark.

# This line loads SparkR (the R package) from the installed directory
library(SparkR, lib.loc = "./spark/R/lib")

The edits to this line were important, so that Spark knew where the install folder was located when initializing the cluster.

# Initialize the Spark local cluster in R, as 'sc'
sc <- sparkR.init("local[2]", "SparkR", "./spark")

From here, we ran a basic model using Spark, from RStudio, which ran successfully.
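Putting Mark's two lines together, the full RStudio-side initialization for this layout is short. A sketch, assuming RStudio's working directory is the home of the user that installed Spark, so that ./spark resolves to /home/rstudio/spark:

```r
# Load the SparkR package bundled with the Spark install itself
library(SparkR, lib.loc = "./spark/R/lib")

# Positional arguments to sparkR.init: master, appName, sparkHome
sc <- sparkR.init("local[2]", "SparkR", "./spark")

# ... run SparkR jobs here ...

# Shut the context down cleanly when finished
sparkR.stop()
```

Passing sparkHome explicitly is what lets sparkR.init find the backend JVM when Spark is not installed in a system-wide location.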
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
Thanks Mark for the update. For those interested, Vincent Warmerdam also has some details on making the /root/spark installation work at https://issues.apache.org/jira/browse/SPARK-8596?focusedCommentId=14604328&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604328

Shivaram

On Sat, Jun 27, 2015 at 12:23 PM, RedOakMark wrote:

For anyone monitoring the thread, I was able to successfully install and run a small Spark cluster and model using this method: [...]
Spark 1.4.0 - Using SparkR on EC2 Instance
Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version, 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it.

Using these instructions (http://spark.apache.org/docs/latest/ec2-scripts.html) we can fire up an EC2 instance (which we have been successful doing - we have gotten the cluster to launch from the command line without an issue). Then, I installed RStudio Server on the same EC2 instance (the master) and successfully logged into it (using the test/test user) through the web browser.

This is where I get stuck - within RStudio, when I try to reference/find the folder where SparkR was installed, to load the SparkR library and initialize a SparkContext, I get permissions errors on the folders, or the library cannot be found because I cannot find the folder in which the library is sitting.

Has anyone successfully launched and utilized SparkR 1.4.0 in this way, with RStudio Server running on top of the master instance? Are we on the right track, or should we manually launch a cluster and attempt to connect to it from another instance running R?

Thank you in advance!

Mark
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this.

Thanks
Shivaram

On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark wrote:

Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. [...]
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
So you created an EC2 instance with RStudio installed first, then installed Spark under that same username? That makes sense, I just want to verify your workflow. Thank you again for your willingness to help!

On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman wrote:

I was using RStudio on the master node of the same cluster in the demo. However, I had installed Spark under the user `rstudio` (i.e. /home/rstudio), and that makes the permissions work correctly. You will need to copy the config files from /root/spark/conf after installing Spark, though, and it might need some more manual tweaks.

Thanks
Shivaram

On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson wrote:

Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? [...]
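Shivaram's suggestion in this exchange - install Spark under the `rstudio` user and copy the cluster config over - comes down to a few commands on the master. A sketch under stated assumptions: the spark-ec2 layout puts the cluster's Spark in /root/spark, and the exact 1.4.0 binary package name below may need adjusting for your Hadoop build:

```shell
# Run as root on the EC2 master
cd /home/rstudio

# Fetch and unpack a Spark 1.4.0 build under the rstudio user's home
wget http://archive.apache.org/dist/spark/spark-1.4.0/spark-1.4.0-bin-hadoop1.tgz
tar xzf spark-1.4.0-bin-hadoop1.tgz
mv spark-1.4.0-bin-hadoop1 spark

# Reuse the running cluster's configuration
cp /root/spark/conf/* /home/rstudio/spark/conf/

# Give the RStudio login user ownership, avoiding the permissions errors
chown -R rstudio:rstudio /home/rstudio/spark
```

This is what makes `library(SparkR, lib.loc = "./spark/R/lib")` resolve cleanly from an RStudio session running as `rstudio`.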
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? I noticed that it appeared from your browser that you were using EC2 at that time, so I was just curious. It appears that might be one of the possible workarounds - fire up a separate EC2 instance with RStudio Server that initializes the Spark context against a separate Spark cluster.

On Jun 26, 2015, at 11:46 AM, Shivaram Venkataraman wrote:

We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this.

Thanks
Shivaram

On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark wrote:

Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. [...]