Spark MOOC - early access
Hi Spark Devs and Users,

BerkeleyX and Databricks are currently developing two Spark-related MOOCs on edX (intro: https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x; ML: https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x), the first of which starts on June 1st. Together these courses have over 75K enrolled students!

To help students perform the course exercises, we have created a Vagrant box that contains Spark and IPython (running on 32-bit Ubuntu). This simplifies user setup and helps us support students. We are writing to give you early access to the VM environment and the first assignment, and to request your help testing the VM and assignment before we unleash them on 75K people (see instructions below). We're happy to help if you have any difficulties getting the VM set up; please feel free to contact me (marco.s...@gmail.com) with any issues, comments, or questions.

Sincerely,
Marco Shaw
Spark MOOC TA

(This was sent as an HTML-formatted email; duplicated links have been consolidated.)

1. Install VirtualBox (https://www.virtualbox.org/wiki/Downloads) on your OS (Windows tutorial: https://www.youtube.com/watch?v=06Sf-m64fcY).

2. Install Vagrant (https://www.vagrantup.com/downloads.html) on your OS (Windows tutorial: https://www.youtube.com/watch?v=LZVS23BaA1I).

3. Install the virtual machine using the following steps (Windows tutorial: https://www.youtube.com/watch?v=ZuJCqHC7IYc):
   a. Create a custom directory (e.g. c:\users\marco\myvagrant or /home/marco/myvagrant).
   b. Download https://raw.githubusercontent.com/spark-mooc/mooc-setup/master/Vagrantfile to the custom directory (NOTE: it must be named exactly "Vagrantfile", with no extension).
   c. Open a DOS prompt (Windows) or terminal (Mac/Linux) in the custom directory and issue the command "vagrant up".

4. Perform basic commands in the VM as described below (Windows tutorial: https://www.youtube.com/watch?v=bkteLH77IR0):
   a. To start the VM, issue "vagrant up" from a DOS prompt (Windows) or terminal (Mac/Linux).
   b. To stop the VM, issue "vagrant halt".
   c. To erase or delete the VM, issue "vagrant destroy".
   d. Once the VM is running, access the notebook by opening a web browser to http://localhost:8001/.

5. Use the test notebook as described below (Windows tutorial: https://www.youtube.com/watch?v=mlfAmyF3Q-s):
   a. Start the VM with "vagrant up".
   b. Once the VM is running, open a web browser to http://localhost:8001/.
   c. Upload this IPython notebook: https://raw.githubusercontent.com/spark-mooc/mooc-setup/master/vm_test_student.ipynb.
   d. Run through the notebook.

6. Play around with the first MOOC assignment (email Marco for details when you get to this point).

7. Please answer the following questions:
   a. What machine are you using (OS, RAM, CPU, age)?
   b. How long did the entire process take?
   c. How long did the VM download take? Relatedly, where are you located?
   d. Do you have any other comments/suggestions?
Need some guidance
**Learning the ropes**

I'm trying to grasp the concept of using pipelines in PySpark...

Simplified example:

data = [(1, 'alpha'), (1, 'beta'), (1, 'foo'), (1, 'alpha'), (2, 'alpha'), (2, 'alpha'), (2, 'bar'), (3, 'foo')]

Desired outcome: [(1, 3), (2, 2), (3, 1)]

Basically, for each key I want the number of unique values. I've tried different approaches, but am I really using Spark effectively? I wondered if I should do something like:

rdd = sc.parallelize(data)
rdd.groupByKey().collect()

Then I wondered if I could do something like a foreach over each key, and then map the actual values and reduce them. Pseudo-code:

rdd.groupByKey()
   .keys
   .foreach(_.values
       .map(lambda x: (x, 1))
       .reduceByKey(lambda a, b: a + b)
       .count())

I was somehow hoping that each key would end up paired with that count, and thus carry the number of its unique values, which is exactly what I think I'm looking for. Am I way off base on how I could accomplish this?

Marco
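(For reference: the transformation above collapses into two standard RDD operations, distinct() to drop duplicate (key, value) pairs followed by countByKey(), with no groupByKey() or nested foreach needed. Below is a plain-Python sketch of that pipeline, with the corresponding RDD call noted beside each step; it only models the logic and does not require a SparkContext.)

```python
from collections import Counter

# Plain-Python model of: sc.parallelize(pairs).distinct().countByKey()
pairs = [(1, 'alpha'), (1, 'beta'), (1, 'foo'), (1, 'alpha'),
         (2, 'alpha'), (2, 'alpha'), (2, 'bar'), (3, 'foo')]

# rdd.distinct(): keep one copy of each (key, value) pair
distinct_pairs = set(pairs)

# rdd.countByKey(): count the surviving pairs per key
unique_counts = Counter(k for k, _ in distinct_pairs)

print(sorted(unique_counts.items()))  # [(1, 3), (2, 2), (3, 1)]
```

In actual PySpark, sc.parallelize(pairs).distinct().countByKey() returns a dict-like mapping of key to unique-value count on the driver; avoiding groupByKey() also means no key's full value list ever has to be materialized on a single executor.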
Re: Spark Team - Paco Nathan said that your team can help
Hi,

Let me reword your request so you understand how (too) generic your question is: "Hi, I have $10,000. Please find me some means of transportation so I can get to work."

Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from Cloudera, Hortonworks, or MapR, for example.

Marco

On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote:

Hi Apache-Spark team, what are the system requirements for installing Hadoop and Apache Spark? I have attached a screenshot of GParted.

Thanks and regards,
Sudipta

--
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099
[attachment: Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png]
Re: Spark Team - Paco Nathan said that your team can help
Sudipta - Please don't ever come here or post here again.

On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote:

Hi Nicos, taking forward your argument, please be a smart a$$ and don't use unprofessional language just for the sake of being a moderator. Paco Nathan is respected for the dignity he carries in sharing his knowledge and making it available free for a$$es like us, right! So just mind your tongue next time you put such a$$ in your mouth.

Best Regards,
Sudipta

On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com wrote:

Folks, just a gentle reminder we owe to ourselves:
- This is a public forum and we need to behave accordingly; it is not a place to vent frustration in a rude way.
- Getting attention here is an earned privilege, not an entitlement.
- This is not a "Platinum Support" department of your vendor; rather, it is an open-source collaboration forum where people volunteer their time to pay attention to your needs.
- There are still many gray areas, so be patient and articulate questions in as much detail as possible if you want to get quick help and not just be perceived as a smart a$$.

FYI - Paco Nathan is a well-respected Spark evangelist, and many people, including myself, owe to his passion for jumping on the Spark platform's promise. People like Sean Owen keep us believing in things when we feel like we're hitting a dead end. Please be respectful of the connections you are prized with and act civilized.

Have a great day!
- Nicos

On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote:

Yes, this isn't a well-formed question, and got maybe the response it deserved, but the tone is veering off the rails. I just got a much ruder reply from Sudipta privately, which I will not forward. Sudipta, I suggest you take the responses you've gotten so far as about as much answer as can be had here, do some work yourself, and come back with much more specific questions, and it will all be helpful and polite again.
On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote:

Hi Marco,

Thanks for the confirmation. Please let me know what "a lot more detail" you need to answer a very specific question: WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on a system? Please let me know if you need any further information, and if you don't know, please drive across with the $1 to Sir Paco Nathan and get me the answer.

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote:

Hi,

Let me reword your request so you understand how (too) generic your question is: "Hi, I have $10,000. Please find me some means of transportation so I can get to work."

Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from Cloudera, Hortonworks, or MapR, for example.

Marco

On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote:

Hi Apache-Spark team, what are the system requirements for installing Hadoop and Apache Spark? I have attached a screenshot of GParted.

Thanks and regards,
Sudipta

--
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099
Re: Spark Team - Paco Nathan said that your team can help
(Starting over...)

The best place to look for requirements would be the individual pages of each technology. As for absolute minimum requirements, I would suggest 50GB of disk space and at least 8GB of memory. This is the absolute minimum.

Architecting a solution like the one you are looking for is very complex. If you are just looking for a proof of concept, consider a Docker image, or go to Cloudera/Hortonworks/MapR and look for their express VMs, which can usually run on Oracle VirtualBox or VMware.

Marco

On Thu, Jan 22, 2015 at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote:

Hi Apache-Spark team, what are the system requirements for installing Hadoop and Apache Spark? I have attached a screenshot of GParted.

Thanks and regards,
Sudipta

--
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099
Re: DeepLearning and Spark ?
Pretty vague on details: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199

On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Hi all,

Deep learning algorithms are popular and achieve state-of-the-art performance on several real-world machine learning problems. Currently there is no DL implementation in Spark, and I wonder if there is ongoing work on this topic. We can do DL in Spark with Sparkling Water and H2O, but this adds an additional software stack. Deeplearning4j seems to implement a distributed version of many popular DL algorithms; porting DL4J to Spark could be interesting. Google describes an implementation of large-scale DL in this paper, http://research.google.com/archive/large_deep_networks_nips2012.html, based on model parallelism and data parallelism.

So, I'm trying to imagine what a good design for DL algorithms in Spark would be. Spark already has RDDs (for data parallelism). Can GraphX be used for the model parallelism (as DNNs are generally designed as DAGs)? And what about using GPUs for local parallelism (a mechanism to push partitions into GPU memory)?

What do you think about this?

Cheers,
Jao
Re: when will the spark 1.3.0 be released?
When it is ready.

On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote:

Hi! When will Spark 1.3.0 be released? I want to use the new LDA feature.

Thank you!
Re: Starting with spark
First thing... go into Cloudera Manager and make sure that the Spark service (master?) is started.

Marco

On Thu, Jul 24, 2014 at 7:53 AM, Sameer Sayyed sam.sayyed...@gmail.com wrote:

Hello All,

I am a new Spark user, and I am using cloudera-quickstart-vm-5.0.0-0-vmware to execute the sample Spark examples. I am very sorry for the silly and basic question, but I am not able to deploy and execute the sample examples. Please suggest how to get started with Spark.

Thanks in advance.

Regards,
Sam
Re: Running Spark on Microsoft Azure HDInsight
I'm a Spark and HDInsight novice, so I could be wrong...

HDInsight is based on HDP2, so my guess is that you have the option of installing/configuring Spark in cluster mode (YARN), or running in standalone mode and packaging the Spark binaries with your job.

Everything I seem to find is related to UNIX shell scripts, so one might need to pull apart some of those scripts to work out how to run this on Windows.

Interesting project...

Marco

On Mon, Jul 14, 2014 at 8:00 AM, Niek Tax niek...@gmail.com wrote:

Hi everyone,

Currently I am working on parallelizing a machine learning algorithm using a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop MapReduce, but since my algorithm is iterative, the job scheduling overhead and data loading overhead severely limit its performance in terms of training time. Since recently, HDInsight supports Hadoop 2 with YARN, which I thought would allow me to run Spark jobs, which seem a better fit for my task.

So far, however, I have not been able to find out how I can run Apache Spark jobs on an HDInsight cluster. It seems that remote job submission (which would be my preference) is not possible for Spark on HDInsight, as the REST endpoints for Oozie and Templeton do not seem to support submission of Spark jobs. I also tried RDPing into the headnode to submit jobs from there. On the headnode drives I can find other new YARN computation models like Tez, and I even managed to run Tez jobs through YARN. Spark, however, seems to be missing.

Does this mean that HDInsight currently does not support Spark, even though it supports Hadoop versions with YARN? Or do I need to install Spark on the HDInsight cluster first, in some way? Or is there maybe something else that I'm missing, and can I run Spark jobs on HDInsight some other way?

Many thanks in advance!

Kind regards,
Niek Tax
Re: Running Spark on Microsoft Azure HDInsight
Looks like going with cluster mode is not a good idea: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/

It seems a non-HDInsight VM might be needed to act as the Spark master node.

Marco

On Mon, Jul 14, 2014 at 12:43 PM, Marco Shaw marco.s...@gmail.com wrote:

I'm a Spark and HDInsight novice, so I could be wrong... HDInsight is based on HDP2, so my guess is that you have the option of installing/configuring Spark in cluster mode (YARN), or running in standalone mode and packaging the Spark binaries with your job. Everything I seem to find is related to UNIX shell scripts, so one might need to pull apart some of those scripts to work out how to run this on Windows. Interesting project... Marco

On Mon, Jul 14, 2014 at 8:00 AM, Niek Tax niek...@gmail.com wrote:

Hi everyone,

Currently I am working on parallelizing a machine learning algorithm using a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop MapReduce, but since my algorithm is iterative, the job scheduling overhead and data loading overhead severely limit its performance in terms of training time. Since recently, HDInsight supports Hadoop 2 with YARN, which I thought would allow me to run Spark jobs, which seem a better fit for my task.

So far, however, I have not been able to find out how I can run Apache Spark jobs on an HDInsight cluster. It seems that remote job submission (which would be my preference) is not possible for Spark on HDInsight, as the REST endpoints for Oozie and Templeton do not seem to support submission of Spark jobs. I also tried RDPing into the headnode to submit jobs from there. On the headnode drives I can find other new YARN computation models like Tez, and I even managed to run Tez jobs through YARN. Spark, however, seems to be missing.

Does this mean that HDInsight currently does not support Spark, even though it supports Hadoop versions with YARN? Or do I need to install Spark on the HDInsight cluster first, in some way? Or is there maybe something else that I'm missing, and can I run Spark jobs on HDInsight some other way?

Many thanks in advance!

Kind regards,
Niek Tax
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
Can you provide links to the sections that are confusing?

My understanding is that the HDP1 binaries do not need YARN, while the HDP2 binaries do. Note that you can also install Hortonworks' Spark RPM... For production, in my opinion, RPMs are better for manageability.

On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:

Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference?

sent from my HTC

On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:

Konstantin,

HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf. Let me know if you see issues with the tech preview.

SparkPi example on HDP 2.0: I downloaded the Spark 1.0 pre-built binaries from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark web site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got this error:

Application application_1404470405736_0044 failed 3 times due to AM Container for appattempt_1404470405736_0044_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Failing this attempt. Failing the application.

Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 1, --num-executors, 3)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
  --jar JAR_PATH      Path to your application's JAR file (required)
  --class CLASS_NAME  Name of your application's main class (required)
  ...bla-bla-bla

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
That is confusing based on the context you provided. This might take more time than I can spare to try to understand.

For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. Cloudera's CDH 5 express VM includes Spark, but the service isn't running by default. I can't remember for MapR...

Marco

On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:

Marco,

Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf. HDP 2.1 means YARN; at the same time, they propose to install an RPM.

On the other hand, http://spark.apache.org/ says: "Integrated with Hadoop: Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed."

And this is confusing for me... do I need the RPM installation or not?

Thank you,
Konstantin Kudryavtsev

On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote:

Can you provide links to the sections that are confusing? My understanding is that the HDP1 binaries do not need YARN, while the HDP2 binaries do. Note that you can also install Hortonworks' Spark RPM... For production, in my opinion, RPMs are better for manageability.

On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:

Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference?

sent from my HTC

On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:

Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf. Let me know if you see issues with the tech preview.
Re: Spark Summit 2014 Day 2 Video Streams?
They are recorded... for example, 2013: http://spark-summit.org/2013. I'm assuming the 2014 videos will be up in 1-2 weeks.

Marco

On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

Are these sessions recorded?

On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:

General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
Track A: http://www.ustream.tv/channel/track-a1
Track B: http://www.ustream.tv/channel/track-b1
Track C: http://www.ustream.tv/channel/track-c1

On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com wrote:

I attended yesterday on ustream.tv, but can't find the links to today's streams anywhere. Help!

--
Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
Re: Spark vs Google cloud dataflow
Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading?

On Jun 27, 2014, at 9:40 AM, Dean Wampler deanwamp...@gmail.com wrote:

... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over Scalding (which abstracts over Cascading, which is being moved from MR to Spark) and over Storm for event processing.

On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen so...@cloudera.com wrote:

On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia buendia...@gmail.com wrote:

Summingbird is for map/reduce. Dataflow is the third generation of Google's map/reduce, and it generalizes map/reduce the way Spark does. See more about this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s

Yes, my point was that Summingbird is similar in that it is a higher-level service for batch/streaming computation, not that it is similar for being MapReduce-based.

It seems Dataflow is based on this paper: http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf. FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflow is more than that, but yeah, that seems to be some of the 'language'. It is similar in that it is a distributed collection abstraction.

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
Re: Spark vs Google cloud dataflow
Sorry. Never mind... I guess that's what Summingbird is all about. I had never heard of it.

On Jun 27, 2014, at 7:10 PM, Marco Shaw marco.s...@gmail.com wrote:

Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading?

On Jun 27, 2014, at 9:40 AM, Dean Wampler deanwamp...@gmail.com wrote:

... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over Scalding (which abstracts over Cascading, which is being moved from MR to Spark) and over Storm for event processing.
Re: How to Run Machine Learning Examples
About run-example: I've tried the MapR, Hortonworks, and Cloudera distributions with their Spark packages, and none seem to package it. Am I missing something? Is this only provided with the Spark project's pre-built binaries or with source installs?

Marco

On May 22, 2014, at 5:04 PM, Stephen Boesch java...@gmail.com wrote:

There is a bin/run-example.sh example-class [args]

2014-05-22 12:48 GMT-07:00 yxzhao yxz...@ualr.edu:

I want to run the LR, SVM, and NaiveBayes algorithms implemented in the following directory on my data set, but I did not find a sample command line to run them. Can anybody help? Thanks.

spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Express VMs - good idea?
Hi,

I've been wanting to play with Spark, and I wanted to fast-track things by just using one of the vendors' express VMs. I've tried Cloudera CDH 5.0 and Hortonworks HDP 2.1.

I haven't written down all of my issues, but for certain, when I try to run spark-shell it doesn't work. Cloudera seems to crash, and both complain when I try to use the SparkContext in a simple Scala command.

So, just a basic question: has anyone had success getting these express VMs to work properly with Spark *out of the box* (HDP does require you to install Spark manually)? I know Cloudera recommends 8GB of RAM, but I've been running it with 4GB. Could it be that 4GB is just not enough and is causing these issues, or have others had success using these Hadoop 2.x pre-built VMs with Spark 0.9.x?

Marco