Spark MOOC - early access

2015-05-21 Thread Marco Shaw
Hi Spark Devs and Users,

BerkeleyX and Databricks are currently developing two Spark-related MOOCs on edX
(intro: https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x,
ML: https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x),
the first of which starts on June 1st.  Together these courses have over 75K
enrolled students!

To help students work through the course exercises, we have created a Vagrant
box that contains Spark and IPython (running on 32-bit Ubuntu).  This simplifies
user setup and helps us support them.

We are writing to give you early access to the VM environment and the first
assignment, and to request your help testing the VM/assignment before we
unleash it on 75K people (see the instructions below).  We're happy to help if
you have any difficulties getting the VM set up; please feel free to contact me
(marco.s...@gmail.com) with any issues, comments, or questions.

Sincerely,
Marco Shaw
Spark MOOC TA

1. Install VirtualBox (https://www.virtualbox.org/wiki/Downloads) for your OS
   (Windows tutorial: https://www.youtube.com/watch?v=06Sf-m64fcY).

2. Install Vagrant (https://www.vagrantup.com/downloads.html) for your OS
   (Windows tutorial: https://www.youtube.com/watch?v=LZVS23BaA1I).

3. Install the virtual machine using the following steps (Windows tutorial:
   https://www.youtube.com/watch?v=ZuJCqHC7IYc):
   a. Create a custom directory (e.g. c:\users\marco\myvagrant or
      /home/marco/myvagrant).
   b. Download
      https://raw.githubusercontent.com/spark-mooc/mooc-setup/master/Vagrantfile
      to the custom directory (NOTE: it must be named exactly "Vagrantfile",
      with no extension).
   c. Open a DOS prompt (Windows) or terminal (Mac/Linux) in the custom
      directory and issue the command "vagrant up".

4. Perform basic commands in the VM as described below (Windows tutorial:
   https://www.youtube.com/watch?v=bkteLH77IR0):
   a. To start the VM, from a DOS prompt (Windows) or terminal (Mac/Linux),
      issue the command "vagrant up".
   b. To stop the VM, from a DOS prompt (Windows) or terminal (Mac/Linux),
      issue the command "vagrant halt".
   c. To erase or delete the VM, from a DOS prompt (Windows) or terminal
      (Mac/Linux), issue the command "vagrant destroy".
   d. Once the VM is running, to access the notebook, open a web browser to
      http://localhost:8001.

5. Use the test notebook as described below (Windows tutorial:
   https://www.youtube.com/watch?v=mlfAmyF3Q-s):
   a. To start the VM, from a DOS prompt (Windows) or terminal (Mac/Linux),
      issue the command "vagrant up".
   b. Once the VM is running, open a web browser to http://localhost:8001.
   c. Upload this IPython notebook:
      https://raw.githubusercontent.com/spark-mooc/mooc-setup/master/vm_test_student.ipynb
   d. Run through the notebook.
   (A consolidated Mac/Linux command summary of steps 3-5 appears at the end
   of this message.)

6. Play around with the first MOOC assignment (email Marco for details when
   you get to this point).

7. Please answer the following questions:
   a. What machine are you using (OS, RAM, CPU, age)?
   b. How long did the entire process take?
   c. How long did the VM download take?  Relatedly, where are you located?
   d. Do you have any other comments/suggestions?
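
For convenience, steps 3 to 5 boil down to the following command sequence on
Mac/Linux (Windows users run the same vagrant commands from a DOS prompt; curl
is just one way to fetch the file, and the directory name is only an example):

    # Prerequisites: VirtualBox and Vagrant installed (steps 1-2 above).
    mkdir /home/marco/myvagrant && cd /home/marco/myvagrant
    curl -O https://raw.githubusercontent.com/spark-mooc/mooc-setup/master/Vagrantfile

    vagrant up        # download (first run) and start the VM
    vagrant halt      # stop the VM
    vagrant destroy   # erase/delete the VM

    # With the VM running, open http://localhost:8001 in a browser and upload
    # the test notebook (vm_test_student.ipynb) linked in step 5.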


Need some guidance

2015-04-13 Thread Marco Shaw
Learning the ropes

I'm trying to grasp the concept of using the pipeline in pySpark...

Simplified example:

list = [(1,'alpha'),(1,'beta'),(1,'foo'),(1,'alpha'),(2,'alpha'),(2,'alpha'),(2,'bar'),(3,'foo')]

Desired outcome:
[(1,3),(2,2),(3,1)]

Basically for each key, I want the number of unique values.

I've tried different approaches, but am I really using Spark effectively?
I wondered if I would do something like:
 input=sc.parallelize(list)
 input.groupByKey().collect()

Then I wondered if I could do something like a foreach over each key value,
and then map the actual values and reduce them.  Pseudo-code:

input.groupbykey()
.keys
.foreach(_.values
.map(lambda x: x,1)
.reducebykey(lambda a,b:a+b)
.count()
)

I was somehow hoping that the key would get the current value of count, and
thus be the count of the unique keys, which is exactly what I think I'm
looking for.

Am I way off base on how I could accomplish this?

Marco
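
For reference, here is a minimal sketch (not part of the original thread) of
one way to get the per-key distinct counts in PySpark; it assumes an existing
SparkContext named sc, as in the pyspark shell. De-duplicate the (key, value)
pairs with distinct(), then count what remains per key with reduceByKey().
This avoids groupByKey(), which would gather every value for a key just to
count them.

pairs = [(1, 'alpha'), (1, 'beta'), (1, 'foo'), (1, 'alpha'),
         (2, 'alpha'), (2, 'alpha'), (2, 'bar'), (3, 'foo')]
rdd = sc.parallelize(pairs)

unique_counts = (rdd.distinct()                        # drop duplicate (key, value) pairs
                    .map(lambda kv: (kv[0], 1))        # one count per remaining pair
                    .reduceByKey(lambda a, b: a + b))  # sum the counts per key

print(sorted(unique_counts.collect()))   # [(1, 3), (2, 2), (3, 1)]

On the sample data above this prints [(1, 3), (2, 2), (3, 1)], which matches
the desired outcome.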


Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Hi,

Let me reword your request so you understand how (too) generic your question
is:

"Hi, I have $10,000, please find me some means of transportation so I can get
to work."

Please provide (a lot) more details. If you can't, consider using one of the 
pre-built express VMs from either Cloudera, Hortonworks or MapR, for example. 

Marco



 On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com 
 wrote:
 
 
 
 Hi Apache-Spark team ,
 
 What are the system requirements for installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.
 
 
 Thanks and regards,
 Sudipta 
 
 
 
 
 -- 
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture 
 Call me +919019578099
 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 



Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Sudipta - Please don't ever come here or post here again.

On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee 
asudipta.baner...@gmail.com wrote:

 Hi Nicos, taking forward your argument, please be a smart a$$ and don't use
 unprofessional language just for the sake of being a moderator.
 Paco Nathan is respected for the dignity he carries in sharing his
 knowledge and making it available free for a$$es like us right!
 So just mind your tongue next time you put such a$$ in your mouth.

 Best Regards,
 Sudipta

 On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com wrote:

 Folks,
 Just a gentle reminder of what we owe to ourselves:
 - this is a public forum and we need to behave accordingly; it is not a
 place to vent frustration in a rude way
 - getting attention here is an earned privilege, not an entitlement
 - this is not a “Platinum Support” department of your vendor, but rather an
 open source collaboration forum where people volunteer their time to pay
 attention to your needs
 - there are still many gray areas, so be patient and articulate questions
 in as much detail as possible if you want to get quick help and not just
 be perceived as a smart a$$

 FYI - Paco Nathan is a well-respected Spark evangelist, and many people,
 including myself, owe our jump onto the Spark platform's promise to his
 passion. People like Sean Owen keep us believing in things when we feel like
 hitting a dead end.

 Please, be respectful of what connections you are prized with and act
 civilized.

 Have a great day!
 - Nicos


  On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote:
 
  Yes, this isn't a well-formed question, and got maybe the response it
  deserved, but the tone is veering off the rails. I just got a much
  ruder reply from Sudipta privately, which I will not forward. Sudipta,
  I suggest you take the responses you've gotten so far as about as much
  answer as can be had here and do some work yourself, and come back
  with much more specific questions, and it will all be helpful and
  polite again.
 
  On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
  Hi Marco,
 
  Thanks for the confirmation. Please let me know what the lot more details
  are that you need to answer a very specific question: WHAT IS THE MINIMUM
  HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on
  a system? Please let me know if you need any further information, and if
  you don't know, please drive across with the $1 to Sir Paco Nathan and get
  me the answer.
 
  Thanks and Regards,
  Sudipta
 
  On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com
 wrote:
 
  Hi,
 
  Let me reword your request so you understand how (too) generic your
  question is
 
  Hi, I have $10,000, please find me some means of transportation so I
 can
  get to work.
 
  Please provide (a lot) more details. If you can't, consider using one
 of
  the pre-built express VMs from either Cloudera, Hortonworks or MapR,
 for
  example.
 
  Marco
 
 
 
  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
 
 
 




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099



Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
(Starting over...)

The best place to look for the requirements would be at the individual
pages of each technology.

As for absolute minimum requirements, I would suggest 50GB of disk space
and at least 8GB of memory.  This is the absolute minimum.

Architecting a solution like the one you are looking for is very complex.  If
you are just looking for a proof of concept, consider a Docker image, or go to
Cloudera/Hortonworks/MapR and look for their express VMs, which can usually
run on Oracle VirtualBox or VMware.

Marco


On Thu, Jan 22, 2015 at 7:36 AM, Sudipta Banerjee 
asudipta.baner...@gmail.com wrote:



 Hi Apache-Spark team ,

 What are the system requirements for installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.


 Thanks and regards,
 Sudipta




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099





Re: DeepLearning and Spark ?

2015-01-09 Thread Marco Shaw
Pretty vague on details:

http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199


 On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
 
 Hi all,
 
 DeepLearning algorithms are popular and achieve state-of-the-art performance
 on several real-world machine learning problems. Currently there is no DL
 implementation in Spark, and I wonder if there is ongoing work on this topic.

 We can do DL in Spark with Sparkling Water and H2O, but this adds an
 additional software stack.

 Deeplearning4j seems to implement distributed versions of many popular DL
 algorithms. Porting DL4J to Spark could be interesting.

 Google describes an implementation of large-scale DL, based on model
 parallelism and data parallelism, in this paper:
 http://research.google.com/archive/large_deep_networks_nips2012.html

 So, I'm trying to imagine what a good design for a DL algorithm in Spark
 should be. Spark already has RDDs (for data parallelism). Can GraphX be used
 for the model parallelism (as DNNs are generally designed as DAGs)? And what
 about using GPUs for local parallelism (a mechanism to push partitions into
 GPU memory)?


 What do you think about this?
 
 
 Cheers,
 
 Jao
 


Re: when will the spark 1.3.0 be released?

2014-12-16 Thread Marco Shaw
When it is ready. 



 On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote:
 
 Hi!

 When will Spark 1.3.0 be released?
 I want to use the new LDA feature.
 Thank you!




Re: Starting with spark

2014-07-24 Thread Marco Shaw
First thing...  Go into the Cloudera Manager and make sure that the Spark
service (master?) is started.

Marco


On Thu, Jul 24, 2014 at 7:53 AM, Sameer Sayyed sam.sayyed...@gmail.com
wrote:

 Hello All,

 I am a new user of Spark. I am using cloudera-quickstart-vm-5.0.0-0-vmware
 to execute the sample examples of Spark.
 I am very sorry for the silly and basic question.
 I am not able to deploy and execute the sample examples of Spark.

 Please suggest how to start with Spark.

 Please help me.
 Thanks in advance.

 Regards,
 Sam



Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
I'm a Spark and HDInsight novice, so I could be wrong...

HDInsight is based on HDP2, so my guess here is that you have the option of
installing/configuring Spark in cluster mode (YARN) or running it in standalone
mode and packaging the Spark binaries with your job.

Everything I seem to look at is related to UNIX shell scripts.  So, one
might need to pull apart some of these scripts to pick out how to run this
on Windows.

Interesting project...

Marco



On Mon, Jul 14, 2014 at 8:00 AM, Niek Tax niek...@gmail.com wrote:

 Hi everyone,

 Currently I am working on parallelizing a machine learning algorithm using
 a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop
 MapReduce, but since my algorithm is iterative, the job scheduling overhead
 and data loading overhead severely limit the performance of my algorithm
 in terms of training time.

 HDInsight recently added support for Hadoop 2 with YARN, which I thought
 would allow me to run Spark jobs, which seem a better fit for my task. So
 far, however, I have not been able to find out how I can run Apache Spark
 jobs on an HDInsight cluster.

 It seems like remote job submission (which would have my preference) is
 not possible for Spark on HDInsight, as REST endpoints for Oozie and
 templeton do not seem to support submission of Spark jobs. I also tried to
 RDP to the headnode for job submission from the headnode. On the headnode
 drives I can find other new YARN computation models like Tez and I also
 managed to run Tez jobs on it through YARN. However, Spark seems to be
 missing. Does this mean that HDInsight currently does not support Spark,
 even though it supports Hadoop versions with YARN? Or do I need to install
 Spark on the HDInsight cluster first, in some way? Or is there maybe
 something else that I'm missing and can I run Spark jobs on HDInsight some
 other way?

 Many thanks in advance!


 Kind regards,

 Niek Tax



Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
Looks like going with cluster mode is not a good idea:
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/

Seems like a non-HDInsight VM might be needed to make it the Spark master
node.

Marco



On Mon, Jul 14, 2014 at 12:43 PM, Marco Shaw marco.s...@gmail.com wrote:

 I'm a Spark and HDInsight novice, so I could be wrong...

 HDInsight is based on HDP2, so my guess here is that you have the option
 of installing/configuring Spark in cluster mode (YARN) or in standalone
 mode and package the Spark binaries with your job.

 Everything I seem to look at is related to UNIX shell scripts.  So, one
 might need to pull apart some of these scripts to pick out how to run this
 on Windows.

 Interesting project...

 Marco



 On Mon, Jul 14, 2014 at 8:00 AM, Niek Tax niek...@gmail.com wrote:

 Hi everyone,

 Currently I am working on parallelizing a machine learning algorithm
 using a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop
 MapReduce, but since my algorithm is iterative the job scheduling overhead
 and data loading overhead severely limits the performance of my algorithm
 in terms of training time.

 Since recently, HDInsight supports Hadoop 2 with YARN, which I thought
 would allow me to use run Spark jobs, which seem more fitting for my task. So
 far I have not been able however to find how I can run Apache Spark jobs on
 a HDInsight cluster.

 It seems like remote job submission (which would have my preference) is
 not possible for Spark on HDInsight, as REST endpoints for Oozie and
 templeton do not seem to support submission of Spark jobs. I also tried to
 RDP to the headnode for job submission from the headnode. On the headnode
 drives I can find other new YARN computation models like Tez and I also
 managed to run Tez jobs on it through YARN. However, Spark seems to be
 missing. Does this mean that HDInsight currently does not support Spark,
 even though it supports Hadoop versions with YARN? Or do I need to install
 Spark on the HDInsight cluster first, in some way? Or is there maybe
 something else that I'm missing and can I run Spark jobs on HDInsight some
 other way?

 Many thanks in advance!


 Kind regards,

 Niek Tax





Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
Can you provide links to the sections that are confusing?

My understanding is that the HDP1 binaries do not need YARN, while the HDP2
binaries do.

Now, you can also install the Hortonworks Spark RPM...

For production, in my opinion, RPMs are better for manageability. 

 On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Hello, thanks for your message... I'm confused: Hortonworks suggests installing
 the Spark RPM on each node, but the Spark main page says that YARN is enough
 and I don't need to install it... What's the difference?
 
 sent from my HTC
 
 On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:
 Konstantin,
 
 HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
 from
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 
 Let me know if you see issues with the tech preview.
 
 spark PI example on HDP 2.0
 
 I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html
 (for HDP2).
 Then I ran the example from the Spark web site:
 ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
 --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
 
 I got error:
 Application application_1404470405736_0044 failed 3 times due to AM
 Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
 due to: Exception from container-launch:
 org.apache.hadoop.util.Shell$ExitCodeException:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 .Failing this attempt.. Failing the application.
 
 Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 1,
 --num-executors, 3)
 Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
 Options:
   --jar JAR_PATH   Path to your application's JAR file (required)
   --class CLASS_NAME   Name of your application's main class (required)
 ...bla-bla-bla
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
That is confusing based on the context you provided. 

This might take more time than I can spare to try to understand. 

For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. 

Cloudera's CDH 5 express VM includes Spark, but the service isn't running by 
default. 

I can't remember for MapR...

Marco

 On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Marco,
 
 Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can 
 try
 from
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf  
 HDP 2.1 means YARN, yet at the same time they propose to install an RPM.
 
 On the other hand, http://spark.apache.org/ says:
 Integrated with Hadoop
 Spark can run on Hadoop 2's YARN cluster manager, and can read any existing 
 Hadoop data.
 
 If you have a Hadoop 2 cluster, you can run Spark without any installation 
 needed. 
 
 And this is confusing for me... do I need an RPM installation or not?...
 
 
 Thank you,
 Konstantin Kudryavtsev
 
 
 On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote:
 Can you provide links to the sections that are confusing?
 
 My understanding, the HDP1 binaries do not need YARN, while the HDP2 
 binaries do. 
 
 Now, you can also install Hortonworks Spark RPM...
 
 For production, in my opinion, RPMs are better for manageability. 
 
 On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Hello, thanks for your message... I'm confused, Hortonworhs suggest install 
 spark rpm on each node, but on Spark main page said that yarn enough and I 
 don't need to install it... What the difference?
 
 sent from my HTC
 
 On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:
 Konstantin,
 
 HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
 from
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 
 Let me know if you see issues with the tech preview.
 
 spark PI example on HDP 2.0
 
 I downloaded spark 1.0 pre-build from 
 http://spark.apache.org/downloads.html
 (for HDP2)
 The run example from spark web-site:
 ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
 --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
 
 I got error:
 Application application_1404470405736_0044 failed 3 times due to AM
 Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
 due to: Exception from container-launch:
 org.apache.hadoop.util.Shell$ExitCodeException:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 .Failing this attempt.. Failing the application.
 
 Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 
 1,
 --num-executors, 3)
 Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
 Options:
   --jar JAR_PATH   Path to your application's JAR file (required)
   --class CLASS_NAME   Name of your application's main class (required)
 ...bla-bla-bla
 
 
 
 
 


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Marco Shaw
They are recorded...  For example, 2013: http://spark-summit.org/2013

I'm assuming the 2014 videos will be up in 1-2 weeks.

Marco


On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 Are these sessions recorded ?


 On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:







 General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
 Track A: http://www.ustream.tv/channel/track-a1
 Track B: http://www.ustream.tv/channel/track-b1
 Track C: http://www.ustream.tv/channel/track-c1


 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com
 wrote:

 I attended yesterday on ustream.tv, but can't find the links to today's
 streams anywhere. help!

 --
 Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)






Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Dean: Some interesting information... Do you know where I can read more about 
these coming changes to Scalding/Cascading?

 On Jun 27, 2014, at 9:40 AM, Dean Wampler deanwamp...@gmail.com wrote:
 
 ... and to be clear on the point, Summingbird is not limited to MapReduce. It 
 abstracts over Scalding (which abstracts over Cascading, which is being moved 
 from MR to Spark) and over Storm for event processing.
 
 
 On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen so...@cloudera.com wrote:
 On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia buendia...@gmail.com 
 wrote:
  Summingbird is for map/reduce. Dataflow is the third generation of google's
  map/reduce, and it generalizes map/reduce the way Spark does. See more 
  about
  this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
 
 Yes, my point was that Summingbird is similar in that it is a
 higher-level service for batch/streaming computation, not that it is
 similar for being MapReduce-based.
 
  It seems Dataflow is based on this paper:
  http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
 
 FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
 more than that but yeah that seems to be some of the 'language'. It is
 similar in that it is a distributed collection abstraction.
 
 
 
 -- 
 Dean Wampler, Ph.D.
 Typesafe
 @deanwampler
 http://typesafe.com
 http://polyglotprogramming.com


Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Sorry. Never mind...  I guess that's what Summingbird is all about. Never 
heard of it. 

 On Jun 27, 2014, at 7:10 PM, Marco Shaw marco.s...@gmail.com wrote:
 
 Dean: Some interesting information... Do you know where I can read more about 
 these coming changes to Scalding/Cascading?
 
 On Jun 27, 2014, at 9:40 AM, Dean Wampler deanwamp...@gmail.com wrote:
 
 ... and to be clear on the point, Summingbird is not limited to MapReduce. 
 It abstracts over Scalding (which abstracts over Cascading, which is being 
 moved from MR to Spark) and over Storm for event processing.
 
 
 On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen so...@cloudera.com wrote:
 On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia buendia...@gmail.com 
 wrote:
  Summingbird is for map/reduce. Dataflow is the third generation of 
  google's
  map/reduce, and it generalizes map/reduce the way Spark does. See more 
  about
  this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
 
 Yes, my point was that Summingbird is similar in that it is a
 higher-level service for batch/streaming computation, not that it is
 similar for being MapReduce-based.
 
  It seems Dataflow is based on this paper:
  http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
 
 FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
 more than that but yeah that seems to be some of the 'language'. It is
 similar in that it is a distributed collection abstraction.
 
 
 
 -- 
 Dean Wampler, Ph.D.
 Typesafe
 @deanwampler
 http://typesafe.com
 http://polyglotprogramming.com


Re: How to Run Machine Learning Examples

2014-05-22 Thread Marco Shaw
About run-example: I've tried the MapR, Hortonworks and Cloudera distributions
with their Spark packages, and none of them seem to include it.

Am I missing something?  Is this only provided with the Spark project pre-built 
binaries or from source installs?

Marco

 On May 22, 2014, at 5:04 PM, Stephen Boesch java...@gmail.com wrote:
 
 
 There is a bin/run-example.sh example-class [args]
 
 
 2014-05-22 12:48 GMT-07:00 yxzhao yxz...@ualr.edu:
 I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
 following directory on my data set. But I did not find the sample command
 line to run them. Anybody help? Thanks.
 spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
 
 
 
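
For reference, a minimal sketch (not from the original thread) of how
run-example is invoked from a pre-built Apache Spark download; the script
ships with the upstream binaries and source builds, which would explain its
absence from some vendor packages. The commands below are illustrative for the
0.9.x/1.0 era and should be checked against your version's documentation:

    # From the root of a pre-built Spark 0.9.x download, pass the full
    # example class name and a master URL:
    ./bin/run-example org.apache.spark.examples.SparkPi local[2]

    # Spark 1.0 changed the syntax to take just the example name:
    ./bin/run-example SparkPi 10

    # The MLlib classification classes the question points at live in the main
    # library; check whether your Spark version exposes runnable drivers for
    # them (either main methods on those objects or drivers under the examples
    # package), or drive them interactively from spark-shell instead.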
 


Express VMs - good idea?

2014-05-14 Thread Marco Shaw
Hi,

I've wanted to play with Spark.  I wanted to fast-track things and just use
one of the vendors' express VMs.  I've tried Cloudera CDH 5.0 and
Hortonworks HDP 2.1.

I've not written down all of my issues, but for certain, when I try to run
spark-shell it doesn't work.  Cloudera seems to crash, and both complain
when I try to use SparkContext in a simple Scala command.

So, just a basic question on whether anyone has had success getting these
express VMs to work properly with Spark *out of the box* (HDP does require
you to install Spark manually).

I know Cloudera recommends 8GB of RAM, but I've been running it with 4GB.

Could it be that 4GB is just not enough and is causing issues, or have others
had success using these Hadoop 2.x pre-built VMs with Spark 0.9.x?

Marco