That is mostly YARN overhead: at a minimum, you're starting up containers
for the AM (application master) and the executors. That still sounds pretty
slow, but the defaults aren't tuned for fast startup.
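
For what it's worth, here is a rough sketch of how one might submit the same
job on YARN with small, explicit resource requests. The class name and paths
are taken from the command quoted below; the executor count and sizes are
illustrative guesses, not tuned values, and the helper names (join_jars,
build_cmd) are made up for this example. It only prints the command (a dry
run), so nothing is launched until you run the printed line yourself.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints a spark-submit command for yarn-cluster mode.
# Resource sizes below are illustrative assumptions, not tuned values.

# Join every *.jar in a directory into the comma-separated list --jars expects.
join_jars() {
  local dir="$1" out="" f
  for f in "$dir"/*.jar; do
    [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
    out="${out:+$out,}$f"
  done
  printf '%s' "$out"
}

# Assemble the command line; arguments: <lib dir> <application jar>.
build_cmd() {
  local jars
  jars="$(join_jars "$1")"
  printf 'spark-submit --class MLlib --master yarn-cluster'
  printf ' --num-executors 2 --executor-memory 1g --executor-cores 1'
  if [ -n "$jars" ]; then
    printf ' --jars %s' "$jars"
  fi
  printf ' %s\n' "$2"
}

build_cmd /home/ec2-user/sparkApps/learning-spark/lib \
          /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar
```

Prefixing the printed command with `time` and comparing the wall-clock total
against the job durations shown in the Spark UI should give a rough idea of
how much of the 1 min 15 s is container startup rather than model fitting.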
On May 11, 2015 7:00 PM, "Su She" <suhsheka...@gmail.com> wrote:

> Got it to work on the cluster by changing the master to yarn-cluster
> instead of local! I do have a couple follow up questions...
>
> This is the example I was trying to run:
> https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
>
> 1) The example still takes about 1 min 15 seconds to run (my cluster
> has 3 m3.large nodes). This seems really long for building a model
> based off data that is about 10 lines long. Is this normal?
>
> 2) Any guesses as to why it was able to run in the cluster, but not
> locally?
>
> Thanks for the help!
>
>
> On Mon, Apr 27, 2015 at 11:48 AM, Su She <suhsheka...@gmail.com> wrote:
> > Hello Xiangrui,
> >
> > I am using this spark-submit command (as I do for all other jobs):
> >
> >
> > /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit
> > --class MLlib --master local[2] --jars $(echo
> > /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',')
> > /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar
> >
> > Thank you for the help!
> >
> > Best,
> >
> > Su
> >
> >
> > On Mon, Apr 27, 2015 at 9:58 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >> How did you run the example app? Did you use spark-submit? -Xiangrui
> >>
> >> On Thu, Apr 23, 2015 at 2:27 PM, Su She <suhsheka...@gmail.com> wrote:
> >>> Sorry, accidentally sent the last email before finishing.
> >>>
> >>> I had asked this question before, but wanted to ask again as I think
> >>> it is now related to my pom file or project setup. Really appreciate
> >>> the help!
> >>>
> >>> I have been trying on and off for the past month to run this MLlib
> >>> example:
> >>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
> >>>
> >>> I am able to build the project successfully. When I run it, it returns:
> >>>
> >>> features in spam: 8
> >>> features in ham: 7
> >>>
> >>> and then freezes. According to the UI, the description of the job is
> >>> "count at DataValidators.scala:38". This corresponds to this line in
> >>> the code:
> >>>
> >>> val model = lrLearner.run(trainingData)
> >>>
> >>> I've tried just about everything I can think of: changed numFeatures
> >>> from 1 -> 10,000, set executor memory to 1g, and set up a new cluster.
> >>> At this point I think I might have missed dependencies, as that has
> >>> usually been the problem in other Spark apps I have tried to run. This
> >>> is my pom file, which I have used for other successful Spark apps.
> >>> Please let me know if you think I need any additional dependencies,
> >>> whether there are incompatibility issues, or if there is a better
> >>> pom.xml to use. Thank you!
> >>>
> >>> Cluster information:
> >>>
> >>> Spark version: 1.2.0-SNAPSHOT (in my older cluster it is 1.2.0)
> >>> java version "1.7.0_25"
> >>> Scala version: 2.10.4
> >>> hadoop version: hadoop 2.5.0-cdh5.3.3 (older cluster was 5.3.0)
> >>>
> >>>
> >>>
> >>> <project xmlns="http://maven.apache.org/POM/4.0.0"
> >>> xmlns:xsi="http://w3.org/2001/XMLSchema-instance"
> >>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> >>> http://maven.apache.org/maven-v4_0_0.xsd">
> >>>         <groupId> edu.berkely</groupId>
> >>>         <artifactId> simple-project </artifactId>
> >>>         <modelVersion> 4.0.0</modelVersion>
> >>>         <name> Simple Project </name>
> >>>         <packaging> jar </packaging>
> >>>         <version> 1.0 </version>
> >>> <repositories>
> >>>         <repository>
> >>>         <id>cloudera</id>
> >>>         <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
> >>>         </repository>
> >>>
> >>>                 <repository>
> >>>                 <id>scala-tools.org</id>
> >>>                 <name>Scala-tools Maven2 Repository</name>
> >>>                 <url>http://scala-tools.org/repo-releases</url>
> >>>                 </repository>
> >>>
> >>> </repositories>
> >>>
> >>> <pluginRepositories>
> >>>         <pluginRepository>
> >>>                 <id>scala-tools.org</id>
> >>>                 <name>Scala-tools Maven2 Repository</name>
> >>>                 <url>http://scala-tools.org/repo-releases</url>
> >>>         </pluginRepository>
> >>> </pluginRepositories>
> >>>
> >>> <build>
> >>>         <plugins>
> >>>                 <plugin>
> >>>                         <groupId>org.scala-tools</groupId>
> >>>                         <artifactId>maven-scala-plugin</artifactId>
> >>>                         <executions>
> >>>
> >>>                                 <execution>
> >>>                                         <id>compile</id>
> >>>                                         <goals>
> >>>                                                 <goal>compile</goal>
> >>>                                         </goals>
> >>>                                         <phase>compile</phase>
> >>>                                 </execution>
> >>>                                 <execution>
> >>>                                         <id>test-compile</id>
> >>>                                         <goals>
> >>>                                                 <goal>testCompile</goal>
> >>>                                         </goals>
> >>>                                         <phase>test-compile</phase>
> >>>                                 </execution>
> >>>                 <execution>
> >>>                    <phase>process-resources</phase>
> >>>                    <goals>
> >>>                      <goal>compile</goal>
> >>>                    </goals>
> >>>                 </execution>
> >>>                         </executions>
> >>>                 </plugin>
> >>>                 <plugin>
> >>>                         <artifactId>maven-compiler-plugin</artifactId>
> >>>                         <configuration>
> >>>                                 <source>1.7</source>
> >>>                                 <target>1.7</target>
> >>>                         </configuration>
> >>>                 </plugin>
> >>>         </plugins>
> >>> </build>
> >>>
> >>>
> >>> <dependencies>
> >>>         <dependency> <!--Spark dependency -->
> >>>         <groupId> org.apache.spark</groupId>
> >>>         <artifactId>spark-core_2.10</artifactId>
> >>>         <version>1.2.0-cdh5.3.0</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.apache.hadoop</groupId>
> >>>         <artifactId>hadoop-client</artifactId>
> >>>         <version>2.5.0-mr1-cdh5.3.0</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.scala-lang</groupId>
> >>>         <artifactId>scala-library</artifactId>
> >>>         <version>2.10.4</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.scala-lang</groupId>
> >>>         <artifactId>scala-compiler</artifactId>
> >>>         <version>2.10.4</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>com.101tec</groupId>
> >>>         <artifactId>zkclient</artifactId>
> >>>         <version>0.3</version>
> >>>         </dependency>
> >>>
> >>>          <dependency>
> >>>          <groupId>com.yammer.metrics</groupId>
> >>>          <artifactId>metrics-core</artifactId>
> >>>          <version>2.2.0</version>
> >>>          </dependency>
> >>>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.apache.hadoop</groupId>
> >>>         <artifactId>hadoop-yarn-server-web-proxy</artifactId>
> >>>         <version>2.5.0</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.apache.thrift</groupId>
> >>>         <artifactId>libthrift</artifactId>
> >>>         <version>0.9.2</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>com.google.guava</groupId>
> >>>         <artifactId>guava</artifactId>
> >>>         <version>18.0</version>
> >>>         </dependency>
> >>>
> >>>          <dependency>
> >>>         <groupId>junit</groupId>
> >>>         <artifactId>junit</artifactId>
> >>>         <version>3.8.1</version>
> >>>         <scope>test</scope>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.apache.spark</groupId>
> >>>         <artifactId>spark-mllib_2.10</artifactId>
> >>>         <version>1.2.0</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.scalanlp</groupId>
> >>>         <artifactId>breeze-math_2.10</artifactId>
> >>>         <version>0.4</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>com.googlecode.netlib-java</groupId>
> >>>         <artifactId>netlib-java</artifactId>
> >>>         <version>1.0</version>
> >>>         </dependency>
> >>>
> >>>         <dependency>
> >>>         <groupId>org.jblas</groupId>
> >>>         <artifactId>jblas</artifactId>
> >>>         <version>1.2.3</version>
> >>>         </dependency>
> >>>
> >>> </dependencies>
> >>>
> >>> </project>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>