Re: Use mvn to build Spark 1.2.0 failed

2014-12-22 Thread Sean Owen
I just tried the exact same command and do not see any error. Maybe
you can make sure you're starting from a clean extraction of the
distro, and check your environment. I'm on OSX, Maven 3.2, Java 8 but
I don't know that any of those would be relevant.

On Mon, Dec 22, 2014 at 4:10 AM, wyphao.2007 wyphao.2...@163.com wrote:
 Hi all, today I downloaded the Spark source from the
 http://spark.apache.org/downloads.html page, and I used


  ./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests 
 -Dhadoop.version=2.2.0 -Phive


 to build the release, but I encountered an exception as follows:


 [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ 
 spark-parent ---
 [INFO] Source directory: /home/q/spark/spark-1.2.0/src/main/scala added.
 [INFO]
 [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-parent 
 ---
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. FAILURE [1.015s]
 [INFO] Spark Project Networking .. SKIPPED
 [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
 [INFO] Spark Project Core  SKIPPED
 [INFO] Spark Project Bagel ... SKIPPED
 [INFO] Spark Project GraphX .. SKIPPED
 [INFO] Spark Project Streaming ... SKIPPED
 [INFO] Spark Project Catalyst  SKIPPED
 [INFO] Spark Project SQL . SKIPPED
 [INFO] Spark Project ML Library .. SKIPPED
 [INFO] Spark Project Tools ... SKIPPED
 [INFO] Spark Project Hive  SKIPPED
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project YARN Parent POM . SKIPPED
 [INFO] Spark Project YARN Stable API . SKIPPED
 [INFO] Spark Project Assembly  SKIPPED
 [INFO] Spark Project External Twitter  SKIPPED
 [INFO] Spark Project External Flume Sink . SKIPPED
 [INFO] Spark Project External Flume .. SKIPPED
 [INFO] Spark Project External MQTT ... SKIPPED
 [INFO] Spark Project External ZeroMQ . SKIPPED
 [INFO] Spark Project External Kafka .. SKIPPED
 [INFO] Spark Project Examples  SKIPPED
 [INFO] Spark Project YARN Shuffle Service  SKIPPED
 [INFO] 
 
 [INFO] BUILD FAILURE
 [INFO] 
 
 [INFO] Total time: 1.644s
 [INFO] Finished at: Mon Dec 22 10:56:35 CST 2014
 [INFO] Final Memory: 21M/481M
 [INFO] 
 
 [ERROR] Failed to execute goal 
 org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
 on project spark-parent: Error finding remote resources manifests: 
 /home/q/spark/spark-1.2.0/target/maven-shared-archive-resources/META-INF/NOTICE
  (No such file or directory) - [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
 switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please 
 read the following articles:
 [ERROR] [Help 1] 
 http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException


 but the NOTICE file is in the downloaded Spark release:


 [wyp@spark  /home/q/spark/spark-1.2.0]$ ll
 total 248
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 assembly
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 bagel
 drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 bin
 drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 conf
 -rw-rw-r-- 1 1000 1000   663 Dec 10 18:02 CONTRIBUTING.md
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 core
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 data
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 dev
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 docker
 drwxrwxr-x 7 1000 1000  4096 Dec 10 18:02 docs
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 ec2
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 examples
 drwxrwxr-x 8 1000 1000  4096 Dec 10 18:02 external
 drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 extras
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 graphx
 -rw-rw-r-- 1 1000 1000 45242 Dec 10 18:02 LICENSE
 -rwxrwxr-x 1 1000 1000  7941 Dec 10 18:02 make-distribution.sh
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 mllib
 drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 network
 -rw-rw-r-- 1 1000 1000 22559 Dec 10 18:02 NOTICE
 -rw-rw-r-- 1 1000 1000 49002 Dec 10 18:02 pom.xml
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 project
 drwxrwxr-x 6 1000 1000  4096 Dec 10 18:02 python
 -rw-rw-r-- 1 1000 1000  3645 Dec 10 18:02 

Spark exception when sending message to akka actor

2014-12-22 Thread Priya Ch
Hi All,

I have Akka remote actors running on 2 nodes. I submitted a Spark application
from node1. In the Spark code, inside one of the RDD transformations, I am
sending a message to the actor running on node1. My Spark code is as follows:




class ActorClient extends Actor with Serializable
{
  import context._

  val currentActor: ActorSelection =
    context.system.actorSelection(
      "akka.tcp://ActorSystem@192.168.145.183:2551/user/MasterActor")
  implicit val timeout = Timeout(10 seconds)

  def receive =
  {
    case msg: String =>
      if (msg.contains("Spark")) {
        currentActor ! msg
        sender ! "Local"
      }
      else {
        println("Received.." + msg)
        val future = currentActor ? msg
        val result = Await.result(future, timeout.duration).asInstanceOf[String]
        if (result.contains("ACK"))
          sender ! "OK"
      }
    case PoisonPill => context.stop(self)
  }
}

object SparkExec extends Serializable
{

  implicit val timeout = Timeout(10 seconds)
  val actorSystem = ActorSystem("ClientActorSystem")
  val actor = actorSystem.actorOf(Props(classOf[ActorClient]), name = "ClientActor")

  def main(args: Array[String]) =
  {

    val conf = new SparkConf().setAppName("DeepLearningSpark")

    val sc = new SparkContext(conf)

    val textrdd = sc.textFile("hdfs://IMPETUS-DSRV02:9000/deeplearning/sample24k.csv")
    val rdd1 = textrdd.map { line =>
      println("In Map...")

      val future = actor ? "Hello..Spark"
      val result = Await.result(future, timeout.duration).asInstanceOf[String]
      if (result.contains("Local")) {
        println("Recieved in map" + result)
        //actorSystem.shutdown
      }
      (10)
    }

    val rdd2 = rdd1.map { x =>
      val future = actor ? "Done"
      val result = Await.result(future, timeout.duration).asInstanceOf[String]
      if (result.contains("OK")) {
        actorSystem.stop(remoteActor)
        actorSystem.shutdown
      }
      (2)
    }
    rdd2.saveAsTextFile("/home/padma/SparkAkkaOut")
  }

}

In my ActorClient actor, I identify the remote actor through actorSelection
and send it the message. Once the messages are sent, in *rdd2*, after
receiving the ack from the remote actor, I kill the ActorClient actor and
shut down the ActorSystem.

The above code is throwing the following exception:




14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0
(TID 1, IMPETUS-DSRV05.impetus.co.in):
java.lang.ExceptionInInitializerError:
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)
14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, IMPETUS-DSRV05.impetus.co.in): java.lang.NoClassDefFoundError:
Could not initialize class com.impetus.spark.SparkExec$
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
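
The ExceptionInInitializerError and the "Could not initialize class
com.impetus.spark.SparkExec$" message suggest that the SparkExec object's
constructor, which creates the ActorSystem and the actor, also runs on each
executor when the map closures are deserialized, and fails there. A minimal
sketch of one common workaround, creating the actor connection lazily on the
executor (once per partition) instead of in the object body, follows; the
actor path and HDFS path are taken from the snippet above, the rest is
illustrative rather than a drop-in fix:

import scala.concurrent.Await
import scala.concurrent.duration._

import akka.actor.ActorSystem
import akka.pattern.ask
import akka.util.Timeout

import org.apache.spark.{SparkConf, SparkContext}

object SparkExecLazy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DeepLearningSpark")
    val sc = new SparkContext(conf)

    val textrdd = sc.textFile("hdfs://IMPETUS-DSRV02:9000/deeplearning/sample24k.csv")

    // Build the ActorSystem on the executor, once per partition, rather than in a
    // top-level object constructor (which also runs, and here fails, on executors).
    // Assumes the application's own Akka config enables remoting, as in the original setup.
    val rdd1 = textrdd.mapPartitions { lines =>
      implicit val timeout = Timeout(10.seconds)
      val system = ActorSystem("ClientActorSystem")
      val master = system.actorSelection(
        "akka.tcp://ActorSystem@192.168.145.183:2551/user/MasterActor")

      // Force evaluation before shutting the local ActorSystem down.
      val replies = lines.map { line =>
        val future = master ? "Hello..Spark"
        Await.result(future, timeout.duration).asInstanceOf[String]
      }.toList

      system.shutdown()
      replies.iterator
    }

    rdd1.saveAsTextFile("/home/padma/SparkAkkaOut")
    sc.stop()
  }
}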


Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi,

After facing issues with the performance of some of our Spark Streaming
 jobs, we invested quite some effort figuring out the factors that affect
the performance characteristics of a Streaming job. We  defined an
empirical model that helps us reason about Streaming jobs and applied it to
tune the jobs in order to maximize throughput.

We have summarized our findings in a blog post with the intention of
collecting feedback and hoping that it is useful to other Spark Streaming
users facing similar issues.

 http://www.virdata.com/tuning-spark/

Your feedback is welcome.

With kind regards,

Gerard.
Data Processing Team Lead
Virdata.com
@maasg


Re: Tuning Spark Streaming jobs

2014-12-22 Thread Timothy Chen
Hi Gerard,

Really nice guide!

I'm particularly interested in the Mesos scheduling side, to more evenly
distribute cores across the cluster.

I wonder if you are using coarse grain mode or fine grain mode? 

I'm making changes to the Spark Mesos scheduler and I think we can propose the
best way to achieve what you mentioned.

Tim

Sent from my iPhone

 On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote:
 
 Hi,
 
 After facing issues with the performance of some of our Spark Streaming
 jobs, we invested quite some effort figuring out the factors that affect
 the performance characteristics of a Streaming job. We  defined an
 empirical model that helps us reason about Streaming jobs and applied it to
 tune the jobs in order to maximize throughput.
 
 We have summarized our findings in a blog post with the intention of
 collecting feedback and hoping that it is useful to other Spark Streaming
 users facing similar issues.
 
 http://www.virdata.com/tuning-spark/
 
 Your feedback is welcome.
 
 With kind regards,
 
 Gerard.
 Data Processing Team Lead
 Virdata.com
 @maasg

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Use mvn to build Spark 1.2.0 failed

2014-12-22 Thread Patrick Wendell
I also couldn't reproduce this issue.

On Mon, Dec 22, 2014 at 2:24 AM, Sean Owen so...@cloudera.com wrote:
 I just tried the exact same command and do not see any error. Maybe
 you can make sure you're starting from a clean extraction of the
 distro, and check your environment. I'm on OSX, Maven 3.2, Java 8 but
 I don't know that any of those would be relevant.

 On Mon, Dec 22, 2014 at 4:10 AM, wyphao.2007 wyphao.2...@163.com wrote:
  Hi all, today I downloaded the Spark source from the
  http://spark.apache.org/downloads.html page, and I used


  ./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests 
 -Dhadoop.version=2.2.0 -Phive


  to build the release, but I encountered an exception as follows:


 [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ 
 spark-parent ---
 [INFO] Source directory: /home/q/spark/spark-1.2.0/src/main/scala added.
 [INFO]
 [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
 spark-parent ---
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. FAILURE [1.015s]
 [INFO] Spark Project Networking .. SKIPPED
 [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
 [INFO] Spark Project Core  SKIPPED
 [INFO] Spark Project Bagel ... SKIPPED
 [INFO] Spark Project GraphX .. SKIPPED
 [INFO] Spark Project Streaming ... SKIPPED
 [INFO] Spark Project Catalyst  SKIPPED
 [INFO] Spark Project SQL . SKIPPED
 [INFO] Spark Project ML Library .. SKIPPED
 [INFO] Spark Project Tools ... SKIPPED
 [INFO] Spark Project Hive  SKIPPED
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project YARN Parent POM . SKIPPED
 [INFO] Spark Project YARN Stable API . SKIPPED
 [INFO] Spark Project Assembly  SKIPPED
 [INFO] Spark Project External Twitter  SKIPPED
 [INFO] Spark Project External Flume Sink . SKIPPED
 [INFO] Spark Project External Flume .. SKIPPED
 [INFO] Spark Project External MQTT ... SKIPPED
 [INFO] Spark Project External ZeroMQ . SKIPPED
 [INFO] Spark Project External Kafka .. SKIPPED
 [INFO] Spark Project Examples  SKIPPED
 [INFO] Spark Project YARN Shuffle Service  SKIPPED
 [INFO] 
 
 [INFO] BUILD FAILURE
 [INFO] 
 
 [INFO] Total time: 1.644s
 [INFO] Finished at: Mon Dec 22 10:56:35 CST 2014
 [INFO] Final Memory: 21M/481M
 [INFO] 
 
 [ERROR] Failed to execute goal 
 org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
 on project spark-parent: Error finding remote resources manifests: 
 /home/q/spark/spark-1.2.0/target/maven-shared-archive-resources/META-INF/NOTICE
  (No such file or directory) - [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
 switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please 
 read the following articles:
 [ERROR] [Help 1] 
 http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException


  but the NOTICE file is in the downloaded Spark release:


 [wyp@spark  /home/q/spark/spark-1.2.0]$ ll
 total 248
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 assembly
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 bagel
 drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 bin
 drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 conf
 -rw-rw-r-- 1 1000 1000   663 Dec 10 18:02 CONTRIBUTING.md
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 core
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 data
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 dev
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 docker
 drwxrwxr-x 7 1000 1000  4096 Dec 10 18:02 docs
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 ec2
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 examples
 drwxrwxr-x 8 1000 1000  4096 Dec 10 18:02 external
 drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 extras
 drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 graphx
 -rw-rw-r-- 1 1000 1000 45242 Dec 10 18:02 LICENSE
 -rwxrwxr-x 1 1000 1000  7941 Dec 10 18:02 make-distribution.sh
 drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 mllib
 drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 network
 -rw-rw-r-- 1 1000 1000 22559 Dec 10 18:02 NOTICE
 -rw-rw-r-- 1 1000 1000 49002 Dec 10 18:02 pom.xml
 drwxrwxr-x 4 1000 1000  4096 Dec 10 

cleaning up cache files left by SPARK-2713

2014-12-22 Thread Cody Koeninger
Is there a reason not to go ahead and move the _cache and _lock files
created by Utils.fetchFiles into the work directory, so they can be cleaned
up more easily?  I saw comments to that effect in the discussion of the PR
for 2713, but it doesn't look like it got done.

And no, I didn't just have a machine fill up the /tmp directory, why do you
ask?  :)


Re: cleaning up cache files left by SPARK-2713

2014-12-22 Thread Marcelo Vanzin
https://github.com/apache/spark/pull/3705

On Mon, Dec 22, 2014 at 10:19 AM, Cody Koeninger c...@koeninger.org wrote:
 Is there a reason not to go ahead and move the _cache and _lock files
 created by Utils.fetchFiles into the work directory, so they can be cleaned
 up more easily?  I saw comments to that effect in the discussion of the PR
 for 2713, but it doesn't look like it got done.

 And no, I didn't just have a machine fill up the /tmp directory, why do you
 ask?  :)



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: spark-yarn_2.10 1.2.0 artifacts

2014-12-22 Thread David McWhorter

Thank you, Sean, using spark-network-yarn seems to do the trick.
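
For reference, that dependency can be declared as below; a minimal sbt-style
sketch, assuming Scala 2.10 and that the published Maven coordinates are
org.apache.spark:spark-network-yarn_2.10:1.2.0 (the artifact this thread
points to, not something verified here):

// build.sbt (sketch): depend on the YARN network module published with Spark 1.2.0
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-network-yarn" % "1.2.0"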

On 12/19/2014 12:13 PM, Sean Owen wrote:

I believe spark-yarn does not exist from 1.2 onwards. Have a look at
spark-network-yarn for where some of that went, I believe.

On Fri, Dec 19, 2014 at 5:09 PM, David McWhorter mcwhor...@ccri.com wrote:

Hi all,

Thanks for your work on Spark!  I am trying to locate spark-yarn jars for
the new 1.2.0 release.  The jars for spark-core, etc., are on Maven Central,
but the spark-yarn jars are missing.

Confusingly and perhaps relatedly, I also can't seem to get the spark-yarn
artifact to install on my local computer when I run 'mvn -Pyarn -Phadoop-2.2
-Dhadoop.version=2.2.0 -DskipTests clean install'.  At the install plugin
stage, Maven reports:

[INFO] --- maven-install-plugin:2.5.1:install (default-install) @
spark-yarn_2.10 ---
[INFO] Skipping artifact installation

Any help or insights into how to use spark-yarn_2.10 1.2.0 in a maven build
would be appreciated.

David

--

David McWhorter
Software Engineer
Commonwealth Computer Research, Inc.
1422 Sachem Place, Unit #1
Charlottesville, VA 22901
mcwhor...@ccri.com | 434.299.0090x204


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



--

David McWhorter
Software Engineer
Commonwealth Computer Research, Inc.
1422 Sachem Place, Unit #1
Charlottesville, VA 22901
mcwhor...@ccri.com | 434.299.0090x204


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi Tim,

That would be awesome. We have seen some really disparate Mesos allocations
for our Spark Streaming jobs (like (7,4,1) over 3 executors for 4 Kafka
consumers instead of the ideal (3,3,3,3)).
For network-dependent consumers, achieving an even deployment would provide
reliable and reproducible streaming job execution from the performance point
of view.
We're deploying in coarse-grained mode. I'm not sure Spark Streaming would
work well in fine-grained mode given the added latency to acquire a worker.
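
(For reference, coarse-grained mode is selected with the spark.mesos.coarse
property; a minimal sketch follows, with an illustrative Mesos master URL and
core cap.)

import org.apache.spark.{SparkConf, SparkContext}

// Coarse-grained Mesos mode: Spark holds long-lived executors instead of
// launching one Mesos task per Spark task (the fine-grained default in 1.x).
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")   // illustrative master URL
  .setAppName("streaming-job")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "12")                        // cap total cores for this job

val sc = new SparkContext(conf)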

You mention that you're changing the Mesos scheduler. Is there a JIRA where
this work is taking place?

-kr, Gerard.


On Mon, Dec 22, 2014 at 6:01 PM, Timothy Chen tnac...@gmail.com wrote:

 Hi Gerard,

 Really nice guide!

 I'm particularly interested in the Mesos scheduling side, to more evenly
 distribute cores across the cluster.

 I wonder if you are using coarse grain mode or fine grain mode?

 I'm making changes to the Spark Mesos scheduler and I think we can propose
 the best way to achieve what you mentioned.

 Tim

 Sent from my iPhone

  On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote:
 
  Hi,
 
  After facing issues with the performance of some of our Spark Streaming
  jobs, we invested quite some effort figuring out the factors that affect
  the performance characteristics of a Streaming job. We  defined an
  empirical model that helps us reason about Streaming jobs and applied it
 to
  tune the jobs in order to maximize throughput.
 
  We have summarized our findings in a blog post with the intention of
  collecting feedback and hoping that it is useful to other Spark Streaming
  users facing similar issues.
 
  http://www.virdata.com/tuning-spark/
 
  Your feedback is welcome.
 
  With kind regards,
 
  Gerard.
  Data Processing Team Lead
  Virdata.com
  @maasg



Re: Data source interface for making multiple tables available for query

2014-12-22 Thread Michael Armbrust
I agree and this is something that we have discussed in the past.
Essentially I think instead of creating a RelationProvider that returns a
single table, we'll have something like an external catalog that can return
multiple base relations.
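
For context, the single-table hook in the Spark SQL 1.2 sources API is
RelationProvider; a purely illustrative sketch of the kind of multi-table
catalog interface being discussed (the ExternalCatalogProvider trait and its
method are hypothetical, not an actual Spark API) might look like:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

// Existing hook in Spark SQL 1.2: one BaseRelation per CREATE TEMPORARY TABLE.
class MySourceProvider extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = ???
}

// Hypothetical shape of the "external catalog" idea: hand Spark every table the
// source exposes, keyed by name, instead of a single relation at a time.
trait ExternalCatalogProvider {
  def listRelations(
      sqlContext: SQLContext,
      parameters: Map[String, String]): Map[String, BaseRelation]
}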

On Sun, Dec 21, 2014 at 6:43 PM, Venkata ramana gollamudi 
ramana.gollam...@huawei.com wrote:

 Hi,

 In the data source ddl.scala, CREATE TEMPORARY TABLE makes one table at a
 time available as a temp table. What about the case where multiple/all tables
 from a data source need to be available for query, just like Hive tables? I
 think we also need an interface to connect to such data sources. Please
 comment.

 Regards,
 Ramana



Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers,

I’m happy to announce Spark Packages (http://spark-packages.org), a
community package index to track the growing number of open source
packages and libraries that work with Apache Spark. Spark Packages
makes it easy for users to find, discuss, rate, and install packages
for any version of Spark, and makes it easy for developers to
contribute packages.

Spark Packages will feature integrations with various data sources,
management tools, higher level domain-specific libraries, machine
learning algorithms, code samples, and other Spark content. Thanks to
the package authors, the initial listing of packages includes
scientific computing libraries, a job execution server, a connector
for importing Avro data, tools for launching Spark on Google Compute
Engine, and many others.

I’d like to invite you to contribute and use Spark Packages and
provide feedback! As a disclaimer: Spark Packages is a community index
maintained by Databricks and (by design) will include packages outside
of the ASF Spark project. We are excited to help showcase and support
all of the great work going on in the broader Spark community!

Cheers,
Xiangrui

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



More general submitJob API

2014-12-22 Thread Alessandro Baretta
Fellow Sparkers,

I'm rather puzzled at the submitJob API. I can't quite figure out how it is
supposed to be used. Is there any more documentation about it?

Also, is there any simpler way to multiplex jobs on the cluster, such as
starting multiple computations in as many threads in the driver and reaping
all the results when they are available?

Thanks,

Alex


Re: Announcing Spark Packages

2014-12-22 Thread Andrew Ash
Hi Xiangrui,

That link is currently returning a 503 Over Quota error message.  Would you
mind pinging back out when the page is back up?

Thanks!
Andrew

On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:

 Dear Spark users and developers,

 I’m happy to announce Spark Packages (http://spark-packages.org), a
 community package index to track the growing number of open source
 packages and libraries that work with Apache Spark. Spark Packages
 makes it easy for users to find, discuss, rate, and install packages
 for any version of Spark, and makes it easy for developers to
 contribute packages.

 Spark Packages will feature integrations with various data sources,
 management tools, higher level domain-specific libraries, machine
 learning algorithms, code samples, and other Spark content. Thanks to
 the package authors, the initial listing of packages includes
 scientific computing libraries, a job execution server, a connector
 for importing Avro data, tools for launching Spark on Google Compute
 Engine, and many others.

 I’d like to invite you to contribute and use Spark Packages and
 provide feedback! As a disclaimer: Spark Packages is a community index
 maintained by Databricks and (by design) will include packages outside
 of the ASF Spark project. We are excited to help showcase and support
 all of the great work going on in the broader Spark community!

 Cheers,
 Xiangrui

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Xiangrui asked me to report that it's back and running :)

On Mon, Dec 22, 2014 at 3:21 PM, peng pc...@uowmail.edu.au wrote:
 Me 2 :)


 On 12/22/2014 06:14 PM, Andrew Ash wrote:

 Hi Xiangrui,

 That link is currently returning a 503 Over Quota error message.  Would you
 mind pinging back out when the page is back up?

 Thanks!
 Andrew

 On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:

 Dear Spark users and developers,

 I'm happy to announce Spark Packages (http://spark-packages.org), a
 community package index to track the growing number of open source
 packages and libraries that work with Apache Spark. Spark Packages
 makes it easy for users to find, discuss, rate, and install packages
 for any version of Spark, and makes it easy for developers to
 contribute packages.

 Spark Packages will feature integrations with various data sources,
 management tools, higher level domain-specific libraries, machine
 learning algorithms, code samples, and other Spark content. Thanks to
 the package authors, the initial listing of packages includes
 scientific computing libraries, a job execution server, a connector
 for importing Avro data, tools for launching Spark on Google Compute
 Engine, and many others.

 I'd like to invite you to contribute and use Spark Packages and
 provide feedback! As a disclaimer: Spark Packages is a community index
 maintained by Databricks and (by design) will include packages outside
 of the ASF Spark project. We are excited to help showcase and support
 all of the great work going on in the broader Spark community!

 Cheers,
 Xiangrui

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Announcing Spark Packages

2014-12-22 Thread Hitesh Shah
Hello Xiangrui, 

If you have not already done so, you should look at 
http://www.apache.org/foundation/marks/#domains for the policy on use of ASF 
trademarked terms in domain names. 

thanks
— Hitesh

On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:

 Dear Spark users and developers,
 
 I’m happy to announce Spark Packages (http://spark-packages.org), a
 community package index to track the growing number of open source
 packages and libraries that work with Apache Spark. Spark Packages
 makes it easy for users to find, discuss, rate, and install packages
 for any version of Spark, and makes it easy for developers to
 contribute packages.
 
 Spark Packages will feature integrations with various data sources,
 management tools, higher level domain-specific libraries, machine
 learning algorithms, code samples, and other Spark content. Thanks to
 the package authors, the initial listing of packages includes
 scientific computing libraries, a job execution server, a connector
 for importing Avro data, tools for launching Spark on Google Compute
 Engine, and many others.
 
 I’d like to invite you to contribute and use Spark Packages and
 provide feedback! As a disclaimer: Spark Packages is a community index
 maintained by Databricks and (by design) will include packages outside
 of the ASF Spark project. We are excited to help showcase and support
 all of the great work going on in the broader Spark community!
 
 Cheers,
 Xiangrui
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: More general submitJob API

2014-12-22 Thread Andrew Ash
Hi Alex,

SparkContext.submitJob() is marked as experimental -- most client programs
shouldn't be using it.  What are you looking to do?

For multiplexing jobs, one thing you can do is have multiple threads in
your client JVM each submit jobs on your SparkContext.  This is
described here in the docs:
http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

Andrew
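
A minimal sketch of that pattern, several driver-side threads each running
actions against one shared SparkContext (the app name and workloads are
illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concurrent-jobs"))

    // SparkContext is thread-safe, so each thread can submit its own action;
    // the jobs run concurrently under the configured scheduler (FIFO or fair pools).
    val threads = (1 to 3).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          val total = sc.parallelize(1 to 1000000).map(_.toLong * i).reduce(_ + _)
          println(s"job $i finished: $total")
        }
      })
    }

    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}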

On Mon, Dec 22, 2014 at 1:32 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 Fellow Sparkers,

 I'm rather puzzled at the submitJob API. I can't quite figure out how it is
 supposed to be used. Is there any more documentation about it?

 Also, is there any simpler way to multiplex jobs on the cluster, such as
 starting multiple computations in as many threads in the driver and reaping
 all the results when they are available?

 Thanks,

 Alex



Re: More general submitJob API

2014-12-22 Thread Alessandro Baretta
Andrew,

Thanks, yes, this is what I wanted: basically just to start multiple jobs
concurrently in threads.

Alex

On Mon, Dec 22, 2014 at 4:04 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Alex,

 SparkContext.submitJob() is marked as experimental -- most client programs
 shouldn't be using it.  What are you looking to do?

 For multiplexing jobs, one thing you can do is have multiple threads in
  your client JVM each submit jobs on your SparkContext.  This is
 described here in the docs:
 http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

 Andrew

 On Mon, Dec 22, 2014 at 1:32 PM, Alessandro Baretta alexbare...@gmail.com
  wrote:

 Fellow Sparkers,

 I'm rather puzzled at the submitJob API. I can't quite figure out how it
 is
 supposed to be used. Is there any more documentation about it?

 Also, is there any simpler way to multiplex jobs on the cluster, such as
 starting multiple computations in as many threads in the driver and
 reaping
 all the results when they are available?

 Thanks,

 Alex





Re: More general submitJob API

2014-12-22 Thread Patrick Wendell
A SparkContext is thread-safe, so you can just have different threads
that create their own RDDs and do actions, etc.

- Patrick

On Mon, Dec 22, 2014 at 4:15 PM, Alessandro Baretta
alexbare...@gmail.com wrote:
 Andrew,

 Thanks, yes, this is what I wanted: basically just to start multiple jobs
 concurrently in threads.

 Alex

 On Mon, Dec 22, 2014 at 4:04 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Alex,

 SparkContext.submitJob() is marked as experimental -- most client programs
 shouldn't be using it.  What are you looking to do?

 For multiplexing jobs, one thing you can do is have multiple threads in
  your client JVM each submit jobs on your SparkContext.  This is
 described here in the docs:
 http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

 Andrew

 On Mon, Dec 22, 2014 at 1:32 PM, Alessandro Baretta alexbare...@gmail.com
  wrote:

 Fellow Sparkers,

 I'm rather puzzled at the submitJob API. I can't quite figure out how it
 is
 supposed to be used. Is there any more documentation about it?

 Also, is there any simpler way to multiplex jobs on the cluster, such as
 starting multiple computations in as many threads in the driver and
 reaping
 all the results when they are available?

 Thanks,

 Alex




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Hey Nick,

I think Hitesh was just trying to be helpful and point out the policy
- not necessarily saying there was an issue. We've taken a close look
at this and I think we're in good shape here vis-a-vis this policy.

- Patrick

On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 Hitesh,

 From your link:

 You may not use ASF trademarks such as "Apache" or "ApacheFoo" or "Foo" in
 your own domain names if that use would be likely to confuse a relevant
 consumer about the source of software or services provided through your
 website, without written approval of the VP, Apache Brand Management or
 designee.

 The title on the packages website is "A community index of packages for
 Apache Spark." Furthermore, the footnote of the website reads "Spark
 Packages is a community site hosting modules that are not part of Apache
 Spark."

 I think there's nothing on there that would confuse a relevant consumer
 about the source of software. It's pretty clear that the Spark Packages
 name is well within the ASF's guidelines.

 Have I misunderstood the ASF's policy?

 Nick


 On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah hit...@apache.org wrote:

 Hello Xiangrui,

 If you have not already done so, you should look at
 http://www.apache.org/foundation/marks/#domains for the policy on use of ASF
 trademarked terms in domain names.

 thanks
 -- Hitesh

 On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:

  Dear Spark users and developers,
 
  I'm happy to announce Spark Packages (http://spark-packages.org), a
  community package index to track the growing number of open source
  packages and libraries that work with Apache Spark. Spark Packages
  makes it easy for users to find, discuss, rate, and install packages
  for any version of Spark, and makes it easy for developers to
  contribute packages.
 
  Spark Packages will feature integrations with various data sources,
  management tools, higher level domain-specific libraries, machine
  learning algorithms, code samples, and other Spark content. Thanks to
  the package authors, the initial listing of packages includes
  scientific computing libraries, a job execution server, a connector
  for importing Avro data, tools for launching Spark on Google Compute
  Engine, and many others.
 
  I'd like to invite you to contribute and use Spark Packages and
  provide feedback! As a disclaimer: Spark Packages is a community index
  maintained by Databricks and (by design) will include packages outside
  of the ASF Spark project. We are excited to help showcase and support
  all of the great work going on in the broader Spark community!
 
  Cheers,
  Xiangrui
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Announcing Spark Packages

2014-12-22 Thread Nicholas Chammas
Okie doke! (I just assumed there was an issue since the policy was brought
up.)

On Mon Dec 22 2014 at 8:33:53 PM Patrick Wendell pwend...@gmail.com wrote:

 Hey Nick,

 I think Hitesh was just trying to be helpful and point out the policy
 - not necessarily saying there was an issue. We've taken a close look
  at this and I think we're in good shape here vis-a-vis this policy.

 - Patrick

 On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Hitesh,
 
  From your link:
 
  You may not use ASF trademarks such as "Apache" or "ApacheFoo" or "Foo" in
  your own domain names if that use would be likely to confuse a relevant
  consumer about the source of software or services provided through your
  website, without written approval of the VP, Apache Brand Management or
  designee.
 
  The title on the packages website is "A community index of packages for
  Apache Spark." Furthermore, the footnote of the website reads "Spark
  Packages is a community site hosting modules that are not part of Apache
  Spark."
 
  I think there's nothing on there that would confuse a relevant consumer
  about the source of software. It's pretty clear that the Spark Packages
  name is well within the ASF's guidelines.
 
  Have I misunderstood the ASF's policy?
 
  Nick
 
 
  On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah hit...@apache.org wrote:
 
  Hello Xiangrui,
 
  If you have not already done so, you should look at
  http://www.apache.org/foundation/marks/#domains for the policy on use
 of ASF
  trademarked terms in domain names.
 
  thanks
  -- Hitesh
 
  On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:
 
   Dear Spark users and developers,
  
   I'm happy to announce Spark Packages (http://spark-packages.org), a
   community package index to track the growing number of open source
   packages and libraries that work with Apache Spark. Spark Packages
   makes it easy for users to find, discuss, rate, and install packages
   for any version of Spark, and makes it easy for developers to
   contribute packages.
  
   Spark Packages will feature integrations with various data sources,
   management tools, higher level domain-specific libraries, machine
   learning algorithms, code samples, and other Spark content. Thanks to
   the package authors, the initial listing of packages includes
   scientific computing libraries, a job execution server, a connector
   for importing Avro data, tools for launching Spark on Google Compute
   Engine, and many others.
  
   I'd like to invite you to contribute and use Spark Packages and
   provide feedback! As a disclaimer: Spark Packages is a community index
   maintained by Databricks and (by design) will include packages outside
   of the ASF Spark project. We are excited to help showcase and support
   all of the great work going on in the broader Spark community!
  
   Cheers,
   Xiangrui
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-22 Thread Nicholas Chammas
Does this include contributions made against the spark-ec2 repo
(https://github.com/mesos/spark-ec2)?

On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell pwend...@gmail.com
wrote:

 Hey All,

 Due to the very high volume of contributions, we're switching to an
 automated process for generating release credits. This process relies
 on JIRA for categorizing contributions, so it's not possible for us to
 provide credits in the case where users submit pull requests with no
 associated JIRA.

 This needed to be automated because, with more than 1000 commits per
 release, finding proper names for every commit and summarizing
 contributions was taking on the order of days of time.

 For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
 try to manually merge these into the credits, but please e-mail me
 directly if you are not credited once the release notes are posted.
 The notes should be posted within 48 hours of right now.

 We already ask that users include a JIRA for pull requests, but now it
 will be required for proper attribution. I've updated the contributing
 guide on the wiki to reflect this.

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-22 Thread Patrick Wendell
Hey Josh,

We don't explicitly track contributions to spark-ec2 in the Apache
Spark release notes. The main reason is that usually updates to
spark-ec2 include a corresponding update to spark so we get it there.
This may not always be the case though, so let me know if you think
there is something missing we should add.

- Patrick

On Mon, Dec 22, 2014 at 6:17 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 Does this include contributions made against the spark-ec2 repo?

 On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell pwend...@gmail.com
 wrote:

 Hey All,

 Due to the very high volume of contributions, we're switching to an
 automated process for generating release credits. This process relies
 on JIRA for categorizing contributions, so it's not possible for us to
 provide credits in the case where users submit pull requests with no
 associated JIRA.

 This needed to be automated because, with more than 1000 commits per
 release, finding proper names for every commit and summarizing
 contributions was taking on the order of days of time.

 For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
 try to manually merge these into the credits, but please e-mail me
 directly if you are not credited once the release notes are posted.
 The notes should be posted within 48 hours of right now.

 We already ask that users include a JIRA for pull requests, but now it
 will be required for proper attribution. I've updated the contributing
 guide on the wiki to reflect this.

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-22 Thread Patrick Wendell
s/Josh/Nick/ - sorry!

On Mon, Dec 22, 2014 at 10:52 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Josh,

 We don't explicitly track contributions to spark-ec2 in the Apache
 Spark release notes. The main reason is that usually updates to
 spark-ec2 include a corresponding update to spark so we get it there.
 This may not always be the case though, so let me know if you think
 there is something missing we should add.

 - Patrick

 On Mon, Dec 22, 2014 at 6:17 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 Does this include contributions made against the spark-ec2 repo?

 On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell pwend...@gmail.com
 wrote:

 Hey All,

 Due to the very high volume of contributions, we're switching to an
 automated process for generating release credits. This process relies
 on JIRA for categorizing contributions, so it's not possible for us to
 provide credits in the case where users submit pull requests with no
 associated JIRA.

 This needed to be automated because, with more than 1000 commits per
 release, finding proper names for every commit and summarizing
 contributions was taking on the order of days of time.

 For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
 try to manually merge these into the credits, but please e-mail me
 directly if you are not credited once the release notes are posted.
 The notes should be posted within 48 hours of right now.

 We already ask that users include a JIRA for pull requests, but now it
 will be required for proper attribution. I've updated the contributing
 guide on the wiki to reflect this.

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org