Re: About akka used in spark

2015-06-10 Thread Akhil Das
If you look at the Maven repo, you can see it's from Typesafe only:
http://mvnrepository.com/artifact/org.spark-project.akka/akka-actor_2.10/2.3.4-spark

For sbt, you can download the sources by adding withSources() like:

libraryDependencies += "org.spark-project.akka" % "akka-actor_2.10" % "2.3.4-spark" withSources() withJavadoc()
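
In context, a minimal build.sbt might look like the sketch below (the Scala version is an assumption, and since the artifact is published to Maven Central per the mvnrepository link above, no extra resolver should be needed):

    // build.sbt -- minimal sketch; adjust scalaVersion to match your project
    scalaVersion := "2.10.4"

    libraryDependencies +=
      "org.spark-project.akka" % "akka-actor_2.10" % "2.3.4-spark" withSources() withJavadoc()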



Thanks
Best Regards

On Wed, Jun 10, 2015 at 11:25 AM, wangtao (A) wangtao...@huawei.com wrote:

  Hi guys,



 I see the group id of the akka used in Spark is “org.spark-project.akka”. What is
 the difference from the Typesafe one? What is its version? And where can we
 get the source code?



 Regards.



Re: About akka used in spark

2015-06-10 Thread Cheng Lian
We only shaded protobuf dependencies because of compatibility issues. 
The source code is not modified.


On 6/10/15 1:55 PM, wangtao (A) wrote:


Hi guys,

I see the group id of the akka used in Spark is “org.spark-project.akka”. What 
is the difference from the Typesafe one? What is its version? And 
where can we get the source code?


Regards.





Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-10 Thread Tao Wang
+1

Tested by building with Hadoop 2.7.0 and running the following tests:

WordCount in yarn-client/yarn-cluster mode works fine;
Basic SQL queries pass;
“spark.sql.autoBroadcastJoinThreshold” works fine;
Thrift Server is fine;
Running streaming with Kafka is good;
External shuffle in YARN mode is fine;
History Server can automatically clean the event log on HDFS;
Basic PySpark tests are fine;


From: Sean McNamara [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n12675...@n3.nabble.com]
Sent: June 9, 2015 23:53
To: wangtao (A)
Subject: Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

+1

tested w/ OS X + deployed one of our streaming apps onto a staging YARN cluster.

Sean

 On Jun 2, 2015, at 9:54 PM, Patrick Wendell [hidden email] wrote:

 Please vote on releasing the following candidate as Apache Spark version 
 1.4.0!

 The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 22596c534a38cfdda91aef18aa9037ab101e4251

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.0]
 https://repository.apache.org/content/repositories/orgapachespark-/
 [published as version: 1.4.0-rc4]
 https://repository.apache.org/content/repositories/orgapachespark-1112/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/

 Please vote on releasing this package as Apache Spark 1.4.0!

 The vote is open until Saturday, June 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What has changed since RC3 ==
 In addition to many smaller fixes, three blocker issues were fixed:
 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
 metadataHive get constructed too early
 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.3 workload and running on this release candidate,
 then reporting any regressions.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.4 QA period,
 so -1 votes should only occur for significant regressions from 1.3.1.
 Bugs already present in 1.3.X, minor regressions, or bugs related
 to new features will not block this release.












Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
All,
I was wondering if any of you have solved this problem:

I have pyspark (ipython mode) running on Docker, talking to
a YARN cluster (AM/executors are NOT running on Docker).

When I start pyspark in the docker container, it binds to port 49460.

Once the app is submitted to YARN, the app (AM) on the cluster side fails
with the following error message:
ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense because the AM is trying to talk to the container directly,
which it cannot; it should be talking to the Docker host instead.

Question:
How do we make the Spark AM talk to host1:port1 of the Docker host (not the
container), which would then route it to the container running pyspark on
host2:port2?

One solution I could think of: after starting the driver (say on
hostA:portA), and before submitting the app to YARN, we could reset the
driver's host/port to the host machine's IP/port. The AM could then talk to
the host machine's IP/port, which would be mapped to the container.

Thoughts?
-- 
Thanks,
Ashwin


[RESULT] [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-10 Thread Patrick Wendell
This vote passes! Thanks to everyone who voted. I will get the release
artifacts and notes up within a day or two.

+1 (23 votes):
Reynold Xin*
Patrick Wendell*
Matei Zaharia*
Andrew Or*
Timothy Chen
Calvin Jia
Burak Yavuz
Krishna Sankar
Hari Shreedharan
Ram Sriharsha*
Kousuke Saruta
Sandy Ryza
Marcelo Vanzin
Bobby Chowdary
Mark Hamstra
Guoqiang Li
Joseph Bradley
Sean McNamara
Tathagata Das*
Ajay Singal
Wang, Daoyuan
Denny Lee
Forest Fang

0:

-1:

* Binding

On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.4.0!

 The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 22596c534a38cfdda91aef18aa9037ab101e4251

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.0]
 https://repository.apache.org/content/repositories/orgapachespark-/
 [published as version: 1.4.0-rc4]
 https://repository.apache.org/content/repositories/orgapachespark-1112/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/

 Please vote on releasing this package as Apache Spark 1.4.0!

 The vote is open until Saturday, June 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What has changed since RC3 ==
 In addition to many smaller fixes, three blocker issues were fixed:
 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
 metadataHive get constructed too early
 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.3 workload and running on this release candidate,
 then reporting any regressions.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.4 QA period,
 so -1 votes should only occur for significant regressions from 1.3.1.
 Bugs already present in 1.3.X, minor regressions, or bugs related
 to new features will not block this release.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-10 Thread Reynold Xin
This email is good. Just one note -- a lot of people are swamped right
before Spark Summit, so you might not get prompt responses this week.


On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret gr...@celtra.com wrote:

 I have some time to work on it now. What's a good way to continue the
 discussions before coding it?

 This e-mail list, JIRA or something else?

 On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin r...@databricks.com wrote:

 I think those are great to have. I would put them in the DataFrame API
 though, since this is applying to structured data. Many of the advanced
 functions on the PairRDDFunctions should really go into the DataFrame API
 now we have it.

 One thing that would be great to understand is what state-of-the-art
 alternatives are out there. I did a quick Google Scholar search using the
 keyword "approximate quantile" and found some older papers. Just the
 first few I found:

 http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by Bell Labs


 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
  by Bruce Lindsay, IBM

 http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf





 On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret gr...@celtra.com wrote:

 Hi!

 I'd like to get community's opinion on implementing a generic quantile
 approximation algorithm for Spark that is O(n) and requires limited memory.
 I would find it useful and I haven't found any existing implementation. The
 plan was basically to wrap t-digest
 https://github.com/tdunning/t-digest, implement the
 serialization/deserialization boilerplate and provide

 def cdf(x: Double): Double
 def quantile(q: Double): Double


 on RDD[Double] and RDD[(K, Double)].
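
 A minimal sketch of what that wrapper might look like (this assumes the
 t-digest library's TDigest.createDigest / add / cdf / quantile API, assumes two
 digests can be merged with add(otherDigest), and glosses over the
 serialization boilerplate mentioned above):

   import org.apache.spark.rdd.RDD
   import com.tdunning.math.stats.TDigest

   // Build one digest per partition, then merge the per-partition digests.
   // The compression parameter trades accuracy for digest size.
   def digestOf(data: RDD[Double], compression: Double = 100.0): TDigest =
     data.treeAggregate(TDigest.createDigest(compression))(
       seqOp  = (d, x) => { d.add(x); d },
       combOp = (a, b) => { a.add(b); a }
     )

   // Usage sketch:
   //   val d = digestOf(rdd)
   //   d.quantile(0.95)  // approximate 95th percentile
   //   d.cdf(0.0)        // approximate fraction of values <= 0.0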

 Let me know what you think. Any other ideas/suggestions also welcome!

 Best,
 Grega
 --
  Grega Kešpret
 Senior Software Engineer, Analytics

 Skype: gregakespret
 celtra.com http://www.celtra.com/ | @celtramobile
 http://www.twitter.com/celtramobile






RE: Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Eron Wright
Options include:

1. use the 'spark.driver.host' and 'spark.driver.port' settings to stabilize the
   driver-side endpoint (ref:
   https://spark.apache.org/docs/latest/configuration.html#networking)
2. use host networking for your container, i.e. docker run --net=host ...
3. use yarn-cluster mode (see SPARK-5162)

Hope this helps,
Eron
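
For reference, a rough sketch of what option 1 could look like in code (the IP
and port below are hypothetical placeholders, and the chosen port would also
need to be published on the Docker host, e.g. docker run -p 49460:49460 ...):

   // Sketch only: pin the driver endpoint so the YARN AM has a stable
   // address to call back to (yarn-client mode assumed).
   import org.apache.spark.{SparkConf, SparkContext}

   val conf = new SparkConf()
     .setMaster("yarn-client")
     .setAppName("docker-driver-example")
     .set("spark.driver.host", "172.16.0.10") // hypothetical Docker host IP
     .set("spark.driver.port", "49460")       // hypothetical fixed port, mapped host -> container
   val sc = new SparkContext(conf)

As Ashwin notes in his reply further down the thread, this only helps if the
container can actually bind the advertised host/port pair, which is the crux of
the problem here.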

Date: Wed, 10 Jun 2015 13:43:04 -0700
Subject: Problem with pyspark on Docker talking to YARN cluster
From: ashwinshanka...@gmail.com
To: dev@spark.apache.org; u...@spark.apache.org

All,
I was wondering if any of you have solved this problem:

I have pyspark (ipython mode) running on Docker, talking to a YARN cluster
(AM/executors are NOT running on Docker).

When I start pyspark in the docker container, it binds to port 49460.

Once the app is submitted to YARN, the app (AM) on the cluster side fails with
the following error message:
ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense because the AM is trying to talk to the container directly,
which it cannot; it should be talking to the Docker host instead.

Question: how do we make the Spark AM talk to host1:port1 of the Docker host
(not the container), which would then route it to the container running pyspark
on host2:port2?

One solution I could think of: after starting the driver (say on hostA:portA),
and before submitting the app to YARN, we could reset the driver's host/port to
the host machine's IP/port. The AM could then talk to the host machine's
IP/port, which would be mapped to the container.

Thoughts?
--
Thanks,
Ashwin



  

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-10 Thread Grega Kešpret
I have some time to work on it now. What's a good way to continue the
discussions before coding it?

This e-mail list, JIRA or something else?

On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin r...@databricks.com wrote:

 I think those are great to have. I would put them in the DataFrame API
 though, since this is applying to structured data. Many of the advanced
 functions on the PairRDDFunctions should really go into the DataFrame API
 now we have it.

 One thing that would be great to understand is what state-of-the-art
 alternatives are out there. I did a quick Google Scholar search using the
 keyword "approximate quantile" and found some older papers. Just the
 first few I found:

 http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by Bell Labs


 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
  by Bruce Lindsay, IBM

 http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf





 On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret gr...@celtra.com wrote:

 Hi!

 I'd like to get community's opinion on implementing a generic quantile
 approximation algorithm for Spark that is O(n) and requires limited memory.
 I would find it useful and I haven't found any existing implementation. The
 plan was basically to wrap t-digest
 https://github.com/tdunning/t-digest, implement the
 serialization/deserialization boilerplate and provide

 def cdf(x: Double): Double
 def quantile(q: Double): Double


 on RDD[Double] and RDD[(K, Double)].

 Let me know what you think. Any other ideas/suggestions also welcome!

 Best,
 Grega
 --
  Grega Kešpret
 Senior Software Engineer, Analytics

 Skype: gregakespret
 celtra.com http://www.celtra.com/ | @celtramobile
 http://www.twitter.com/celtramobile





Jcenter / bintray support for spark packages?

2015-06-10 Thread Hector Yee
Hi Spark devs,

Is it possible to add jcenter or bintray support for Spark packages?

I'm trying to add our artifact which is on jcenter

https://bintray.com/airbnb/aerosolve

but I noticed that Spark Packages only accepts Maven coordinates.

-- 
Yee Yang Li Hector
google.com/+HectorYee

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Jcenter / bintray support for spark packages?

2015-06-10 Thread Patrick Wendell
Hey Hector,

It's not a bad idea. I think we'd want to do this by virtue of
allowing custom repositories, so users can add bintray or others.

- Patrick

On Wed, Jun 10, 2015 at 6:23 PM, Hector Yee hector@gmail.com wrote:
 Hi Spark devs,

 Is it possible to add jcenter or bintray support for Spark packages?

 I'm trying to add our artifact which is on jcenter

 https://bintray.com/airbnb/aerosolve

 but I noticed that Spark Packages only accepts Maven coordinates.

 --
 Yee Yang Li Hector
 google.com/+HectorYee

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-10 Thread Ray Ortigas
Hi Grega and Reynold,

Grega, if you still want to use t-digest, I filed this PR because I thought
your t-digest suggestion was a good idea.

https://github.com/tdunning/t-digest/pull/56

If it is helpful feel free to do whatever with it.

Regards,
Ray


On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin r...@databricks.com wrote:

 This email is good. Just one note -- a lot of people are swamped right
 before Spark Summit, so you might not get prompt responses this week.


 On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret gr...@celtra.com wrote:

 I have some time to work on it now. What's a good way to continue the
 discussions before coding it?

 This e-mail list, JIRA or something else?

 On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin r...@databricks.com wrote:

 I think those are great to have. I would put them in the DataFrame API
 though, since this is applying to structured data. Many of the advanced
 functions on the PairRDDFunctions should really go into the DataFrame API
 now we have it.

 One thing that would be great to understand is what state-of-the-art
 alternatives are out there. I did a quick Google Scholar search using the
 keyword "approximate quantile" and found some older papers. Just the
 first few I found:

 http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by Bell Labs


 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
  by Bruce Lindsay, IBM

 http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf





 On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret gr...@celtra.com wrote:

 Hi!

 I'd like to get community's opinion on implementing a generic quantile
 approximation algorithm for Spark that is O(n) and requires limited memory.
 I would find it useful and I haven't found any existing implementation. The
 plan was basically to wrap t-digest
 https://github.com/tdunning/t-digest, implement the
 serialization/deserialization boilerplate and provide

 def cdf(x: Double): Double
 def quantile(q: Double): Double


 on RDD[Double] and RDD[(K, Double)].

 Let me know what you think. Any other ideas/suggestions also welcome!

 Best,
 Grega
 --
  Grega Kešpret
 Senior Software Engineer, Analytics

 Skype: gregakespret
 celtra.com http://www.celtra.com/ | @celtramobile
 http://www.twitter.com/celtramobile







Re: Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
Hi Eron, Thanks for your reply, but none of these options works for us.


1. use 'spark.driver.host' and 'spark.driver.port' setting to
stabilize the driver-side endpoint.  (ref
https://spark.apache.org/docs/latest/configuration.html#networking)

 This unfortunately won't help, since if we set spark.driver.port to
something, it's going to be used to bind on the client side, and the same value
will be passed to the AM. We need two variables: a) one to bind to on the
client side, and b) another port which is opened up on the Docker host and will
be used by the AM to talk back to the driver.

2. use host networking for your container, i.e. docker run --net=host ...

We run containers in a shared environment, and this option makes the host
network stack accessible to all containers in it, which could lead to security
issues.

3. use yarn-cluster mode

 The PySpark interactive shell (ipython) doesn't have a cluster mode. SPARK-5162
https://issues.apache.org/jira/browse/SPARK-5162 is for spark-submit of
Python apps in cluster mode.

Thanks,
Ashwin


On Wed, Jun 10, 2015 at 3:55 PM, Eron Wright ewri...@live.com wrote:

 Options include:

1. use 'spark.driver.host' and 'spark.driver.port' setting to
stabilize the driver-side endpoint.  (ref
https://spark.apache.org/docs/latest/configuration.html#networking)
2. use host networking for your container, i.e. docker run --net=host
...
3. use yarn-cluster mode (see SPARK-5162
https://issues.apache.org/jira/browse/SPARK-5162)


 Hope this helps,
 Eron


 --
 Date: Wed, 10 Jun 2015 13:43:04 -0700
 Subject: Problem with pyspark on Docker talking to YARN cluster
 From: ashwinshanka...@gmail.com
 To: dev@spark.apache.org; u...@spark.apache.org


 All,
 I was wondering if any of you have solved this problem :

 I have pyspark(ipython mode) running on docker talking to
 a yarn cluster(AM/executors are NOT running on docker).

 When I start pyspark in the docker container, it binds to port *49460.*

 Once the app is submitted to YARN, the app(AM) on the cluster side fails
 with the following error message :
 *ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460*

 This makes sense because AM is trying to talk to container directly and
 it cannot, it should be talking to the docker host instead.

 *Question* :
 How do we make Spark AM talk to host1:port1 of the docker host(not the
 container), which would then
 route it to container which is running pyspark on host2:port2 ?

 One solution I could think of is : after starting the driver(say on
 hostA:portA), and before submitting the app to yarn, we could
 reset driver's host/port to hostmachine's ip/port. So the AM can then talk
 hostmachine's ip/port, which would be mapped
 to the container.

 Thoughts ?
 --
 Thanks,
 Ashwin





-- 
Thanks,
Ashwin


Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Cheng Lian
Since the jars are already on HDFS, you can access them directly in your 
Spark application without using --jars
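
For example, a minimal sketch (the HDFS path is the hypothetical one from the
original mail): register the HDFS-hosted jar from inside the application so the
executors fetch it themselves:

   import org.apache.spark.{SparkConf, SparkContext}

   // Master URL is assumed to be supplied by spark-submit.
   val sc = new SparkContext(new SparkConf().setAppName("hdfs-deps-example"))
   // SparkContext.addJar accepts an hdfs:// URI; executors fetch it when tasks run.
   sc.addJar("hdfs://ip/1.jar")

Note that this covers the executor classpath; getting the jar onto the driver's
classpath in standalone cluster mode is the part the options quoted below are
trying to address.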


Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:


Hi spark-dev:

I cannot use an HDFS location for the “--jars” or “--files” option 
when doing a spark-submit in standalone cluster mode. For example:


Spark-submit  …   --jars hdfs://ip/1.jar  …. 
 hdfs://ip/app.jar (standalone cluster mode)


will not download 1.jar to the driver’s HTTP file server (but app.jar 
will be downloaded to the driver’s dir).


I figured out that the reason Spark does not download the jars is that when 
doing sc.addJar to the HTTP file server, the function called is Files.copy, 
which does not support a remote location.


And I think even if Spark could download the jars and add them to the HTTP file 
server, the classpath would still not be correctly set, because the classpath 
contains the remote location.


So I’m trying to make it work and have come up with two options, but 
neither of them seems elegant, and I would like to hear your advice:


Option 1:

Modify HTTPFileServer.addFileToDir so that it recognizes an “hdfs” prefix.

This is not good because I think it breaks the scope of the HTTP file server.

Option 2:

Modify DriverRunner.downloadUserJar so that it downloads all the “--jars” 
and “--files” along with the application jar.


This sounds more reasonable than option 1 for downloading files. But 
this way I need to read “spark.jars” and “spark.files” in 
downloadUserJar or DriverRunner.start and replace them with a local 
path. How can I do that?


Do you have a more elegant solution, or is there a plan to support 
this in the future?


Thanks

Dong Lei





Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-10 Thread Nick Pentreath
Looks very interesting, thanks for sharing this.

I haven't had much chance to do more than a quick glance over the code.
Quick question - are the Word2Vec and GLOVE implementations fully parallel
on Spark?

On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright ewri...@live.com wrote:


 The deeplearning4j framework provides a variety of distributed, neural
 network-based learning algorithms, including convolutional nets, deep
 auto-encoders, deep-belief nets, and recurrent nets.  We’re working on
 integration with the Spark ML pipeline, leveraging the developer API.
 This announcement is to share some code and get feedback from the Spark
 community.

 The integration code is located in the dl4j-spark-ml module
 https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml
  in
 the deeplearning4j repository.

 Major aspects of the integration work:

1. *ML algorithms.*  To bind the dl4j algorithms to the ML pipeline,
we developed a new classifier

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala
  and
a new unsupervised learning estimator

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/Unsupervised.scala.

2. *ML attributes.* We strove to interoperate well with other pipeline
components.   ML Attributes are column-level metadata enabling information
sharing between pipeline components.See here

 https://github.com/deeplearning4j/deeplearning4j/blob/4d33302dd8a792906050eda82a7d50ff77a8d957/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala#L89
  how
the classifier reads label metadata from a column provided by the new
StringIndexer

 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer
.
3. *Large binary data.*  It is challenging to work with large binary
data in Spark.   An effective approach is to leverage PrunedScan and to
carefully control partition sizes.  Here

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala
  we
explored this with a custom data source based on the new relation API.
4. *Column-based record readers.*  Here

 https://github.com/deeplearning4j/deeplearning4j/blob/b237385b56d42d24bd3c99d1eece6cb658f387f2/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala#L96
  we
explored how to construct rows from a Hadoop input split by composing a
number of column-level readers, with pruning support.
5. *UDTs*.   With Spark SQL it is possible to introduce new data
types.   We prototyped an experimental Tensor type, here

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/types/tensors.scala
.
6. *Spark Package.*   We developed a spark package to make it easy to
use the dl4j framework in spark-shell and with spark-submit.  See the
deeplearning4j/dl4j-spark-ml
https://github.com/deeplearning4j/dl4j-spark-ml repository for
useful snippets involving the sbt-spark-package plugin.
7. *Example code.*   Examples demonstrate how the standardized ML API
simplifies interoperability, such as with label preprocessing and feature
scaling.   See the deeplearning4j/dl4j-spark-ml-examples
https://github.com/deeplearning4j/dl4j-spark-ml-examples repository
for an expanding set of example pipelines.

 Hope this proves useful to the community as we transition to exciting new
 concepts in Spark SQL and Spark ML.   Meanwhile, we have Spark working
 with multiple GPUs on AWS http://deeplearning4j.org/gpu_aws.html and
 we're looking forward to optimizations that will speed neural net training
 even more.

 Eron Wright
 Contributor | deeplearning4j.org




Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2015-06-10 Thread Joseph Bradley
+1

On Sat, Jun 6, 2015 at 9:01 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey All,

 Just a request here - it would be great if people could create JIRAs
 for any and all merged pull requests. The reason is that when patches
 get reverted due to build breaks or other issues, it is very difficult
 to keep track of what is going on if there is no JIRA. Here is a list
 of 5 patches we had to revert recently that didn't include a JIRA:

 Revert "[MINOR] [BUILD] Use custom temp directory during build."
 Revert "[SQL] [TEST] [MINOR] Uses a temporary log4j.properties in
 HiveThriftServer2Test to ensure expected logging behavior"
 Revert "[BUILD] Always run SQL tests in master build."
 Revert "[MINOR] [CORE] Warn users who try to cache RDDs with
 dynamic allocation on."
 Revert "[HOT FIX] [YARN] Check whether `/lib` exists before
 listing its files"

 The cost overhead of creating a JIRA relative to other aspects of
 development is very small. If it's *really* a documentation change or
 something small, that's okay.

 But anything affecting the build, packaging, etc. These all need to
 have a JIRA to ensure that follow-up can be well communicated to all
 Spark developers.

 Hopefully this is something everyone can get behind, but opened a
 discussion here in case others feel differently.

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Dong Lei
Hi spark-dev:

I cannot use an HDFS location for the "--jars" or "--files" option when doing a 
spark-submit in standalone cluster mode. For example:
spark-submit  ...   --jars hdfs://ip/1.jar    
hdfs://ip/app.jar (standalone cluster mode)
will not download 1.jar to the driver's HTTP file server (but app.jar will be 
downloaded to the driver's dir).

I figured out that the reason Spark does not download the jars is that when doing 
sc.addJar to the HTTP file server, the function called is Files.copy, which does not 
support a remote location.
And I think even if Spark could download the jars and add them to the HTTP file server, 
the classpath would still not be correctly set, because the classpath contains the 
remote location.

So I'm trying to make it work and have come up with two options, but neither of them 
seems elegant, and I would like to hear your advice:

Option 1:
Modify HTTPFileServer.addFileToDir so that it recognizes an "hdfs" prefix.

This is not good because I think it breaks the scope of the HTTP file server.

Option 2:
Modify DriverRunner.downloadUserJar so that it downloads all the "--jars" and 
"--files" along with the application jar.

This sounds more reasonable than option 1 for downloading files. But this way I 
need to read "spark.jars" and "spark.files" in downloadUserJar or 
DriverRunner.start and replace them with a local path. How can I do that?


Do you have a more elegant solution, or is there a plan to support this in the 
future?

Thanks
Dong Lei


Re: [ml] Why all model classes are final?

2015-06-10 Thread Joseph Bradley
Hi Peter,

We've tried to be cautious about making APIs public without need, to allow
for changes needed in the future which we can't foresee now.  Marking
classes as final is part of that.  While marking things as Experimental or
DeveloperApi is a sort of warning, we've often felt that even changing
those Experimental/Developer APIs is dangerous since people can come to
rely on those APIs.

However, customization is a very valid use case, and I agree that the
classes should be opened up in the future.  I hope that, as the Pipelines
API graduates from alpha, more users will give feedback about them, and
that will give us enough confidence in the API stability to make the
classes non-final.

Joseph

On Mon, Jun 8, 2015 at 9:17 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

 Hi, previously all the models in the ml package were private to the package, so
 if I needed to customize some models, I inherited from them in the
 org.apache.spark.ml package in my project. But now new models (
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala#L46)
 are final classes. So if I need to customize one line or so, I need to
 redefine the whole class. Any reason for doing so? As a developer, I understand
 all the risks of using Developer/Alpha APIs. That's why I'm using Spark:
 because it provides building blocks that I can easily customize and
 combine for my needs.

 Thanks,
 Peter Rudenko

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org