Re: About akka used in spark
If you look at the Maven repo, you can see it's from Typesafe only: http://mvnrepository.com/artifact/org.spark-project.akka/akka-actor_2.10/2.3.4-spark

For sbt, you can download the sources by adding withSources(), like: libraryDependencies += "org.spark-project.akka" % "akka-actor_2.10" % "2.3.4-spark" withSources() withJavadoc()

Thanks, Best Regards

On Wed, Jun 10, 2015 at 11:25 AM, wangtao (A) wangtao...@huawei.com wrote: Hi guys, I see the group id of Akka used in Spark is “org.spark-project.akka”. What is its difference from the Typesafe one? What is its version? And where can we get the source code? Regards.
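For reference, the dependency above written out as a small build.sbt fragment (a sketch using the coordinates from the Maven link; the postfix withSources()/withJavadoc() calls simply attach the extra artifacts):

    // build.sbt -- pull the Spark-published Akka fork, with sources and javadoc attached
    scalaVersion := "2.10.4"

    libraryDependencies +=
      "org.spark-project.akka" % "akka-actor_2.10" % "2.3.4-spark" withSources() withJavadoc()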
Re: About akka used in spark
We only shaded the protobuf dependencies because of compatibility issues. The source code is not modified.

On 6/10/15 1:55 PM, wangtao (A) wrote: Hi guys, I see the group id of Akka used in Spark is “org.spark-project.akka”. What is its difference from the Typesafe one? What is its version? And where can we get the source code? Regards.
RE: [VOTE] Release Apache Spark 1.4.0 (RC4)
+1

Tested with building with Hadoop 2.7.0 and running with tests:
WordCount in yarn-client/yarn-cluster mode works fine;
Basic SQL queries pass;
“spark.sql.autoBroadcastJoinThreshold” works fine;
Thrift Server is fine;
Running streaming with Kafka is good;
External shuffle in YARN mode is fine;
History Server can automatically clean the event log on HDFS;
Basic PySpark tests are fine.

From: Sean McNamara [via Apache Spark Developers List] [mailto:ml-node+s1001551n12675...@n3.nabble.com]
Sent: June 9, 2015, 23:53
To: wangtao (A)
Subject: Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

+1 tested w/ OS X + deployed one of our streaming apps onto a staging yarn cluster. Sean

On Jun 2, 2015, at 9:54 PM, Patrick Wendell [hidden email] wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0!

The tag to be voted on is v1.4.0-rc3 (commit 22596c5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=22596c534a38cfdda91aef18aa9037ab101e4251

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-/ [published as version: 1.4.0-rc4] https://repository.apache.org/content/repositories/orgapachespark-1112/

The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/

Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Saturday, June 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== What has changed since RC3 ==
In addition to many smaller fixes, three blocker issues were fixed:
4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early
6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release.
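One of the checks in the test list above exercises spark.sql.autoBroadcastJoinThreshold. For anyone reproducing that check, a minimal spark-shell sketch (the table names are made up; 10485760 bytes, i.e. 10 MB, is the documented default):

    // Tables smaller than the threshold are broadcast to all executors for joins;
    // set the value to -1 to disable broadcast joins entirely.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)
    val joined = sqlContext.table("small_dim").join(sqlContext.table("big_fact"), "id")
    joined.explain()  // look for BroadcastHashJoin in the physical plan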
Problem with pyspark on Docker talking to YARN cluster
All, I was wondering if any of you have solved this problem: I have pyspark (ipython mode) running on Docker talking to a YARN cluster (AM/executors are NOT running on Docker). When I start pyspark in the Docker container, it binds to port 49460. Once the app is submitted to YARN, the app (AM) on the cluster side fails with the following error message:

ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense because the AM is trying to talk to the container directly and it cannot; it should be talking to the Docker host instead.

Question: How do we make the Spark AM talk to host1:port1 of the Docker host (not the container), which would then route it to the container which is running pyspark on host2:port2?

One solution I could think of is: after starting the driver (say on hostA:portA), and before submitting the app to YARN, we could reset the driver's host/port to the host machine's ip/port. The AM can then talk to the host machine's ip/port, which would be mapped to the container.

Thoughts? -- Thanks, Ashwin
[RESULT] [VOTE] Release Apache Spark 1.4.0 (RC4)
This vote passes! Thanks to everyone who voted. I will get the release artifacts and notes up within a day or two.

+1 (23 votes):
Reynold Xin*
Patrick Wendell*
Matei Zaharia*
Andrew Or*
Timothy Chen
Calvin Jia
Burak Yavuz
Krishna Sankar
Hari Shreedharan
Ram Sriharsha*
Kousuke Saruta
Sandy Ryza
Marcelo Vanzin
Bobby Chowdary
Mark Hamstra
Guoqiang Li
Joseph Bradley
Sean McNamara
Tathagata Das*
Ajay Singal
Wang, Daoyuan
Denny Lee
Forest Fang

0:

-1:

* Binding

On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0!

The tag to be voted on is v1.4.0-rc3 (commit 22596c5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=22596c534a38cfdda91aef18aa9037ab101e4251

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-/ [published as version: 1.4.0-rc4] https://repository.apache.org/content/repositories/orgapachespark-1112/

The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/

Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Saturday, June 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== What has changed since RC3 ==
In addition to many smaller fixes, three blocker issues were fixed:
4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early
6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark
This email is good. Just one note -- a lot of people are swamped right before Spark Summit, so you might not get prompt responses this week.

On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret gr...@celtra.com wrote: I have some time to work on it now. What's a good way to continue the discussions before coding it? This e-mail list, JIRA or something else?

On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin r...@databricks.com wrote: I think those are great to have. I would put them in the DataFrame API though, since this is applying to structured data. Many of the advanced functions on the PairRDDFunctions should really go into the DataFrame API now that we have it. One thing that would be great to understand is what state-of-the-art alternatives are out there. I did a quick Google Scholar search using the keyword "approximate quantile" and found some older papers. Just the first few I found: http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by Bell Labs, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf by Bruce Lindsay, IBM, and http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf

On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret gr...@celtra.com wrote: Hi! I'd like to get the community's opinion on implementing a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I would find it useful and I haven't found any existing implementation. The plan was basically to wrap t-digest (https://github.com/tdunning/t-digest), implement the serialization/deserialization boilerplate, and provide def cdf(x: Double): Double and def quantile(q: Double): Double on RDD[Double] and RDD[(K, Double)]. Let me know what you think. Any other ideas/suggestions also welcome! Best, Grega -- Grega Kešpret, Senior Software Engineer, Analytics, Skype: gregakespret, celtra.com http://www.celtra.com/ | @celtramobile http://www.twitter.com/celtramobile
RE: Problem with pyspark on Docker talking to YARN cluster
Options include:
1. use 'spark.driver.host' and 'spark.driver.port' setting to stabilize the driver-side endpoint (ref)
2. use host networking for your container, i.e. docker run --net=host ...
3. use yarn-cluster mode (see SPARK-5162)

Hope this helps,
Eron

Date: Wed, 10 Jun 2015 13:43:04 -0700 Subject: Problem with pyspark on Docker talking to YARN cluster From: ashwinshanka...@gmail.com To: dev@spark.apache.org; u...@spark.apache.org

All, I was wondering if any of you have solved this problem: I have pyspark (ipython mode) running on Docker talking to a YARN cluster (AM/executors are NOT running on Docker). When I start pyspark in the Docker container, it binds to port 49460. Once the app is submitted to YARN, the app (AM) on the cluster side fails with the following error message:

ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense because the AM is trying to talk to the container directly and it cannot; it should be talking to the Docker host instead.

Question: How do we make the Spark AM talk to host1:port1 of the Docker host (not the container), which would then route it to the container which is running pyspark on host2:port2?

One solution I could think of is: after starting the driver (say on hostA:portA), and before submitting the app to YARN, we could reset the driver's host/port to the host machine's ip/port. The AM can then talk to the host machine's ip/port, which would be mapped to the container.

Thoughts? -- Thanks, Ashwin
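To make option 1 concrete, a minimal Scala sketch (the address and port are made-up placeholders for the Docker host's published endpoint; note the caveat in the follow-up message below: the driver also uses these values when binding inside the container, which is exactly the limitation Ashwin describes):

    // Sketch only: pin the driver endpoint that the YARN AM/executors connect back to.
    // Assumes 192.168.1.10:49460 on the Docker host is forwarded into the container.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("docker-driver-test")
      .set("spark.driver.host", "192.168.1.10")  // address the AM should dial back to
      .set("spark.driver.port", "49460")         // stable, published port
    val sc = new SparkContext(conf)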
Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark
I have some time to work on it now. What's a good way to continue the discussions before coding it? This e-mail list, JIRA or something else?

On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin r...@databricks.com wrote: I think those are great to have. I would put them in the DataFrame API though, since this is applying to structured data. Many of the advanced functions on the PairRDDFunctions should really go into the DataFrame API now that we have it. One thing that would be great to understand is what state-of-the-art alternatives are out there. I did a quick Google Scholar search using the keyword "approximate quantile" and found some older papers. Just the first few I found: http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by Bell Labs, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf by Bruce Lindsay, IBM, and http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf

On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret gr...@celtra.com wrote: Hi! I'd like to get the community's opinion on implementing a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I would find it useful and I haven't found any existing implementation. The plan was basically to wrap t-digest (https://github.com/tdunning/t-digest), implement the serialization/deserialization boilerplate, and provide def cdf(x: Double): Double and def quantile(q: Double): Double on RDD[Double] and RDD[(K, Double)]. Let me know what you think. Any other ideas/suggestions also welcome! Best, Grega -- Grega Kešpret, Senior Software Engineer, Analytics, Skype: gregakespret, celtra.com http://www.celtra.com/ | @celtramobile http://www.twitter.com/celtramobile
Jcenter / bintray support for spark packages?
Hi Spark devs, Is it possible to add jcenter or bintray support for Spark packages? I'm trying to add our artifact which is on jcenter https://bintray.com/airbnb/aerosolve but I noticed in Spark packages it only accepts Maven coordinates. -- Yee Yang Li Hector google.com/+HectorYee - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Jcenter / bintray support for spark packages?
Hey Hector, It's not a bad idea. I think we'd want to do this by virtue of allowing custom repositories, so users can add bintray or others. - Patrick

On Wed, Jun 10, 2015 at 6:23 PM, Hector Yee hector@gmail.com wrote: Hi Spark devs, Is it possible to add jcenter or bintray support for Spark packages? I'm trying to add our artifact which is on jcenter https://bintray.com/airbnb/aerosolve but I noticed in Spark packages it only accepts Maven coordinates. -- Yee Yang Li Hector google.com/+HectorYee

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
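In the meantime, a consumer-side workaround is to point your own build at the Bintray repository directly. A build.sbt sketch (the repository URL follows Bintray's usual dl.bintray.com/&lt;user&gt;/&lt;repo&gt; layout and the artifact coordinates are illustrative guesses, not the real aerosolve coordinates):

    // build.sbt -- add a Bintray-hosted Maven repository as an extra resolver
    resolvers += "bintray-airbnb" at "https://dl.bintray.com/airbnb/maven"

    // hypothetical coordinates, purely for illustration
    libraryDependencies += "com.airbnb.aerosolve" %% "core" % "0.1.0"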
Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark
Hi Grega and Reynold, Grega, if you still want to use t-digest, I filed this PR because I thought your t-digest suggestion was a good idea: https://github.com/tdunning/t-digest/pull/56 If it is helpful, feel free to do whatever with it. Regards, Ray

On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin r...@databricks.com wrote: This email is good. Just one note -- a lot of people are swamped right before Spark Summit, so you might not get prompt responses this week.

On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret gr...@celtra.com wrote: I have some time to work on it now. What's a good way to continue the discussions before coding it? This e-mail list, JIRA or something else?

On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin r...@databricks.com wrote: I think those are great to have. I would put them in the DataFrame API though, since this is applying to structured data. Many of the advanced functions on the PairRDDFunctions should really go into the DataFrame API now that we have it. One thing that would be great to understand is what state-of-the-art alternatives are out there. I did a quick Google Scholar search using the keyword "approximate quantile" and found some older papers. Just the first few I found: http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by Bell Labs, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf by Bruce Lindsay, IBM, and http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf

On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret gr...@celtra.com wrote: Hi! I'd like to get the community's opinion on implementing a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I would find it useful and I haven't found any existing implementation. The plan was basically to wrap t-digest (https://github.com/tdunning/t-digest), implement the serialization/deserialization boilerplate, and provide def cdf(x: Double): Double and def quantile(q: Double): Double on RDD[Double] and RDD[(K, Double)]. Let me know what you think. Any other ideas/suggestions also welcome! Best, Grega -- Grega Kešpret, Senior Software Engineer, Analytics, Skype: gregakespret, celtra.com http://www.celtra.com/ | @celtramobile http://www.twitter.com/celtramobile
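For context, a rough sketch of what the proposed RDD[Double] helper could look like when wrapping t-digest. This is not the code from the PR, just an illustration of the idea; it assumes the t-digest factory and byte-serialization methods (createDigest, byteSize, asBytes, fromBytes, centroids) and sidesteps the Serializable issue by shipping each partition's digest as a byte array and merging naively on the driver:

    import java.nio.ByteBuffer
    import scala.collection.JavaConverters._
    import com.tdunning.math.stats.{AVLTreeDigest, TDigest}
    import org.apache.spark.rdd.RDD

    object ApproxQuantiles {
      // Build one t-digest per partition, collect them as byte arrays, merge on the driver.
      def digestOf(data: RDD[Double], compression: Double = 100.0): TDigest = {
        val perPartition: Array[Array[Byte]] = data.mapPartitions { values =>
          val d = TDigest.createDigest(compression)
          values.foreach(v => d.add(v))
          val buf = ByteBuffer.allocate(d.byteSize())
          d.asBytes(buf)
          Iterator.single(buf.array())
        }.collect()

        val merged = TDigest.createDigest(compression)
        perPartition.foreach { bytes =>
          val d = AVLTreeDigest.fromBytes(ByteBuffer.wrap(bytes))
          d.centroids().asScala.foreach(c => merged.add(c.mean(), c.count()))
        }
        merged
      }
    }

    // Usage sketch: val d = ApproxQuantiles.digestOf(rdd); d.quantile(0.95); d.cdf(42.0)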
Re: Problem with pyspark on Docker talking to YARN cluster
Hi Eron, Thanks for your reply, but none of these options works for us.

1. use 'spark.driver.host' and 'spark.driver.port' setting to stabilize the driver-side endpoint. (ref https://spark.apache.org/docs/latest/configuration.html#networking)

This unfortunately won't help, since if we set spark.driver.port to something, it's going to be used to bind on the client side and the same will be passed to the AM. We need two variables: a) one to bind to on the client side, and b) another port which is opened up on the Docker host and will be used by the AM to talk back to the driver.

2. use host networking for your container, i.e. docker run --net=host ...

We run containers in a shared environment, and this option makes the host network stack accessible to all containers in it, which could lead to security issues.

3. use yarn-cluster mode

The pyspark interactive shell (ipython) doesn't have a cluster mode. SPARK-5162 https://issues.apache.org/jira/browse/SPARK-5162 is for spark-submit of Python apps in cluster mode.

Thanks, Ashwin

On Wed, Jun 10, 2015 at 3:55 PM, Eron Wright ewri...@live.com wrote: Options include: 1. use 'spark.driver.host' and 'spark.driver.port' setting to stabilize the driver-side endpoint. (ref https://spark.apache.org/docs/latest/configuration.html#networking) 2. use host networking for your container, i.e. docker run --net=host ... 3. use yarn-cluster mode (see SPARK-5162 https://issues.apache.org/jira/browse/SPARK-5162) Hope this helps, Eron

-- Date: Wed, 10 Jun 2015 13:43:04 -0700 Subject: Problem with pyspark on Docker talking to YARN cluster From: ashwinshanka...@gmail.com To: dev@spark.apache.org; u...@spark.apache.org

All, I was wondering if any of you have solved this problem: I have pyspark (ipython mode) running on Docker talking to a YARN cluster (AM/executors are NOT running on Docker). When I start pyspark in the Docker container, it binds to port 49460. Once the app is submitted to YARN, the app (AM) on the cluster side fails with the following error message:

ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense because the AM is trying to talk to the container directly and it cannot; it should be talking to the Docker host instead.

Question: How do we make the Spark AM talk to host1:port1 of the Docker host (not the container), which would then route it to the container which is running pyspark on host2:port2?

One solution I could think of is: after starting the driver (say on hostA:portA), and before submitting the app to YARN, we could reset the driver's host/port to the host machine's ip/port. The AM can then talk to the host machine's ip/port, which would be mapped to the container.

Thoughts? -- Thanks, Ashwin

-- Thanks, Ashwin
Re: How to support dependency jars and files on HDFS in standalone cluster mode?
Since the jars are already on HDFS, you can access them directly in your Spark application without using --jars. Cheng

On 6/11/15 11:04 AM, Dong Lei wrote: Hi spark-dev: I cannot use an HDFS location for the “--jars” or “--files” option when doing a spark-submit in standalone cluster mode. For example: Spark-submit … --jars hdfs://ip/1.jar …. hdfs://ip/app.jar (standalone cluster mode) will not download 1.jar to the driver's HTTP file server (but the app.jar will be downloaded to the driver's dir).

I figured out that the reason Spark is not downloading the jars is that when doing sc.addJar to the HTTP file server, the function called is Files.copy, which does not support a remote location. And I think even if Spark could download the jars and add them to the HTTP file server, the classpath would not be correctly set, because the classpath contains the remote location.

So I'm trying to make it work and have come up with two options, but neither of them seems to be elegant, and I want to hear your advice:

Option 1: Modify HTTPFileServer.addFileToDir, letting it recognize an “hdfs” prefix. This is not good because I think it breaks the scope of the HTTP file server.

Option 2: Modify DriverRunner.downloadUserJar, letting it download all the “--jars” and “--files” with the application jar. This sounds more reasonable than option 1 for downloading files. But this way I need to read “spark.jars” and “spark.files” in downloadUserJar or DriverRunner.start and replace them with a local path. How can I do that?

Do you have a more elegant solution, or do we have a plan to support it in the future? Thanks, Dong Lei
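To illustrate Cheng's suggestion, a small sketch (the paths are the placeholders from the original mail): the application registers its HDFS-hosted dependencies itself, so executors fetch them straight from HDFS rather than from the driver's HTTP file server. Note that this covers what tasks on the executors need; anything the driver's own main class needs at startup still has to be on the driver classpath.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hdfs-deps"))
    sc.addJar("hdfs://ip/1.jar")       // jar already on HDFS; executors download it for tasks
    sc.addFile("hdfs://ip/some.conf")  // same idea for --files-style dependencies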
Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)
Looks very interesting, thanks for sharing this. I haven't had much chance to do more than a quick glance over the code. Quick question - are the Word2Vec and GLOVE implementations fully parallel on Spark?

On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright ewri...@live.com wrote: The deeplearning4j framework provides a variety of distributed, neural network-based learning algorithms, including convolutional nets, deep auto-encoders, deep-belief nets, and recurrent nets. We're working on integration with the Spark ML pipeline, leveraging the developer API. This announcement is to share some code and get feedback from the Spark community. The integration code is located in the dl4j-spark-ml module https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml in the deeplearning4j repository.

Major aspects of the integration work:

1. *ML algorithms.* To bind the dl4j algorithms to the ML pipeline, we developed a new classifier https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala and a new unsupervised learning estimator https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/Unsupervised.scala.

2. *ML attributes.* We strove to interoperate well with other pipeline components. ML Attributes are column-level metadata enabling information sharing between pipeline components. See here https://github.com/deeplearning4j/deeplearning4j/blob/4d33302dd8a792906050eda82a7d50ff77a8d957/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala#L89 how the classifier reads label metadata from a column provided by the new StringIndexer http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer .

3. *Large binary data.* It is challenging to work with large binary data in Spark. An effective approach is to leverage PrunedScan and to carefully control partition sizes. Here https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala we explored this with a custom data source based on the new relation API.

4. *Column-based record readers.* Here https://github.com/deeplearning4j/deeplearning4j/blob/b237385b56d42d24bd3c99d1eece6cb658f387f2/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala#L96 we explored how to construct rows from a Hadoop input split by composing a number of column-level readers, with pruning support.

5. *UDTs*. With Spark SQL it is possible to introduce new data types. We prototyped an experimental Tensor type, here https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/types/tensors.scala .

6. *Spark Package.* We developed a spark package to make it easy to use the dl4j framework in spark-shell and with spark-submit. See the deeplearning4j/dl4j-spark-ml https://github.com/deeplearning4j/dl4j-spark-ml repository for useful snippets involving the sbt-spark-package plugin.

7. *Example code.* Examples demonstrate how the standardized ML API simplifies interoperability, such as with label preprocessing and feature scaling. See the deeplearning4j/dl4j-spark-ml-examples https://github.com/deeplearning4j/dl4j-spark-ml-examples repository for an expanding set of example pipelines.

Hope this proves useful to the community as we transition to exciting new concepts in Spark SQL and Spark ML. Meanwhile, we have Spark working with multiple GPUs on AWS http://deeplearning4j.org/gpu_aws.html and we're looking forward to optimizations that will speed neural net training even more.

Eron Wright
Contributor | deeplearning4j.org
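A side note on point 2 (ML attributes): the label-metadata handoff described there can be sketched with stock Spark 1.4 pipeline stages alone (no dl4j code; the column names are made up). StringIndexer attaches nominal-attribute metadata to its output column, and a downstream classifier can read the number of classes from that metadata instead of re-scanning the data:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.StringIndexer

    // Index a string label column; the output column carries nominal-attribute metadata.
    val indexer = new StringIndexer()
      .setInputCol("label")        // hypothetical string label column
      .setOutputCol("labelIndex")

    // DecisionTreeClassifier picks up the class count from the label column's metadata,
    // which is why StringIndexer comes first in the pipeline.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("labelIndex")
      .setFeaturesCol("features")  // hypothetical vector column

    val pipeline = new Pipeline().setStages(Array(indexer, dt))
    // val model = pipeline.fit(trainingDF)  // trainingDF: DataFrame with "label" and "features"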
Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA
+1

On Sat, Jun 6, 2015 at 9:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just a request here - it would be great if people could create JIRAs for any and all merged pull requests. The reason is that when patches get reverted due to build breaks or other issues, it is very difficult to keep track of what is going on if there is no JIRA. Here is a list of 5 patches we had to revert recently that didn't include a JIRA:

Revert [MINOR] [BUILD] Use custom temp directory during build.
Revert [SQL] [TEST] [MINOR] Uses a temporary log4j.properties in HiveThriftServer2Test to ensure expected logging behavior
Revert [BUILD] Always run SQL tests in master build.
Revert [MINOR] [CORE] Warn users who try to cache RDDs with dynamic allocation on.
Revert [HOT FIX] [YARN] Check whether `/lib` exists before listing its files

The cost overhead of creating a JIRA relative to other aspects of development is very small. If it's *really* a documentation change or something small, that's okay. But anything affecting the build, packaging, etc. needs to have a JIRA to ensure that follow-up can be well communicated to all Spark developers. Hopefully this is something everyone can get behind, but I opened a discussion here in case others feel differently. - Patrick

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
How to support dependency jars and files on HDFS in standalone cluster mode?
Hi spark-dev: I cannot use an HDFS location for the --jars or --files option when doing a spark-submit in standalone cluster mode. For example: Spark-submit ... --jars hdfs://ip/1.jar hdfs://ip/app.jar (standalone cluster mode) will not download 1.jar to the driver's HTTP file server (but the app.jar will be downloaded to the driver's dir).

I figured out that the reason Spark is not downloading the jars is that when doing sc.addJar to the HTTP file server, the function called is Files.copy, which does not support a remote location. And I think even if Spark could download the jars and add them to the HTTP file server, the classpath would not be correctly set, because the classpath contains the remote location.

So I'm trying to make it work and have come up with two options, but neither of them seems to be elegant, and I want to hear your advice:

Option 1: Modify HTTPFileServer.addFileToDir, letting it recognize an hdfs prefix. This is not good because I think it breaks the scope of the HTTP file server.

Option 2: Modify DriverRunner.downloadUserJar, letting it download all the --jars and --files with the application jar. This sounds more reasonable than option 1 for downloading files. But this way I need to read spark.jars and spark.files in downloadUserJar or DriverRunner.start and replace them with a local path. How can I do that?

Do you have a more elegant solution, or do we have a plan to support it in the future? Thanks, Dong Lei
Re: [ml] Why all model classes are final?
Hi Peter, We've tried to be cautious about making APIs public without need, to allow for changes needed in the future which we can't foresee now. Marking classes as final is part of that. While marking things as Experimental or DeveloperApi is a sort of warning, we've often felt that even changing those Experimental/Developer APIs is dangerous since people can come to rely on those APIs. However, customization is a very valid use case, and I agree that the classes should be opened up in the future. I hope that, as the Pipelines API graduates from alpha, more users will give feedback about them, and that will give us enough confidence in the API stability to make the classes non-final. Joseph

On Mon, Jun 8, 2015 at 9:17 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi, previously all the models in the ml package were private to the package, so if I need to customize some models I inherit them in the org.apache.spark.ml package in my project. But now new models (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala#L46) are final classes. So if I need to customize 1 line or so, I need to redefine the whole class. Any reasons to do so? As a developer, I understand all the risks of using Developer/Alpha API. That's why I'm using Spark, because it provides building blocks that I can easily customize and combine for my needs. Thanks, Peter Rudenko

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org