Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Thanks to Nick Chammas and Cheng Lian who pointed out two issues with the release candidate. I'll cancel this in favor of RC3. On Fri, Aug 29, 2014 at 1:33 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: +1. Validated several custom analysis pipelines on a private cluster in standalone mode. Tested new PySpark support for arbitrary Hadoop input formats, works great! -- Jeremy -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8143.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
+1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. I mention it as well because it came up with Apache Flink recently (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E) Another vendor rightly noted this could look like favoritism. They changed to remove vendor releases. On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc2 (commit 711aebb3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=711aebb329ca28046396af1e34395a0df92b5327 The release files, including signatures, digests, etc. 
can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1029/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Monday, September 01, at 03:11 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == Regressions fixed since RC1 == LZ4 compression issue: https://issues.apache.org/jira/browse/SPARK-3277 == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release. == What default changes should I be aware of? == 1. The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false.
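For anyone repeating the "checksums are OK" step from the message above, here is a minimal sketch of a digest check in Python. It assumes you have already downloaded an artifact and copied its published digest by hand; the function names are made up for this illustration, and the exact layout of the digest files in the RC directory is not something this sketch tries to parse. It does not cover the signature check, which needs GPG and the release key linked in the vote email.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large release tarballs are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published, algorithm="sha512"):
    """Compare a file against a published digest, ignoring case and whitespace."""
    # Published digests are often wrapped or grouped with spaces; strip all of it.
    normalised = "".join(published.split()).lower()
    return file_digest(path, algorithm) == normalised
```

Normalising case and whitespace is a deliberate choice here, since published digests are frequently printed in upper case and split into groups.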
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? 
This is a gray area with ASF projects. I mention it as well because it came up with Apache Flink recently (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E) Another vendor rightly noted this could look like favoritism. They changed to remove vendor releases. On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc2 (commit 711aebb3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=711aebb329ca28046396af1e34395a0df92b5327 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1029/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Monday, September 01, at 03:11 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == Regressions fixed since RC1 == LZ4 compression issue: https://issues.apache.org/jira/browse/SPARK-3277 == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release. == What default changes should I be aware of? == 1. 
The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false.
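The two default changes quoted above can be reverted with two properties. As a rough sketch, the Python below renders them as lines for a spark-defaults.conf file, assuming the usual "key value" layout of that file. The property names and values come from the vote email; the dictionary and function names are made up for this example.

```python
# Overrides taken from the vote email: they restore the pre-1.1.0 behaviour.
LEGACY_OVERRIDES = {
    "spark.io.compression.codec": "lzf",  # 1.1.0 default is snappy
    "spark.shuffle.spill": "false",       # turns off external spilling
}


def render_spark_defaults(overrides):
    """Render properties as whitespace-separated 'key value' lines."""
    lines = [f"{key} {value}" for key, value in sorted(overrides.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_spark_defaults(LEGACY_OVERRIDES), end="")
```

The same two properties could instead be set on a SparkConf in application code; the file form is shown only because it needs no Spark installation to try out.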
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
(Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. 
For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. 
I mention it as well because it came up with Apache Flink recently (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E) Another vendor rightly noted this could look like favoritism. They changed to remove vendor releases. On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc2 (commit 711aebb3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=711aebb329ca28046396af1e34395a0df92b5327 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1029/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. 
I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. 
Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. I mention it as well because it came up with Apache Flink recently (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E) Another vendor rightly noted this could look like favoritism. They changed to remove vendor releases. On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc2 (commit 711aebb3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=711aebb329ca28046396af1e34395a0df92b5327 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2/ Release artifacts are signed with the
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Yeah, we can't/won't post MapR binaries on the ASF web space for the release. However, I have been linking to them (at their request) with a clear identifier that it is an incompatible license and a 3rd party build. The only vendor specific build property we provide is compatibility with different Hadoop FileSystem clients, since unfortunately there is not a universally adopted client/server protocol. I think our goal has always been to provide a path for using ASF Spark with vendor-specific filesystems. Some vendors perform backports or enhancements... and this of course we would never want to manage in the upstream project. In terms of vendor support for this approach - In the early days Cloudera asked us to add CDH4 repository and more recently Pivotal and MapR also asked us to allow linking against their hadoop-client libraries. So we've added these based on direct requests from vendors. Given the ubiquity of the Hadoop FileSystem API, it's hard for me to imagine ruffling feathers by supporting this. But if we get feedback in that direction over time we can of course consider a different approach. - Patrick On Thu, Aug 28, 2014 at 11:30 PM, Sean Owen so...@cloudera.com wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. 
I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. 
- Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. I
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
On Fri, Aug 29, 2014 at 7:42 AM, Patrick Wendell pwend...@gmail.com wrote: In terms of vendor support for this approach - In the early days Cloudera asked us to add CDH4 repository and more recently Pivotal and MapR also asked us to allow linking against their hadoop-client libraries. So we've added these based on direct requests from vendors. Given the ubiquity of the Hadoop FileSystem API, it's hard for me to imagine ruffling feathers by supporting this. But if we get feedback in that direction over time we can of course consider a different approach. By this, you mean that it's easy to control the Hadoop version in the build and set it to some other vendor-specific release? Yes that seems ideal. Making the build flexible, and adding the repository references to pom.xml is part of enabling that -- to me, no question that's good. So you can always roll your own build for your cluster, if you need to. I understand the role of the cdh4 / mapr3 / mapr4 binaries as just a convenience. But it's a convenience for people who... - are installing Spark on a cluster (i.e. not an end user) - that doesn't have it in their distro already - whose distro isn't compatible with a plain vanilla Hadoop distro That can't be many. CDH4.6+ is most of the installed CDH base and it already has Spark. I thought MapR already had Spark built in. The audience seems small enough, and the convenience relatively small enough (is it hard to run the distribution script?) that it caused me to ask whether it was worth bothering providing these, especially given the possible ASF sensitivity. I say crack on; you get my point.
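To make "roll your own build" concrete, here is a small sketch that assembles a Maven command line with the Hadoop version overridden. It assumes the build reads a `hadoop.version` property, as the discussion of controlling the Hadoop version implies; the helper name is hypothetical, and the version string in the usage line is only an example of a vendor-style version, not a recommendation. Any profile names you pass are your own responsibility to check against pom.xml.

```python
def maven_package_command(hadoop_version, profiles=(), skip_tests=True):
    """Build the argument list for a Maven package run against one Hadoop version."""
    command = ["mvn", f"-Dhadoop.version={hadoop_version}"]
    # Optional Maven profiles, passed through unchanged as -P flags.
    command += [f"-P{profile}" for profile in profiles]
    if skip_tests:
        command.append("-DskipTests")
    command += ["clean", "package"]
    return command


if __name__ == "__main__":
    # Example vendor-style version string; substitute the one your cluster uses.
    print(" ".join(maven_package_command("2.0.0-mr1-cdh4.2.0")))
```

Returning a list rather than a single string keeps the result safe to hand to `subprocess.run` without shell quoting.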
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
FWIW we use CDH4 extensively and would very much appreciate having a prebuilt version of Spark for it. We're doing a CDH 4.4 to 4.7 upgrade across all the clusters now and have plans for a 5.x transition after that. On Aug 28, 2014 11:57 PM, Sean Owen so...@cloudera.com wrote: On Fri, Aug 29, 2014 at 7:42 AM, Patrick Wendell pwend...@gmail.com wrote: In terms of vendor support for this approach - In the early days Cloudera asked us to add CDH4 repository and more recently Pivotal and MapR also asked us to allow linking against their hadoop-client libraries. So we've added these based on direct requests from vendors. Given the ubiquity of the Hadoop FileSystem API, it's hard for me to imagine ruffling feathers by supporting this. But if we get feedback in that direction over time we can of course consider a different approach. By this, you mean that it's easy to control the Hadoop version in the build and set it to some other vendor-specific release? Yes that seems ideal. Making the build flexible, and adding the repository references to pom.xml is part of enabling that -- to me, no question that's good. So you can always roll your own build for your cluster, if you need to. I understand the role of the cdh4 / mapr3 / mapr4 binaries as just a convenience. But it's a convenience for people who... - are installing Spark on a cluster (i.e. not an end user) - that doesn't have it in their distro already - whose distro isn't compatible with a plain vanilla Hadoop distro That can't be many. CDH4.6+ is most of the installed CDH base and it already has Spark. I thought MapR already had Spark built in. The audience seems small enough, and the convenience relatively small enough (is it hard to run the distribution script?) that it caused me to ask whether it was worth bothering providing these, especially given the possible ASF sensitivity. I say crack on; you get my point.
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
i suspect there are more cdh4 than cdh5 clusters. most people plan to move to cdh5 within say 6 months. On Fri, Aug 29, 2014 at 3:57 AM, Andrew Ash and...@andrewash.com wrote: FWIW we use CDH4 extensively and would very much appreciate having a prebuilt version of Spark for it. We're doing a CDH 4.4 to 4.7 upgrade across all the clusters now and have plans for a 5.x transition after that. On Aug 28, 2014 11:57 PM, Sean Owen so...@cloudera.com wrote: On Fri, Aug 29, 2014 at 7:42 AM, Patrick Wendell pwend...@gmail.com wrote: In terms of vendor support for this approach - In the early days Cloudera asked us to add CDH4 repository and more recently Pivotal and MapR also asked us to allow linking against their hadoop-client libraries. So we've added these based on direct requests from vendors. Given the ubiquity of the Hadoop FileSystem API, it's hard for me to imagine ruffling feathers by supporting this. But if we get feedback in that direction over time we can of course consider a different approach. By this, you mean that it's easy to control the Hadoop version in the build and set it to some other vendor-specific release? Yes that seems ideal. Making the build flexible, and adding the repository references to pom.xml is part of enabling that -- to me, no question that's good. So you can always roll your own build for your cluster, if you need to. I understand the role of the cdh4 / mapr3 / mapr4 binaries as just a convenience. But it's a convenience for people who... - are installing Spark on a cluster (i.e. not an end user) - that doesn't have it in their distro already - whose distro isn't compatible with a plain vanilla Hadoop distro That can't be many. CDH4.6+ is most of the installed CDH base and it already has Spark. I thought MapR already had Spark built in. The audience seems small enough, and the convenience relatively small enough (is it hard to run the distribution script?) 
that it caused me to ask whether it was worth bothering providing these, especially given the possible ASF sensitivity. I say crack on; you get my point.
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On August 29, 2014, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... 
(oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) 
and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. I mention it as well because it came up with Apache Flink recently (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E) Another vendor rightly noted this could look like favoritism. They changed to remove vendor releases. On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc2 (commit 711aebb3):
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. 
I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. 
Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. I mention it as well because it came up with Apache Flink recently ( http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E ) Another vendor rightly noted this could look like favoritism. They changed to remove vendor releases. On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
There were several formatting and typographical errors in the SQL docs that I've fixed in this PR https://github.com/apache/spark/pull/2201. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. 
The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. 
FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. I mention it as well because it came up with Apache Flink recently
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Hey Nicholas, Thanks for this, we can merge in doc changes outside of the actual release timeline, so we'll make sure to loop those changes in before we publish the final 1.1 docs. - Patrick On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There were several formatting and typographical errors in the SQL docs that I've fixed in this PR. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. 
I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. 
- Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
[Let me know if I should be posting these comments in a different thread.] Should the default Spark version in spark-ec2 https://github.com/apache/spark/blob/e1535ad3c6f7400f2b7915ea91da9c60510557ba/ec2/spark_ec2.py#L86 be updated for this release? Nick On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nicholas, Thanks for this, we can merge in doc changes outside of the actual release timeline, so we'll make sure to loop those changes in before we publish the final 1.1 docs. - Patrick On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There were several formatting and typographical errors in the SQL docs that I've fixed in this PR. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. 
There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. 
Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I
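On the question above about the default Spark version in spark-ec2: a check of this kind could in principle be scripted as part of cutting a release. The following is only a rough sketch, not something from the thread. The regular expression, the function names, and the sample option line are all hypothetical; the real layout of spark_ec2.py would need to be checked before relying on it.

```python
import re

# Hypothetical pattern: assumes the script declares its default as
# default="X.Y.Z" somewhere after the --spark-version option name.
DEFAULT_VERSION_PATTERN = re.compile(
    r'--spark-version.*?default\s*=\s*"(?P<version>[^"]+)"', re.DOTALL
)


def find_default_version(script_text):
    """Return the default version string found in script_text, or None."""
    match = DEFAULT_VERSION_PATTERN.search(script_text)
    return match.group("version") if match else None


def check_default_version(script_text, release_version):
    """Return an error message if the default looks stale, else None."""
    found = find_default_version(script_text)
    if found is None:
        return "no default version found"
    if found != release_version:
        return "default is %s but release is %s" % (found, release_version)
    return None


if __name__ == "__main__":
    # Made-up option line, for illustration only.
    sample = 'parser.add_option("-v", "--spark-version", default="1.0.0")'
    print(check_default_version(sample, "1.1.0"))
```

A release script could run such a check against the tagged source and refuse to proceed when it returns a message.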
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Oh darn - I missed this update. GRR, unfortunately I think this means I'll need to cut a new RC. Thanks for catching this Nick. On Fri, Aug 29, 2014 at 10:18 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: [Let me know if I should be posting these comments in a different thread.] Should the default Spark version in spark-ec2 be updated for this release? Nick On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nicholas, Thanks for this, we can merge in doc changes outside of the actual release timeline, so we'll make sure to loop those changes in before we publish the final 1.1 docs. - Patrick On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There were several formatting and typographical errors in the SQL docs that I've fixed in this PR. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. 
This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. 
However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
In our internal projects we use this bit of code in the maven pom to create a properties file with build information (sorry for the messy indentation). Then we have code that reads this property file somewhere and provides that info. This should make it easier to not have to change version numbers in Scala/Java/Python code ever again. :-) Shouldn't be hard to do something like that in sbt (actually should be much easier).

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-antrun-plugin</artifactId>
      <version>1.6</version>
      <executions>
        <execution>
          <id>build-info</id>
          <phase>compile</phase>
          <goals>
            <goal>run</goal>
          </goals>
          <configuration>
            <target>
              <taskdef resource="net/sf/antcontrib/antcontrib.properties" classpathref="maven.plugin.classpath"/>
              <if>
                <not>
                  <isset property="build.hash"/>
                </not>
                <then>
                  <exec executable="git" outputproperty="build.hash">
                    <arg line="rev-parse HEAD"/>
                  </exec>
                </then>
              </if>
              <echo>buildRevision: ${build.hash}</echo>
              <echo file="${build.info}" message="version=${project.version}${line.separator}" />
              <echo file="${build.info}" append="true" message="hash=${build.hash}${line.separator}" />
              <echo file="${build.info}" append="true" />
            </target>
          </configuration>
        </execution>
      </executions>
      <dependencies>
        <dependency>
          <groupId>ant-contrib</groupId>
          <artifactId>ant-contrib</artifactId>
          <version>1.0b3</version>
          <exclusions>
            <exclusion>
              <groupId>ant</groupId>
              <artifactId>ant</artifactId>
            </exclusion>
          </exclusions>
        </dependency>
      </dependencies>
    </plugin>
    </plugins>

On Fri, Aug 29, 2014 at 11:43 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Sounds good. As an FYI, we had this problem with the 1.0.2 release https://issues.apache.org/jira/browse/SPARK-3242. Is there perhaps some kind of automated check we can make to catch this for us in the future? Where would it go? On Fri, Aug 29, 2014 at 2:18 PM, Patrick Wendell pwend...@gmail.com wrote: Oh darn - I missed this update. GRR, unfortunately I think this means I'll need to cut a new RC. Thanks for catching this Nick. 
On Fri, Aug 29, 2014 at 10:18 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: [Let me know if I should be posting these comments in a different thread.] Should the default Spark version in spark-ec2 be updated for this release? Nick On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nicholas, Thanks for this, we can merge in doc changes outside of the actual release timeline, so we'll make sure to loop those changes in before we publish the final 1.1 docs. - Patrick On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There were several formatting and typographical errors in the SQL docs that I've fixed in this PR. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com ) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. 
There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example?
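The message above mentions code that reads the generated build-info properties file but does not show it. A minimal sketch of what such a reader might look like in Python follows, assuming the file holds plain key=value lines such as the version= and hash= entries written by the echo tasks. The function names and the sample values are illustrative assumptions, not the poster's actual code.

```python
def parse_build_info(text):
    """Parse key=value lines into a dict, skipping blanks and comments."""
    info = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info


def load_build_info(path):
    """Read a build-info file from disk and parse it."""
    with open(path, "r") as handle:
        return parse_build_info(handle.read())


if __name__ == "__main__":
    # Sample content in the shape the echo tasks would produce.
    sample = "version=1.1.0\nhash=711aebb3\n"
    info = parse_build_info(sample)
    print(info["version"], info["hash"])
```

With something like this, code that needs the version can look it up at runtime instead of carrying a hard-coded string that has to be bumped for each release.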
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
Just noticed one thing: although --with-hive is deprecated in favor of -Phive, make-distribution.sh still relies on $SPARK_HIVE (which was controlled by --with-hive) to determine whether to include datanucleus jar files. This means we have to do something like SPARK_HIVE=true ./make-distribution.sh ... to enable Hive support. Otherwise datanucleus jars are not included in lib/. This issue is similar to SPARK-3234 (https://issues.apache.org/jira/browse/SPARK-3234): both SPARK_HADOOP_VERSION and SPARK_HIVE are controlled by deprecated command line options. On Fri, Aug 29, 2014 at 11:18 AM, Patrick Wendell pwend...@gmail.com wrote: Oh darn - I missed this update. GRR, unfortunately I think this means I'll need to cut a new RC. Thanks for catching this Nick. On Fri, Aug 29, 2014 at 10:18 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: [Let me know if I should be posting these comments in a different thread.] Should the default Spark version in spark-ec2 be updated for this release? Nick On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nicholas, Thanks for this, we can merge in doc changes outside of the actual release timeline, so we'll make sure to loop those changes in before we publish the final 1.1 docs. - Patrick On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There were several formatting and typographical errors in the SQL docs that I've fixed in this PR. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On August 29, 2014, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. 
It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com ) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. 
I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright,
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
+1. Validated several custom analysis pipelines on a private cluster in standalone mode. Tested new PySpark support for arbitrary Hadoop input formats, works great! -- Jeremy
[VOTE] Release Apache Spark 1.1.0 (RC2)
Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc2 (commit 711aebb3):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=711aebb329ca28046396af1e34395a0df92b5327

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1029/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc2-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Monday, September 01, at 03:11 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== Regressions fixed since RC1 ==
LZ4 compression issue: https://issues.apache.org/jira/browse/SPARK-3277

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release.

== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
-- Old behavior can be restored by switching to lzf
2. PySpark now performs external spilling during aggregations.
-- Old behavior can be restored by setting spark.shuffle.spill to false.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
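For anyone who wants to pin the pre-1.1.0 behavior for the two default changes listed in the announcement, here is a minimal sketch of how the overrides could be written out. The property names and values (spark.io.compression.codec = lzf, spark.shuffle.spill = false) come from the announcement itself; the helper name render_spark_defaults and the LEGACY_DEFAULTS dictionary are hypothetical and only for illustration. The output assumes the whitespace-separated "key value" layout of conf/spark-defaults.conf, so please check it against your own deployment before relying on it.

```python
# Hypothetical helper, not part of Spark. The two properties and their
# values are taken from the vote email; everything else is an assumption.
LEGACY_DEFAULTS = {
    "spark.io.compression.codec": "lzf",  # the 1.1.0 default is snappy
    "spark.shuffle.spill": "false",       # per the email, restores the old PySpark behavior
}


def render_spark_defaults(overrides):
    """Render properties as 'key  value' lines, one property per line.

    Keys are sorted and padded to a common width so the result is stable
    and easy to diff against an existing conf/spark-defaults.conf.
    """
    width = max(len(key) for key in overrides)
    return "\n".join(
        f"{key.ljust(width)}  {value}" for key, value in sorted(overrides.items())
    )


if __name__ == "__main__":
    # Print the lines so they can be appended to a config file by hand.
    print(render_spark_defaults(LEGACY_DEFAULTS))
```

Keeping the overrides in one place like this makes it easier to drop them again once the new defaults have been validated on a given cluster.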
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
I'll kick off the vote with a +1.
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
+1. Tested MLlib algorithms on Amazon EC2, algorithms show speed-ups between 1.5-5x compared to the 1.0.2 release.
- Original Message -
From: Patrick Wendell pwend...@gmail.com
To: dev@spark.apache.org
Sent: Thursday, August 28, 2014 8:32:11 PM
Subject: Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
I'll kick off the vote with a +1.
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
+1 make-distribution works, and I also tested simple Spark jobs with Spark on Mesos on an 8-node Mesos cluster. Tim
On Thu, Aug 28, 2014 at 8:53 PM, Burak Yavuz bya...@stanford.edu wrote: +1. Tested MLlib algorithms on Amazon EC2, algorithms show speed-ups between 1.5-5x compared to the 1.0.2 release.
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
+1. Tested Spark SQL Thrift server and CLI against a single node standalone cluster.
On Thu, Aug 28, 2014 at 9:27 PM, Timothy Chen tnac...@gmail.com wrote: +1 make-distribution works, and I also tested simple Spark jobs with Spark on Mesos on an 8-node Mesos cluster. Tim