[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660816#comment-14660816 ]

Ryan Williams commented on SPARK-1517:

That all makes sense, thanks. I guess I was imagining "we" could publish the SHA'd "snapshots" to a Maven repository other than the Apache snapshot repository, especially if the latter has rules that make this inconvenient. Understood that the binaries (and Maven artifacts) would have to be clearly branded as not official Apache releases.

If I came up with a URL that binaries could be uploaded to, what would have to change to make it happen? Likewise if I found a Maven repository that could host these artifacts?

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
> Issue Type: Improvement
> Components: Build, Project Infra
> Reporter: Patrick Wendell
> Assignee: Patrick Wendell
> Priority: Critical
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that
> would be tricky is to set up credentials so that jenkins can publish this
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box
> (since they are otherwise pretty stateless). One idea is to encrypt these
> credentials with a passphrase and post them somewhere publicly visible. Then
> the jenkins build can download the credentials provided we set a passphrase
> in an environment variable in jenkins. There may be simpler solutions as well.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
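The credential idea in the quoted issue description (encrypt with a passphrase, post the ciphertext publicly, decrypt on Jenkins using an environment variable) could look roughly like the sketch below. It is only an illustration, not what the Spark build does: the function names, file arguments, and the `CREDS_PASSPHRASE` variable are all hypothetical, and it assumes OpenSSL 1.1.1 or later for the `-pbkdf2` option.

```shell
# Sketch only: function names and CREDS_PASSPHRASE are hypothetical.
# The passphrase must be exported in the environment before either call.

# Run once on a trusted machine; the output file can be posted publicly.
encrypt_creds() {
  openssl enc -aes-256-cbc -pbkdf2 -salt -in "$1" -out "$2" -pass env:CREDS_PASSPHRASE
}

# Run on a Jenkins worker after downloading the encrypted file; the
# passphrase comes from an environment variable set in the job config.
decrypt_creds() {
  openssl enc -d -aes-256-cbc -pbkdf2 -in "$1" -out "$2" -pass env:CREDS_PASSPHRASE
}
```

With this shape, a stateless worker only needs the passphrase in its environment, which matches the goal of not placing a private key on every Jenkins box.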
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660796#comment-14660796 ]

Patrick Wendell commented on SPARK-1517:

Hey Ryan, IIRC - the Apache snapshot repository won't let us publish binaries that do not have SNAPSHOT in the version number. The reason is it expects to see timestamped snapshots so its garbage collection mechanism can work. We could look at adding sha1 hashes, before SNAPSHOT, but I think there is some chance this would break their cleanup.

In terms of posting more binaries - I can look at whether Databricks or Berkeley might be able to donate S3 resources for this, but it would have to be clearly maintained by those organizations and not branded as official Apache releases or anything like that.
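For illustration, the "sha1 hash before SNAPSHOT" idea mentioned above might produce version strings like the one this sketch builds. The `<base>-<sha>-SNAPSHOT` layout and the `snapshot_version` helper are assumptions made for the example, not an agreed convention, and, as noted in the comment, such versions might still interfere with the repository's cleanup.

```shell
# Hypothetical helper: compose a version that keeps the SNAPSHOT suffix
# but embeds an abbreviated commit SHA in front of it.
snapshot_version() {
  base=$1 sha=$2
  printf '%s-%s-SNAPSHOT\n' "$base" "$sha"
}

snapshot_version 1.5.0 901dbd0   # prints 1.5.0-901dbd0-SNAPSHOT
```

The result could then be passed as `-DnewVersion` to the `mvn versions:set` command quoted elsewhere in this thread.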
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660553#comment-14660553 ]

Ryan Williams commented on SPARK-1517:

h3. Maven snapshots

I hear your point that idiomatic Maven-snapshot workflows are not well suited to this task. Something I've been doing instead is running commands like this from within a Spark repo:

{code}
# Abbreviated SHA of HEAD
sha=$(git --no-pager log --no-walk --format="%h" HEAD)
# Set that SHA as the version in every POM, without backup POMs
mvn versions:set -DgenerateBackupPoms=false -DnewVersion="$sha"
# Build Spark and install it into the local Maven cache, skipping tests
mvn install -DskipTests
{code}

This renames the version in all POMs to the abbreviated SHA of {{HEAD}}, builds Spark, and installs the SHA-namespaced artifacts in my local Maven cache, at e.g. {{~/.m2/repository/org/apache/spark/spark-core_2.10/901dbd0}}.

Then I just put {{901dbd0}} as the version in some other project and, voila, I can link against arbitrary Spark SHAs, have many co-exist in my local Maven cache without them all being named {{1.x.y-SNAPSHOT}}, etc. [Here's an example|https://github.com/hammerlab/pageant/blob/56bff88f426dd69083424a91cc35099a2a157f10/pom.xml#L30] where I needed a patched Spark before {{1.4.1}} was released with the fix I needed.

Could any existing continuous build infrastructure be modified to run the {{mvn versions:set}} command above and publish artifacts to some Maven repository, ID'd by SHA?

h3. Binaries

It also makes sense that your ASF user account will not scale for this purpose :)

OTOH, it should be possible to store these cheaply somewhere. {{spark-1.4.1-bin-hadoop2.4.tgz}} is ~234MB and there are ~4000 SHAs from 1.2.0 to 1.5.0, so hosting every single SHA in that range would be a few TB, afaict.

Analogous to my previous question: could any existing continuous build infrastructure be modified to run the {{mvn versions:set}} command above and upload binaries somewhere that could hold more than just the last few? These binaries are apparently already being generated, and mostly deleted in ~24hrs as your ASF userdir runs out of space?
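As a rough check of the storage estimate above, the arithmetic can be worked through with the two figures from the comment (~234 MB per tarball, ~4000 SHAs). The variable names are illustrative, and decimal units (1 GB = 1000 MB) are assumed.

```shell
# Figures taken from the comment above; names are illustrative only.
tarball_mb=234
sha_count=4000

total_mb=$(( tarball_mb * sha_count ))   # 936000 MB
total_gb=$(( total_mb / 1000 ))          # 936 GB in decimal units

echo "one tarball per SHA: ${total_gb} GB"
```

That is a little under 1 TB for a single package flavor. If several flavors were kept per SHA (an assumption; the file name suggests builds per Hadoop version), the total would land in the "few TB" range mentioned above.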
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660420#comment-14660420 ]

Patrick Wendell commented on SPARK-1517:

Hey Ryan,

For the maven snapshot releases - unfortunately we are constrained by maven's own SNAPSHOT version format, which doesn't allow encoding anything other than the timestamp. It's just not supported in their SNAPSHOT mechanism. However, one thing we could see is whether we can align the timestamp with the time of the actual spark commit, rather than the time of publication of the SNAPSHOT release. I'm not sure if maven lets you provide a custom timestamp when publishing. If we had that feature, users could look at the Spark commit log and do some manual association.

For the binaries, the reason why the same commit appears multiple times is that we do the build every four hours and always publish the latest one even if it's a duplicate. However, this could be modified pretty easily to just avoid double-publishing the same commit if there hasn't been any code change. Maybe create a JIRA for this?

In terms of how many older versions are available, the scripts we use for this have a tunable retention window. Right now I'm only keeping the last 4 builds; we could probably extend it to something like 10 builds. However, at some point I'm likely to run out of space in my ASF user account. Since the binaries are quite large, I don't think it's feasible to keep all past builds, at least using ASF infrastructure. We have 3000 commits in a typical Spark release, and it's a few gigs for each binary build.
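The two changes discussed above (skip publishing when the commit has already been published, and keep only the last N builds) could be sketched as follows. This is not the actual nightly script: the helper names and the directory layout, one directory per build named `<timestamp>_<sha>` so that names sort chronologically, are assumptions made for the example.

```shell
# Assumed layout: "$root" holds one directory per build, named
# <timestamp>_<sha>. Both helpers are hypothetical.

# Exit 0 if no existing build directory ends in _<sha>, i.e. this
# commit has not been published yet; exit 1 otherwise.
needs_publish() {
  root=$1 sha=$2
  for d in "$root"/*_"$sha"; do
    [ -d "$d" ] && return 1
  done
  return 0
}

# Remove all but the newest $keep build directories.
prune_builds() {
  root=$1 keep=$2
  ls -1 "$root" | sort -r | tail -n +"$((keep + 1))" | while IFS= read -r name; do
    rm -rf "${root:?}/$name"
  done
}
```

A publishing job might call `needs_publish "$root" "$sha"` before uploading and `prune_builds "$root" 10` afterwards, where 10 is the extended retention window suggested in the comment.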
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660374#comment-14660374 ]

Ryan Williams commented on SPARK-1517:

Hey [~pwendell], thanks for continuing to push on this.

A workflow I'd like to see supported (and maybe it already is; please let me know if so) is to more easily fetch these artifacts (both [Maven snapshots|https://repository.apache.org/content/repositories/snapshots/org/apache/spark/] and [bundled release {{.tgz}} files|https://people.apache.org/~pwendell/spark-nightly/]) by their git SHAs.

For the Maven snapshots, I'd like to be able to just change the Spark version in a downstream project's POM to a git SHA and have Maven fetch the Spark JARs for that SHA (assuming it's one that has been built by the tools here); I'm fine with the (presumably necessary) step on my end of adding a Maven repository to make this work, either per-project or globally. Today, the Maven snapshots at e.g. https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/ all seem to be uniquely ID'd by timestamps that I don't know how to get useful information out of, which has precluded my using them.

On the bundled releases front, I see that the git SHA is being added to the folders at https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/:

!http://cl.ly/image/0o111a1o0U2N/Screen%20Shot%202015-08-06%20at%201.08.18%20PM.png!

but those don't seem to stick around more than a day or so? Additionally, as that screenshot shows, there are 3 copies of one SHA there right now, and only 2 SHAs total.

I rolled some of my own scripts for cloning, building, and selecting specific Spark versions locally at [ryan-williams/spark-helpers|https://github.com/ryan-williams/spark-helpers], which currently fetch release {{.tgz}} files for released Spark versions, but for arbitrary Spark SHAs there doesn't seem to be an easy way to download a pre-built Spark, so I am just cloning them and running {{mvn package}}.

Let me know if you have thoughts about exposing built artifacts for more SHAs, the workflows I've described here, etc. Thanks again!
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627473#comment-14627473 ]

Apache Spark commented on SPARK-1517:

User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/7411
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606991#comment-14606991 ]

Josh Rosen commented on SPARK-1517:

[~felixcheung], nightly doc builds are already being published at https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ and these include the R docs.
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606961#comment-14606961 ]

Felix Cheung commented on SPARK-1517:

Does this handle the SparkR doc build (which requires R)?
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265419#comment-14265419 ]

Ryan Williams commented on SPARK-1517:

Agreed that the redirect you speak of should exist / be fixed; a separate JIRA should be filed for that.

bq. Whether there should be nightly builds of the site is a different question.

My understanding is that there has been consensus at a few points that this is a good idea. The main concern you've voiced is the risk that people will land on the github README when looking for stable/release docs, and:

# find "nightly" info directly in the README content (and not understand it to be incorrect (too up-to-date) for their purposes), or
# inadvertently follow a link to published "nightly" docs.

re: 1, in my last post I suggested doing away with the pretense that the github README will directly contain Spark documentation, and replacing its current content with links to the relevant published docs, potentially *both* nightly and stable.

re: 2, as long as the README's links to "nightly" and "stable" docs sites are clearly marked, this should not be a problem. Users already must have a minimal level of understanding of what version of Spark docs they want to look at.

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
> Issue Type: Improvement
> Components: Build, Project Infra
> Reporter: Patrick Wendell
> Priority: Blocker
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that
> would be tricky is to set up credentials so that jenkins can publish this
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box
> (since they are otherwise pretty stateless). One idea is to encrypt these
> credentials with a passphrase and post them somewhere publicly visible. Then
> the jenkins build can download the credentials provided we set a passphrase
> in an environment variable in jenkins. There may be simpler solutions as well.
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265390#comment-14265390 ]

Sean Owen commented on SPARK-1517:

Recap: old URL was "building-with-maven.html", new URL is "building-spark.html" to match a rename and content change of the page itself a few months ago. There should be a redirect from the former to the latter. Until the 1.2.0 site was published, there was no building-spark.html page live on the site. So README.md had to link to building-with-maven.html, with the intent that after 1.2.0 this would just redirect to building-spark.html.

I'm not sure why, but the redirect isn't working. It redirects to http://spark.apache.org/building-spark.html . It seems like this is some default mechanism, and the redirector that the plugin is supposed to generate isn't present or something. Could somehow be my mistake but I certainly recall it worked on my local build of the site or else I never would have proposed it.

So yes, one direct hotfix is to change links to the old page to links to the new page. Only one of two links in README.md was updated. It's easy to fix the other.

The README.md that you see on github.com is always going to be from master, but people are going to encounter the page and sometimes expect it corresponds to a latest stable release. (You can always view README.md from the branch you want of course, if you know what you're doing.) Yes, for this reason I agree that it's best to make it mostly pointers to other information, and I think that was already the intent of changes that included the renaming I alluded to above. IIRC there was a desire to not strip down README.md further and leave some minimal, hopefully fairly unchanging, info there.

Whether there should be nightly builds of the site is a different question. If you linked to "nightly" instead of "latest" I suppose you'd have more of the same problem, no? People finding the github site and perhaps thinking they are seeing latest stable docs? On the other hand, it would at least be more internally consistent. On the other other hand, would you have to change the links to the stable URLs for release and then back as part of the release process? I had thought just linking to latest stable release docs was simple and fine.
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265355#comment-14265355 ]

Ryan Williams commented on SPARK-1517:

Hey [~pwendell], any updates here? The disconnect between the content of the github README and the "/latest/" published docs leading up to the 1.2.0 release continues to cast a shadow and new divergence is set to begin as we move further from having just cut a release.

As was [recently pointed out on the dev list|http://apache-spark-developers-list.1001551.n3.nabble.com/Starting-with-Spark-tp9908p9925.html], [my|https://github.com/apache/spark/commit/4ceb048b38949dd0a909d2ee6777607341c9c93a#diff-04c6e90faac2675aa89e2176d2eec7d8] and [Reynold's|https://github.com/apache/spark/commit/342b57db66e379c475daf5399baf680ff42b87c2#diff-04c6e90faac2675aa89e2176d2eec7d8] "fixes" to previously-broken links in the README became broken links when the 1.2.0 docs were cut, as [~srowen] [warned would happen|https://github.com/apache/spark/commit/342b57db66e379c475daf5399baf680ff42b87c2#commitcomment-8250912] (one is fixed [here|https://github.com/apache/spark/pull/3802/files], the other remains broken on the README today).

I still believe that the correct fix is to have the README point at docs that are published with each Jenkins build, per this JIRA and [our previous discussion about it|http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-tt9560.html#a9568].

Even better would be to publish nightly docs *and* remove any pretense that the github README is a canonical source of documentation, in favor of just linking to the /latest/ published docs. Let me know if you want me to file that as a sub-task here.