Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT
jenkins is now coming down.

On Thu, Aug 28, 2014 at 7:19 AM, shane knapp skn...@berkeley.edu wrote:
reminder: this is starting in 10 minutes

On Wed, Aug 27, 2014 at 4:13 PM, shane knapp skn...@berkeley.edu wrote:
tomorrow morning i will be upgrading jenkins to the latest/greatest (1.577). at 730am, i will put jenkins into a quiet period, so no new builds will be accepted. once any running builds are finished, i will be taking jenkins down for the upgrade. depending on what and how many jobs are running, i'm expecting this to take, at most, an hour. i'll send out an update tomorrow morning right before i begin, and will send out updates and an all-clear once we're up and running again.

1.577 release notes: http://jenkins-ci.org/changelog

please let me know if there are any questions/concerns. thanks in advance!

shane
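for anyone following along, the "quiet period" step above can be driven over HTTP. the sketch below is illustrative only: JENKINS_URL, JENKINS_USER and JENKINS_TOKEN are made-up placeholders, and while /quietDown and /cancelQuietDown are standard jenkins management endpoints, check your jenkins version's documentation before relying on them.

```shell
# minimal sketch of the "quiet period" step: jenkins exposes /quietDown and
# /cancelQuietDown management endpoints over HTTP. JENKINS_URL, JENKINS_USER
# and JENKINS_TOKEN below are placeholders (assumptions), not real values.
JENKINS_URL="${JENKINS_URL:-http://localhost:8080}"

quiet_down() {
  # explicit POST; builds already running finish, but no new builds start
  curl -fsS -X POST -u "$JENKINS_USER:$JENKINS_TOKEN" "$JENKINS_URL/quietDown"
}

cancel_quiet_down() {
  # lift the quiet period without restarting
  curl -fsS -X POST -u "$JENKINS_USER:$JENKINS_TOKEN" "$JENKINS_URL/cancelQuietDown"
}

# dry run only, so the sketch is safe to execute without a live jenkins:
echo "would POST to: $JENKINS_URL/quietDown"
```

the final echo is a dry run so the snippet does nothing destructive; calling quiet_down against a real master is what actually pauses new builds.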
Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT
jenkins is upgraded, but a few jobs sneaked in before i could do the plugin updates. i've put jenkins in quiet mode again, and once the spark builds finish, i'll restart jenkins to enable the plugin updates and we'll be good to go. let's all take a moment to bask in the glory of the shiny new UI! :)
Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT
this one job is blocking the jenkins restart: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19406/ i'm about to kill it so that i can get this done. i'll restart the job after jenkins is back up.
Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT
all clear: jenkins and all plugins have been updated!
Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT
no problem! also, i retriggered: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19406
it's currently: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19411

On Thu, Aug 28, 2014 at 9:46 AM, Reynold Xin r...@databricks.com wrote:
Thanks for doing this, Shane.
emergency jenkins restart, aug 29th, 730am-9am PDT -- plus a postmortem
as with all software upgrades, sometimes things don't always work as expected. a recent change to stapler[1], to verbosely report NotExportableExceptions[2], is spamming our jenkins log file with stack traces, and the log is growing rather quickly (1.2G since 9am). this has been reported to the jenkins jira[3], and a fix has been pushed and will be rolled out soon[4].

this isn't affecting any builds, and jenkins is happily humming along. in the interim, so that we don't run out of disk space, i will be redirecting the jenkins logs tomorrow morning to /dev/null for the long weekend. once a real fix has been released, i will update any packages needed and redirect the logging back to the log file. other than a short downtime, this will have no user-facing impact.

please let me know if you have any questions/concerns. thanks for your patience!

shane
the new guy :)

[1] -- https://wiki.jenkins-ci.org/display/JENKINS/Architecture
[2] -- https://github.com/stapler/stapler/commit/ed2cb8b04c1514377f3a8bfbd567f050a67c6e1c
[3] -- https://issues.jenkins-ci.org/browse/JENKINS-24458?focusedCommentId=209247
[4] -- https://github.com/stapler/stapler/commit/e2b39098ca1f61a58970b8a41a3ae79053cf30e3
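the /dev/null stopgap described above boils down to swapping the log file for a symlink. here is a runnable sketch; it stages everything in a temp directory so it is safe to try, and the real log path (something like /var/log/jenkins/jenkins.log) is an assumption that depends on how jenkins was installed.

```shell
# stopgap sketch: point a runaway log at /dev/null so it can't fill the disk.
# demoed in a temp dir so it's safe to run; on a real box LOG would be
# something like /var/log/jenkins/jenkins.log (path is an assumption).
demo_dir="$(mktemp -d)"
LOG="$demo_dir/jenkins.log"
echo "log data collected so far" > "$LOG"

mv "$LOG" "$LOG.bak"     # preserve what has been logged up to now
ln -s /dev/null "$LOG"   # every write from here on is discarded

echo "spammy stack trace" >> "$LOG"   # simulated NotExportableException spam
# "$LOG" stays empty, so the disk can no longer fill up
```

on a live master the daemon keeps its old file handle open, so a restart (hence the short downtime) is needed before writes actually go through the symlink; removing the symlink, restoring the .bak, and restarting undoes the stopgap once the upstream fix ships.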
Re: emergency jenkins restart, aug 29th, 730am-9am PDT -- plus a postmortem
reminder: this is happening right now. jenkins is currently in quiet mode, and in ~30 minutes, will be briefly going down.
Re: emergency jenkins restart, aug 29th, 730am-9am PDT -- plus a postmortem
this is done.
new jenkins plugin installed and ready for use
i have always found the 'Rebuild' plugin super useful: https://wiki.jenkins-ci.org/display/JENKINS/Rebuild+Plugin

this is now installed and enabled. enjoy!

shane
hey spark developers! intro from shane knapp, devops engineer @ AMPLab
so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :)

hi! i'm shane knapp, the new AMPLab devops engineer, and i will be spending time getting the jenkins build infrastructure up to production quality. much of this will be 'under the covers' work, like better system-level auth, backups, etc, but some will definitely be user facing: timely jenkins updates, debugging broken build infrastructure, and some plugin support.

i've been working in the bay area since 1997 at many different companies, and my last 10 years have been split between google and palantir. i'm a huge proponent of OSS, and am really happy to be able to help with the work you guys are doing!

if anyone has any requests/questions/comments, feel free to drop me a line!

shane
Re: quick jenkins restart
and we're back and building!

On Tue, Sep 2, 2014 at 5:07 PM, shane knapp skn...@berkeley.edu wrote:
since our queue is really short, i'm waiting for a couple of builds to finish and will be restarting jenkins to install/update some plugins. the github pull request builder looks like it has some fixes to reduce spammy github calls, and reduce any potential rate limiting. i'll let everyone know when it's back up... this should be super quick (~15 mins for tests to finish, ~2 mins for jenkins to restart). thanks in advance! shane
amplab jenkins is down
i am trying to get things up and running, but it looks like either the firewall gateway or jenkins server itself is down. i'll update as soon as i know more.
Re: amplab jenkins is down
looks like a power outage in soda hall. more updates as they happen.
Re: amplab jenkins is down
looks like some hardware failed, and we're swapping in a replacement. i don't have more specific information yet -- including *what* failed, as our sysadmin is super busy ATM. the root cause was an incorrect circuit being switched off during building maintenance.

on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them.

i'm very sorry about the downtime, we'll get everything up and running ASAP.
Re: amplab jenkins is down
it's a faulty power switch on the firewall, which has been swapped out. we're about to reboot and be good to go.
Re: amplab jenkins is down
AND WE'RE UP! sorry that this took so long... i'll send out a more detailed explanation of what happened soon. now, off to back up jenkins. shane
Re: amplab jenkins is down
looking

On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
It appears that our main man (https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/) is having trouble hearing new requests (https://github.com/apache/spark/pull/2277#issuecomment-54549106). Do we need some smelling salts?

On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote:
i'd ping the Jenkinsmench... the master was completely offline, so any new jobs wouldn't have reached it. any jobs that were queued when power was lost probably started up, but jobs that were running would fail.

On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Woohoo! Thanks Shane. Do you know if queued PR builds will automatically be picked up? Or do we have to ping the Jenkinmensch manually from each PR? Nick
Re: amplab jenkins is down
i'm going to restart jenkins and see if that fixes things.
Re: amplab jenkins is down
yep. that's exactly the behavior i saw earlier, and will be figuring out first thing tomorrow morning. i bet it's an environment issue on the slaves.

On Thu, Sep 4, 2014 at 7:10 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Looks like during the last build (https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console) Jenkins was unable to execute a git fetch?
Re: amplab jenkins is down
it's looking like everything except the pull request builders is working. i'm going to be working on getting this resolved today.

On Fri, Sep 5, 2014 at 8:18 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Hmm, looks like at least some builds (https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19804/consoleFull) are working now, though this last one was from ~5 hours ago.
yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)
since the power incident last thursday, the github pull request builder plugin is still not really working 100%. i found an open issue w/jenkins[1] that could definitely be affecting us, so i will be pausing builds early thursday morning and then restarting jenkins. i'll send out a reminder tomorrow, and if this causes any problems for you, please let me know and we can work out a better time.

but, now for some good news! yesterday morning, we racked and stacked the systems for the new jenkins instance in the berkeley datacenter. tomorrow i should be able to log in to them and start getting them set up and configured. this is a major step in getting us into a much more 'production' style environment!

anyways: thanks for your patience, and i think we've all learned that hard powering down your build system is a definite recipe for disaster. :)

shane

[1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509
Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)
that's kinda what we're hoping as well. :)

On Wed, Sep 10, 2014 at 2:46 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
I'm looking forward to this. :) Looks like Jenkins is having trouble triggering builds for new commits or after user requests (e.g. https://github.com/apache/spark/pull/2339#issuecomment-55165937). Hopefully that will be resolved tomorrow. Nick
Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)
jenkins is now in quiet mode, and a restart is happening soon.
Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)
...and the restart is done.
Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)
you can just click on 'rebuild', if you'd like. what project specifically? (i had forgotten that i'd killed https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/557/, which i just started a rebuild on) On Thu, Sep 11, 2014 at 9:15 AM, Matthew Farrellee m...@redhat.com wrote: shane, is there anything we should do for pull requests that failed, but for unrelated issues? best, matt
FYI: jenkins systems patched to fix bash exploit
all of our systems were affected by the shellshock bug, and i've just patched everything w/the latest fix from redhat: https://access.redhat.com/articles/1200223 we're not running bash.x86_64 0:4.1.2-15.el6_5.2 on all of our systems. shane
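for anyone who wants to double-check a box themselves, the widely-published CVE-2014-6271 smoke test is a one-liner (this is the standard public check, not anything specific to our setup):

```shell
# shellshock (CVE-2014-6271) check: a vulnerable bash executes the
# trailing command while importing the function definition from the
# environment, so "vulnerable" leaks into the output.
out=$(env x='() { :;}; echo vulnerable' bash -c 'echo test' 2>/dev/null)
case "$out" in
  *vulnerable*) echo "bash is VULNERABLE" ;;
  *)            echo "bash looks patched" ;;
esac
```

a patched bash just prints "test" from the inner command and nothing else.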
Re: FYI: jenkins systems patched to fix bash exploit
we're not running bash.x86_64 0:4.1.2-15.el6_5.2 on all of our systems. s/not/now :)
jenkins downtime/system upgrade wednesday morning, 730am PDT
happy monday, everyone! remember a few weeks back when i upgraded jenkins, and unwittingly began DOSing our system due to massive log spam? well, that bug has been fixed w/the current release and i'd like to get our logging levels back to something more verbose than we have now. downtime will be from 730am-1000am PDT (i do expect this to be done well before 1000am). the update will be from 1.578 to 1.582, changelog here: http://jenkins-ci.org/changelog please let me know if there are any questions or concerns. thanks! shane, your friendly devops engineer
Re: FYI: i've doubled the jenkins executors for every build node
yeah, this is why i'm gonna keep a close eye on things this week... as for VMs vs containers, please do the latter more than the former. one of our longer-term plans here at the lab is to move most of our jenkins infra to VMs, and running tests w/nested VMs is Bad[tm]. On Mon, Sep 29, 2014 at 2:25 PM, Reynold Xin r...@databricks.com wrote: Thanks. We might see more failures due to contention on resources. Fingers crossed ... At some point it might make sense to run the tests in a VM or container. On Mon, Sep 29, 2014 at 2:20 PM, shane knapp skn...@berkeley.edu wrote: we were running at 8 executors per node, and BARELY even stressing the machines (32 cores, ~230G RAM). in the interest of actually using system resources, and giving ourselves some headroom, i upped the executors to 16 per node. i'll be keeping an eye on ganglia for the rest of the week to make sure everything's cool. i hope you all enjoy your freshly allocated capacity! :) shane
Re: jenkins downtime/system upgrade wednesday morning, 730am PDT
https://issues.apache.org/jira/browse/SPARK-3745 On Tue, Sep 30, 2014 at 10:22 AM, shane knapp skn...@berkeley.edu wrote: (this time, reply to all) nice catch. there's a bug in spark/dev/check-license, which i've confirmed from the CLI. i'll open a bug and PR to fix it. On Mon, Sep 29, 2014 at 8:00 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Just noticed these lines in the jenkins log = Running Apache RAT checks = Attempting to fetch rat Launching rat from /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar Error: Invalid or corrupt jarfile /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar RAT checks passed. Something wrong? Best, -- Nan Zhu
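the underlying failure mode -- a cached-but-corrupt jar that the script launches anyway and then reports success for -- is easy to guard against. a minimal sketch of the kind of check dev/check-license could do (the helper name is illustrative, not the actual script's function):

```shell
# sanity-check a cached jar before launching it: a jar is a zip archive,
# so it must start with the "PK" magic bytes. if the check fails, delete
# the cached copy and re-fetch instead of silently reporting success.
JAR="lib/apache-rat-0.10.jar"    # path from the jenkins log above
acquire_rat_jar() { :; }         # placeholder: the real script would wget/curl here
if [ "$(head -c 2 "$JAR" 2>/dev/null)" != "PK" ]; then
  echo "corrupt or missing rat jar: $JAR -- re-fetching" >&2
  rm -f "$JAR"
  acquire_rat_jar
fi
```

(`unzip -t` or a dry-run `java -jar` would be a stronger check; the magic-byte test just catches truncated downloads cheaply.)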
Re: jenkins downtime/system upgrade wednesday morning, 730am PDT
reminder: this is happening tomorrow morning. i will be putting jenkins in to quiet mode at ~7am, and then doing the upgrade once any stray builds finish.
Re: amplab jenkins is down
as of this morning, i've got the new jenkins up, with all of the current builds set up (but failing). i'm in the middle of playing setup/debug whack-a-mole, but we're getting there. my guess would be early next week for the switchover. On Wed, Oct 1, 2014 at 12:53 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Thu, Sep 4, 2014 at 4:19 PM, shane knapp skn...@berkeley.edu wrote: on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them. Are there any updates on this move of the Jenkins infrastructure to a managed datacenter? I remember it being mentioned that another benefit of this move would be reduced flakiness when Jenkins tries to checkout patches for testing. For some reason, I'm getting a lot of those https://github.com/apache/spark/pull/2606#issuecomment-57514540 today. Nick
emergency jenkins restart -- massive security patch released
https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2014-10-01 there's some pretty big stuff that's been identified and we need to get this upgraded asap. i'll be killing off what's currently running, and will retrigger them all once we're done. sorry for the inconvenience. shane
Re: emergency jenkins restart -- massive security patch released
update complete. i'm retriggering builds now.
Re: new jenkins update + tentative release date
AND WE ARE LIIIVE! https://amplab.cs.berkeley.edu/jenkins/ have at it, folks! On Mon, Oct 13, 2014 at 10:15 AM, shane knapp skn...@berkeley.edu wrote: quick update: we should be back up and running in the next ~60mins. On Mon, Oct 13, 2014 at 7:54 AM, shane knapp skn...@berkeley.edu wrote: Jenkins is in quiet mode and the move will be starting after i have my coffee. :) On Sun, Oct 12, 2014 at 11:26 PM, Josh Rosen rosenvi...@gmail.com wrote: Reminder: this Jenkins migration is happening tomorrow morning (Monday). On Fri, Oct 10, 2014 at 1:01 PM, shane knapp skn...@berkeley.edu wrote: reminder: this IS happening, first thing monday morning PDT. :) On Wed, Oct 8, 2014 at 3:01 PM, shane knapp skn...@berkeley.edu wrote: greetings! i've got some updates regarding our new jenkins infrastructure, as well as the initial date and plan for rolling things out: *** current testing/build break whack-a-mole: a lot of out of date artifacts are cached in the current jenkins, which has caused a few builds during my testing to break due to dependency resolution failure[1][2]. bumping these versions can cause your builds to fail, due to public api changes and the like. consider yourself warned that some projects might require some debugging... :) tomorrow, i will be at databricks working w/@joshrosen to make sure that the spark builds have any bugs hammered out. *** deployment plan: unless something completely horrible happens, THE NEW JENKINS WILL GO LIVE ON MONDAY (october 13th). all jenkins infrastructure will be DOWN for the entirety of the day (starting at ~8am). this means no builds, period. i'm hoping that the downtime will be much shorter than this, but we'll have to see how everything goes. all test/build history WILL BE PRESERVED. i will be rsyncing the jenkins jobs/ directory over, complete w/history as part of the deployment. once i'm feeling good about the state of things, i'll point the original url to the new instances and send out an all clear. 
if you are a student at UC berkeley, you can log in to jenkins using your LDAP login, and (by default) view but not change plans. if you do not have a UC berkeley LDAP login, you can still view plans anonymously. IF YOU ARE A PLAN ADMIN, THEN PLEASE REACH OUT, ASAP, PRIVATELY AND I WILL SET UP ADMIN ACCESS TO YOUR BUILDS. *** post deployment plan: fix all of the things that break! i will be keeping a VERY close eye on the builds, checking for breaks, and helping out where i can. if the situation is dire, i can always roll back to the old jenkins infra... but i hope we never get to that point! :) i'm hoping that things will go smoothly, but please be patient as i'm certain we'll hit a few bumps in the road. please let me know if you guys have any comments/questions/concerns... :) shane 1 - https://github.com/bigdatagenomics/bdg-services/pull/18 2 - https://github.com/bigdatagenomics/avocado/pull/111
Re: new jenkins update + tentative release date
On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for doing this work Shane. So is Jenkins in the new datacenter now? Do you know if the problems with checking out patches from GitHub should be resolved now? Here's an example from the past hour https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console . yeah, i just noticed that we're still having the checkout issues. i was really hoping that the better network would just make this go away... guess i'll be doing a deeper dive now. i would just up the timeout, but that's not coming out for a little while yet: https://issues.jenkins-ci.org/browse/JENKINS-20387 (we are currently running the latest -- 2.2.7, and the timeout field is coming in 2.3, whenever that is) i'll try and strace/replicate it locally as well.
Re: new jenkins update + tentative release date
ok, i found something that may help: https://issues.jenkins-ci.org/browse/JENKINS-20445?focusedCommentId=195638&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-195638 i set this to 20 minutes... let's see if that helps. On Mon, Oct 13, 2014 at 2:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, that sucks. Thank you for looking into this. On Mon, Oct 13, 2014 at 5:43 PM, shane knapp skn...@berkeley.edu wrote: yeah, i just noticed that we're still having the checkout issues. i was really hoping that the better network would just make this go away... guess i'll be doing a deeper dive now.
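for reference, the workaround in that JENKINS-20445 comment amounts to raising the git-client plugin's default 10-minute clone/fetch timeout via a JVM system property at jenkins startup, roughly like this (property name as given in that thread; treat it as an assumption for your plugin version):

```shell
# start jenkins with a 20-minute git timeout instead of the default 10.
# the property is read by the git-client plugin; the value is in minutes.
java -Dorg.jenkinsci.plugins.gitclient.Git.timeOut=20 -jar jenkins.war
```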
short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to see if that helps w/the git fetch timeouts. this will require a short downtime (~20 mins for builds to finish, ~20 mins to downgrade), and will hopefully give us some insight in to wtf is going on. thanks for your patience... shane
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
ok, we're up and building... :crossesfingersfortheumpteenthtime: On Wed, Oct 15, 2014 at 1:59 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I support this effort. :thumbsup: -- You received this message because you are subscribed to the Google Groups amp-infra group. To unsubscribe from this group and stop receiving emails from it, send an email to amp-infra+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
four builds triggered and no timeouts. :crossestoes: :)
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
ok, we've had about 10 spark pull request builds go through w/o any git timeouts. it seems that the git timeout issue might be licked. i will definitely be keeping an eye on this for the next few days. thanks for being patient! shane
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
the bad news is that we've had a couple more failures due to timeouts, but the good news is that the frequency at which these happen has decreased significantly (3 in the past ~18hr). seems like the git plugin downgrade has helped relieve the problem, but hasn't fixed it. i'll be looking in to this more today. On Wed, Oct 15, 2014 at 7:05 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: A quick scan through the Spark PR board https://spark-prs.appspot.com/ shows no recent failures related to this git checkout problem. Looks promising! Nick
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
yeah, at this point it might be worth trying. :) the absolutely irritating thing is that i am not seeing this happen w/any jobs other than the spark prb, nor does it seem to correlate w/time of day, network or system load, or what slave it runs on. nor are we hitting our limit of connections on github. i really, truly hate non-deterministic failures. i'm also going to write an email to support@github and see if they have any insight in to this as well. On Thu, Oct 16, 2014 at 12:51 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for continuing to look into this, Shane. One suggestion that Patrick brought up, if we have trouble getting to the bottom of this, is doing the git checkout ourselves in the run-tests-jenkins script and cutting out the Jenkins git plugin entirely. That way we can script retries and post friendlier messages about timeouts if they still occur by ourselves. Do you think that's worth trying at some point? Nick
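the retry idea Patrick and Nick floated is simple to sketch. nothing like this exists in run-tests-jenkins yet; the function name and usage below are purely illustrative:

```shell
# generic bounded-retry wrapper: run a command up to N times, returning
# success on the first attempt that works and failure if every attempt fails.
with_retries() {
  attempts="$1"; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    if [ "$i" -lt "$attempts" ]; then
      echo "attempt $i/$attempts failed, retrying..." >&2
    fi
    i=$((i + 1))
  done
  return 1
}

# e.g., in place of the git plugin's single fetch attempt:
# with_retries 3 git fetch https://github.com/apache/spark.git pull/2606/head
```

wrapping the fetch this way would also let the script post a friendlier "github timed out, retried N times" message to the PR instead of a raw plugin stack trace.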
Re: something wrong with Jenkins or something untested merged?
ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which fixed the SparkR build but apparently made Spark itself quite unhappy. i removed that JDK, triggered a build ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console), and it compiled kinesis w/o dying a fiery death. apparently 7u71 is stricter when compiling. sad times. sorry about that! shane On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com wrote: The failure is in the Kinesis component, can you reproduce this if you build with -Pkinesis-asl? - Patrick On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu wrote: hmm, strange. i'll take a look. On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com wrote: yes, I can compile locally, too but it seems that Jenkins is not happy now... https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ All failed to compile Best, -- Nan Zhu On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote: I performed a build on the latest master branch but didn't get a compilation error. FYI On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, I just submitted a patch https://github.com/apache/spark/pull/2864/files with a one-line change, but Jenkins told me it failed to compile on unrelated files? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console Best, Nan
Re: something wrong with Jenkins or something untested merged?
thanks, patrick! :) On Mon, Oct 20, 2014 at 5:35 PM, Patrick Wendell pwend...@gmail.com wrote: I created an issue to fix this: https://issues.apache.org/jira/browse/SPARK-4021 On Mon, Oct 20, 2014 at 5:32 PM, Patrick Wendell pwend...@gmail.com wrote: Thanks Shane - we should fix the source code issues in the Kinesis code that made stricter Java compilers reject it. - Patrick
Re: something wrong with Jenkins or something untested merged?
i'm currently in a meeting and will be starting to do some tests in ~1 hour or so. On Tue, Oct 21, 2014 at 11:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote: I agree with Sean I just compiled spark core successfully with 7u71 in Mac OS X On Tue, Oct 21, 2014 at 1:11 PM, Josh Rosen rosenvi...@gmail.com wrote: Ah, that makes sense. I had forgotten that there was a JIRA for this: https://issues.apache.org/jira/browse/SPARK-4021 On October 21, 2014 at 10:08:58 AM, Patrick Wendell (pwend...@gmail.com) wrote: Josh - the errors that broke our build indicated that JDK5 was being used. Somehow the upgrade caused our build to use a much older Java version. See the JIRA for more details. On Tue, Oct 21, 2014 at 10:05 AM, Josh Rosen rosenvi...@gmail.com wrote: I find it concerning that there's a JDK version that breaks out build, since we're supposed to support Java 7. Is 7u71 an upgrade or downgrade from the JDK that we used before? Is there an easy way to fix our build so that it compiles with 7u71's stricter settings? I'm not sure why the New PRB is failing here. It was originally created as a clone of the main pull request builder job. I checked the configuration history and confirmed that there aren't any settings that we've forgotten to copy over (e.g. their configurations haven't diverged), so I'm not sure what's causing this. - Josh On October 21, 2014 at 6:35:39 AM, Nan Zhu (zhunanmcg...@gmail.com) wrote: weird.two buildings (one triggered by New, one triggered by Old) were executed in the same node, amp-jenkins-slave-01, one compiles, one not... 
Best, -- Nan Zhu On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote: seems that all PRs built by NewSparkPRBuilder suffers from 7u71, while SparkPRBuilder is working fine Best, -- Nan Zhu On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote: It's a new pull request builder written by Josh, integrated into our state-of-the-art PR dashboard :) On 10/21/14 9:33 PM, Nan Zhu wrote: just curious...what is this NewSparkPullRequestBuilder? Best, -- Nan Zhu On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote: Hm, seems that 7u71 comes back again. Observed similar Kinesis compilation error just now: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71. However, /usr/bin/javac -version says: Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM Corp 2000, 2008. All rights reserved. Which JDK is actually used by Jenkins? Cheng On 10/21/14 8:28 AM, shane knapp wrote: ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which fixed the SparkR build but apparently made Spark itself quite unhappy. i removed that JDK, triggered a build ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console), and it compiled kinesis w/o dying a fiery death. apparently 7u71 is stricter when compiling. sad times. sorry about that! shane On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com (mailto: pwend...@gmail.com) wrote: The failure is in the Kinesis compoent, can you reproduce this if you build with -Pkinesis-asl? - Patrick On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu (mailto: skn...@berkeley.edu) wrote: hmm, strange. i'll take a look. On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com (mailto: zhunanmcg...@gmail.com) wrote: yes, I can compile locally, too but it seems that Jenkins is not happy now... 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ All failed to compile Best, -- Nan Zhu On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote: I performed a build on the latest master branch but didn't get a compilation error. FYI On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, I just submitted a patch https://github.com/apache/spark/pull/2864/files with a one-line change, but Jenkins told me it failed to compile in unrelated files: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console Best, Nan
Re: something wrong with Jenkins or something untested merged?
ok, i did some testing and found out what's happening. https://issues.apache.org/jira/browse/SPARK-4021 here's the TL;DR: jenkins ignores what JDKs are installed via the web interface when there's more than one defined, and falls back to whatever is default on the slave the test is run on. in this case, it's openjdk 7u65... and spark compilation fails. i've removed the 2nd JDK (7u71) from jenkins, and everything is back to normal.
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
i've seen a few more builds fail w/timeouts and it appears that we're definitely NOT hitting any rate limiting. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22005/console [jenkins@amp-jenkins-slave-01 ~]$ curl -i -H "Authorization: token REDACTED" https://api.github.com | grep Rate X-RateLimit-Limit: 5000 X-RateLimit-Remaining: 4997 X-RateLimit-Reset: 1413929848 Access-Control-Expose-Headers: ETag, Link, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval On Sat, Oct 18, 2014 at 12:44 AM, Davies Liu dav...@databricks.com wrote: Cool, the most recent 4 builds used the new configs, thanks! Let's run more builds. Davies On Fri, Oct 17, 2014 at 11:06 PM, Josh Rosen rosenvi...@gmail.com wrote: I think that the fix was applied. Take a look at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull Here, I see a fetch command that mentions this specific PR branch rather than the wildcard that we had before: git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/2840/*:refs/remotes/origin/pr/2840/* # timeout=15 Do you have an example of a Spark PRB build that’s still failing with the old fetch failure? - Josh On October 17, 2014 at 11:03:14 PM, Davies Liu (dav...@databricks.com) wrote: How can we know the changes have been applied? I had checked several recent builds, and they all used the original configs. Davies On Fri, Oct 17, 2014 at 6:17 PM, Josh Rosen rosenvi...@gmail.com wrote: FYI, I edited the Spark Pull Request Builder job to try this out. Let’s see if it works (I’ll be around to revert if it doesn’t). 
On October 17, 2014 at 5:26:56 PM, Davies Liu (dav...@databricks.com) wrote: One finding is that all the timeouts happened with this command: git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/*:refs/remotes/origin/pr/* I'm thinking that this may be an expensive call; we could try a cheaper one: git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/XXX/*:refs/remotes/origin/pr/XXX/* XXX is the pull request ID. The configuration supports parameters [1], so we could put this in: +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/* I have not tested this yet, could you give it a try? Davies [1] https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin On Fri, Oct 17, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote: actually, nvm, you have to run that command from our servers for it to affect our limit. run it all you want from your own machines! :P On Fri, Oct 17, 2014 at 4:59 PM, shane knapp skn...@berkeley.edu wrote: yep, and i will tell you guys ONLY if you promise to NOT try this yourselves... checking the rate limit also counts as a hit and increments our numbers: # curl -i https://api.github.com/users/whatever 2>/dev/null | egrep '^X-Rate' X-RateLimit-Limit: 60 X-RateLimit-Remaining: 51 X-RateLimit-Reset: 1413590269 (yes, that is the exact url that they recommended on the github site lol) so, earlier today, we had a spark build fail w/a git timeout at 10:57am, but there were only ~7 builds run that hour, so that points to us NOT hitting the rate limit... at least for this fail. whee! is it beer-thirty yet? shane On Fri, Oct 17, 2014 at 4:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Wow, thanks for this deep dive Shane. Is there a way to check if we are getting hit by rate limiting directly, or do we need to contact GitHub for that? 
On Friday, October 17, 2014, shane knapp skn...@berkeley.edu wrote: quick update: here are some stats i scraped over the past week of ALL pull request builder projects and timeout failures. due to the large number of spark ghprb jobs, i don't have great records earlier than oct 7th. the data is current up until ~230pm today: spark and new spark ghprb total builds vs git fetch timeouts: $ for x in 10-{09..17}; do passed=$(grep $x SORTED.passed | grep -i spark | wc -l); failed=$(grep $x SORTED | grep -i spark | wc -l); let total=passed+failed; fail_percent=$(echo "scale=2; $failed/$total" | bc | sed 's/^\.//g'); line="$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%"; echo -e "$line"; done 10-09 -- total builds: 140 p/f: 92/48 fail%: 34% 10-10 -- total builds: 65 p/f: 59/6 fail%: 09% 10-11 -- total builds: 29 p/f: 29/0 fail%: 0% 10-12 -- total builds: 24 p/f: 21/3 fail%: 12% 10-13 -- total builds: 39 p/f: 35/4 fail%: 10% 10-14 -- total builds: 7 p/f: 5/2 fail%: 28% 10-15 -- total builds: 37 p/f: 34/3 fail%: 08% 10-16 -- total builds: 71 p/f: 59/12 fail%: 16% 10-17 -- total builds: 26 p/f: 20/6 fail%: 23% all other ghprb builds vs git fetch timeouts
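davies' narrowed refspec can be sanity-checked without touching github at all. the sketch below stands up a throwaway local repo in place of github.com/apache/spark (paths and committer identity are invented; PR number 2840 is borrowed from the thread), gives it a GitHub-style refs/pull/N/head ref, and fetches only that one PR's refs instead of the wildcard:

```shell
# sketch only: local stand-ins for the github repo and the jenkins workspace
set -e
tmp=$(mktemp -d)

# "upstream" plays the role of the github repo; give it one commit and a
# github-style PR ref (github advertises every PR as refs/pull/<id>/head)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "base"
git -C "$tmp/upstream" update-ref refs/pull/2840/head HEAD

# "builder" plays the jenkins workspace: fetch ONLY pr 2840's refs, rather
# than the +refs/pull/*:... wildcard that drags in every PR ever opened
git init -q "$tmp/builder"
git -C "$tmp/builder" remote add origin "$tmp/upstream"
git -C "$tmp/builder" fetch -q origin '+refs/pull/2840/*:refs/remotes/origin/pr/2840/*'

# the narrow fetch created exactly the ref the build needs
git -C "$tmp/builder" rev-parse --verify refs/remotes/origin/pr/2840/head
```

in the real job the literal 2840 becomes `${ghprbPullId}`, which the ghprb plugin injects per build.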
your weekly git timeout update! TL;DR: i'm now almost certain we're not hitting rate limits.
so, things look like they've stabilized significantly over the past 10 days, and without any changes on our end: snip $ /root/tools/get_timeouts.sh 10 timeouts by date: 2014-10-14 -- 2 2014-10-16 -- 1 2014-10-19 -- 1 2014-10-20 -- 2 2014-10-23 -- 5 timeouts by project: 5 NewSparkPullRequestBuilder 5 SparkPullRequestBuilder 1 Tachyon-Pull-Request-Builder total builds (excluding those aborted by a user): 602 total percentage of builds timing out: 01 /snip the NewSparkPullRequestBuilder failures are spread over five different days (10-14 through 10-20), and the SparkPullRequestBuilder failures all happened yesterday. there were a LOT of SparkPullRequestBuilder builds yesterday (60), and the failures happened during these hours (first number == number of builds failed, second number == hour of the day): snip $ cat timeouts-102414-130817 | grep SparkPullRequestBuilder | grep 2014-10-23 | awk '{print $3}' | awk -F: '{print $1}' | sort | uniq -c 1 03 2 20 1 22 1 23 /snip however, the number of total SparkPullRequestBuilder builds during these times doesn't seem egregious: snip 4 03 9 20 4 22 9 23 /snip nor does the total for ALL builds at those times: snip 5 03 9 20 7 22 11 23 /snip 9 builds was the largest number of SparkPullRequestBuilder builds per hour, but there were other hours with 5, 6 or 7 builds/hour that didn't have a timeout issue. in fact, hour 16 (4pm) had the most builds running total yesterday, which includes 7 SparkPullRequestBuilder builds, and nothing timed out. most of the pull request builder hits on github are authenticated w/an oauth token. this gives us 5000 hits/hour, and unauthed gives us 60/hour. in conclusion: there is no way we are hitting github often enough to be rate limited. i think i've finally ruled that out completely. :)
jenkins downtime tomorrow morning ~6am-8am PDT
i'll be bringing jenkins down tomorrow morning for some system maintenance and to get our backups kicked off. i do expect to have the system back up and running before 8am. please let me know ASAP if i need to reschedule this. thanks, shane
jenkins emergency restart now, was Re: jenkins downtime tomorrow morning ~6am-8am PDT
so, i'm having a race condition between a plugin i installed putting jenkins into quiet mode and it failing to perform a backup from this past weekend. i'll need to restart the process and get it out of the constantly-in-quiet-mode cycle it's in now. this will be quick, and i'll restart the jobs i've killed. this DOES NOT affect the restart/maintenance tomorrow morning. sorry about the inconvenience, shane
Re: jenkins emergency restart now, was Re: jenkins downtime tomorrow morning ~6am-8am PDT
ok we're back up and building. i've retriggered the jobs i killed.
Re: jenkins downtime tomorrow morning ~6am-8am PDT
this is done, and jenkins is up and building again.
[important] jenkins down
i noticed that there were no builds, and that jenkins is throwing a bunch of exceptions in the log file. i'm looking into this right now and will update when i get things rolling again. sorry for the inconvenience, shane
Re: [important] jenkins down
ok, we're back up and building now... looks like there was a seriously bad git (or github) plugin update that caused all sorts of unintended consequences, mostly with cron stacktracing. i'll take a closer look and see if i can find out exactly what happened, but suffice it to say, we'll be really cautious when updating even recommended plugins. sorry for the disruption! shane
jenkins downtime: 730-930am, 12/12/14
i'll send out a reminder next week, but i wanted to give a heads up: i'll be bringing down the entire jenkins infrastructure for reboots and system updates. please let me know if there are any conflicts with this, thanks! shane
adding new jenkins worker nodes to eventually replace existing ones
i just turned up a new jenkins slave (amp-jenkins-worker-01) to ensure it builds properly. these machines have half the ram, the same number of processors and more disk, which will hopefully help us achieve more than the ~15-20% system utilization we're getting on the current amp-jenkins-slave-{01..05} nodes. instead of 5 super beefy slaves w/16 workers each, we're planning on 8 less beefy slaves w/12 workers each. this should definitely cut down on the build queue, and not impact build times in a negative way at all. i'll keep a close eye on amp-jenkins-worker-01 before i start releasing the other seven into the wild. there should be minimal user impact, but if i happen to miss something, please don't hesitate to let me know! thanks, shane
Re: adding new jenkins worker nodes to eventually replace existing ones
forgot to install git on this node. /headdesk i retriggered the failed spark prb jobs.
Re: jenkins downtime: 730-930am, 12/12/14
reminder -- this is happening friday morning @ 730am!
Re: jenkins downtime: 730-930am, 12/12/14
reminder: jenkins is going down NOW. On Thu, Dec 11, 2014 at 3:08 PM, shane knapp skn...@berkeley.edu wrote: here's the plan... reboots, of course, come last. :) pause build queue at 7am, kill off (and eventually retrigger) any stragglers at 8am. then begin maintenance: all systems: * yum update all servers (amp-jenkins-master, amp-jenkins-slave-{01..05}, amp-jenkins-worker-{01..08}) * reboots jenkins slaves: * install python2.7 (alongside 2.6, which would remain the default) * install numpy 1.9.1 (currently on 1.4, breaking some spark branch builds) * add new slaves to the master, remove old ones (keep them around just in case) there will be no jenkins system or plugin upgrades at this time. things there seem to be working just fine! i'm expecting to be up and building by 9am at the latest. i'll update this thread w/any new time estimates. word. shane, your rained-in devops guy :)
Re: jenkins downtime: 730-930am, 12/12/14
downtime is extended to 10am PST so that i can finish testing the numpy upgrade... besides that, everything looks good and the system updates and reboots went off w/o a hitch. shane
Re: jenkins downtime: 730-930am, 12/12/14
ok, we're back up w/all new jenkins workers. i'll be keeping an eye on these pretty closely today for any build failures caused by the new systems, and if things look bleak, i'll switch back to the original five. thanks for your patience!
Re: jenkins downtime: 730-930am, 12/12/14
josh rosen has this PR open to address the streaming test failures: https://github.com/apache/spark/pull/3687 On Sun, Dec 14, 2014 at 8:21 AM, WangTaoTheTonic barneystin...@aliyun.com wrote: Jenkins is still not available now, as some unit tests (about streaming) fail all the time. Does it have something to do with this update? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Archiving XML test reports for analysis
right now, the following logs are archived onto the master: local log_files=$(find . -name unit-tests.log -o -path ./sql/hive/target/HiveCompatibilitySuite.failed -o -path ./sql/hive/target/HiveCompatibilitySuite.hiveFailed -o -path ./sql/hive/target/HiveCompatibilitySuite.wrong) regarding dumping stuff to S3 -- thankfully, since we're not looking at a lot of disk usage, i don't see a problem w/this. we could tar/zip up the XML for each build and just dump it there. what builds are we thinking about? the spark pull request builder? what others? On Mon, Dec 15, 2014 at 1:33 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Every time we run a test cycle on our Jenkins cluster, we generate hundreds of XML reports covering all the tests we have (e.g. `streaming/target/test-reports/org.apache.spark.streaming.util.WriteAheadLogSuite.xml`). These reports contain interesting information about whether tests succeeded or failed, and how long they took to complete. There is also detailed information about the environment they ran in. It might be valuable to have a window into all these reports across all Jenkins builds and across all time, and use that to track basic statistics about our tests. That could give us basic insight into which tests are flaky or slow, and perhaps drive other improvements to our testing infrastructure that we can't see just yet. Do people think that would be valuable? Do we already have something like this? I'm thinking for starters it might be cool if we automatically uploaded all the XML test reports from the Master and the Pull Request builders to an S3 bucket and just opened it up for the dev community to analyze. Nick
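as a rough illustration of the flaky/slow stats nick describes, a junit-style report can be boiled down with nothing fancier than grep and awk. the report below is invented (two tests, one failure), not real build output:

```shell
set -e
cd "$(mktemp -d)"

# fake junit-style report in the shape surefire/scalatest emit (contents invented)
cat > org.apache.spark.streaming.util.WriteAheadLogSuite.xml <<'EOF'
<testsuite name="org.apache.spark.streaming.util.WriteAheadLogSuite" tests="2" failures="1">
  <testcase name="write and read back" time="0.41"/>
  <testcase name="recover after failure" time="2.07">
    <failure message="expected 3, got 2"/>
  </testcase>
</testsuite>
EOF

# total wall time of the suite -- the "slow tests" signal
grep -o 'time="[0-9.]*"' org.apache.spark.streaming.util.WriteAheadLogSuite.xml |
  cut -d'"' -f2 | awk '{s += $1} END {printf "%.2f\n", s}'   # prints 2.48

# failure count -- the "flaky tests" signal
grep -c '<failure' org.apache.spark.streaming.util.WriteAheadLogSuite.xml   # prints 1
```

aggregated per suite across builds, those two numbers alone would answer most of the "which tests are flaky or slow" question.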
Re: Archiving XML test reports for analysis
i have no problem w/storing all of the logs. :) i also have no problem w/donated S3 buckets. :) On Mon, Dec 15, 2014 at 2:39 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: How about all of them https://amplab.cs.berkeley.edu/jenkins/view/Spark/? How much data per day would it roughly be if we uploaded all the logs for all these builds? Also, would Databricks be willing to offer up an S3 bucket for this purpose? Nick
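mechanically, the tar-and-dump approach is tiny. a sketch only: the workspace layout below is faked, the bucket name is made up, and the aws cli line stays commented since it needs real credentials:

```shell
set -e
cd "$(mktemp -d)"

# stand-in workspace with one report (real builds have hundreds of these)
mkdir -p streaming/target/test-reports
printf '<testsuite/>\n' > streaming/target/test-reports/demo.xml

# one archive per build; BUILD_NUMBER is the variable jenkins exports into
# every job (defaulted here so the sketch runs outside jenkins)
BUILD_NUMBER=${BUILD_NUMBER:-21943}
find . -path '*/test-reports/*.xml' | tar czf "reports-$BUILD_NUMBER.tgz" -T -

# aws s3 cp "reports-$BUILD_NUMBER.tgz" s3://spark-test-reports/$BUILD_NUMBER/  # hypothetical bucket
tar tzf "reports-$BUILD_NUMBER.tgz"
```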
Re: Jenkins install reference
here's the wiki describing the system setup: https://cwiki.apache.org/confluence/display/SPARK/Spark+QA+Infrastructure we have 1 master and 8 worker nodes, 12 executors per worker (we'd be better off w/more and smaller worker nodes, however). you don't need to install sbt -- it's in the build/ directory. the pull request builder builds in parallel, but the master builds require specific ports to be reserved and each build effectively locks down a worker until it's done. since we have 8 worker nodes, it's not *that* big of a deal... shane On Tue, Feb 3, 2015 at 4:36 AM, scwf wangf...@huawei.com wrote: Here are my questions: 1. How do we set up Jenkins to build multiple PRs in parallel? Or can one machine only build one PR at a time? 2. Do we need to install sbt on the CI machine, since the dev/run-tests script fetches the sbt jar automatically? - Fei On 2015/2/3 15:53, scwf wrote: Hi, all, we want to set up a CI env for spark in our team. Is there any reference for how to install jenkins for spark? Thanks Fei
Re: spark 1.3 sbt build seems to be broken
here's the hash of the breaking commit: Started on Feb 5, 2015 12:01:01 PM Using strategy: Default [poll] Last Built Revision: Revision de112a2096a2b84ce2cac112f12b50b5068d6c35 (refs/remotes/origin/branch-1.3) git ls-remote -h https://github.com/apache/spark.git branch-1.3 # timeout=10 [poll] Latest remote head revision is: fba2dc663a644cfe76a744b5cace93e9d6646a25 Done. Took 2.5 sec Changes found from: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/18/pollingLog/
spark 1.3 sbt build seems to be broken
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/ we're seeing java OOMs and heap space errors: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/19/console https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/18/console memory leak? i checked the systems (ganglia + logging in and 'free -g') and there's nothing going on there. 20 is building right now: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/20/console
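for what it's worth, OOM/heap-space failures in sbt builds of this era are commonly worked around by widening the JVM flags sbt launches with. a sketch of the knob only -- the values below are illustrative, not what these jenkins jobs actually use:

```shell
# sbt's launcher picks up extra JVM flags from SBT_OPTS; branch-1.3-era builds
# run on java 7, which still has a permgen, hence MaxPermSize next to the heap bump
export SBT_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=256m"
echo "$SBT_OPTS"
```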
Re: quick jenkins restart tomorrow morning, ~7am PST
i'm actually going to do this now -- it's really quiet today. there are two spark pull request builds running, which i will kill and retrigger once jenkins is back up: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27689/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27690/
quick jenkins restart tomorrow morning, ~7am PST
i'll be kicking jenkins to up the open file limits on the workers. it should be a very short downtime, and i'll post updates on my progress tomorrow. shane
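for context, raising per-process open file limits on a linux worker usually means a pam limits entry plus a restart of the jenkins agent so child builds inherit the new value. a sketch with illustrative numbers (the limits.d file name and values are hypothetical, not the workers' actual config):

```shell
# contents for /etc/security/limits.d/99-jenkins.conf (hypothetical path/values):
#   jenkins  soft  nofile  16384
#   jenkins  hard  nofile  16384
# after restarting the agent process, builds inherit the new limit;
# check the limit in effect for the current shell with:
ulimit -n
```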
Re: emergency jenkins restart soon
the master builds triggered around ~1am last night (according to the logs), so it looks like we're back in business.
Re: emergency jenkins restart soon
jenkins is back up and all builds have been retriggered... things are building and looking good, and i'll keep an eye on the spark master builds tonite and tomorrow. On Wed, Jan 28, 2015 at 9:56 PM, shane knapp skn...@berkeley.edu wrote: the spark master builds stopped triggering ~yesterday and the logs don't show anything. i'm going to give the current batch of spark pull request builder jobs a little more time (~30 mins) to finish, then kill whatever is left and restart jenkins. anything that was queued or killed will be retriggered once jenkins is back up. sorry for the inconvenience, we'll get this sorted asap. thanks, shane
Re: emergency jenkins restart soon
np! the master builds haven't triggered yet, but let's give the rube goldberg machine a minute to get its bearings. On Wed, Jan 28, 2015 at 10:31 PM, Reynold Xin r...@databricks.com wrote: Thanks for doing that, Shane! On Wed, Jan 28, 2015 at 10:29 PM, shane knapp skn...@berkeley.edu wrote: jenkins is back up and all builds have been retriggered... things are building and looking good, and i'll keep an eye on the spark master builds tonite and tomorrow. On Wed, Jan 28, 2015 at 9:56 PM, shane knapp skn...@berkeley.edu wrote: the spark master builds stopped triggering ~yesterday and the logs don't show anything. i'm going to give the current batch of spark pull request builder jobs a little more time (~30 mins) to finish, then kill whatever is left and restart jenkins. anything that was queued or killed will be retriggered once jenkins is back up. sorry for the inconvenience, we'll get this sorted asap. thanks, shane
adding some temporary jenkins worker nodes...
...to help w/the build backlog. let's all welcome amp-jenkins-slave-{01..03} back to the fray!
jenkins redirect down (but jenkins is up!), lots of potential
UC Berkeley had some major maintenance done this past weekend, and long story short, not everything came back. our primary webserver's NFS is down and that means we're not serving websites, meaning that the redirect to jenkins is failing. jenkins is still up, and building some jobs, but we will probably see pull request builder failures, and other transient issues. SCM-polling builds should be fine. there is no ETA on when this will be fixed, but once our amplab.cs.berkeley.edu/jenkins redir is working, i will let everyone know. i'm trying to get more status updates as they come. i'm really sorry about the inconvenience. shane
Re: jenkins redirect down (but jenkins is up!), lots of potential
the regular url is working now, thanks for your patience. On Mon, Jan 5, 2015 at 2:25 PM, Josh Rosen rosenvi...@gmail.com wrote: The pull request builder and SCM-polling builds appear to be working fine, but the links in pull request comments won't work because the AMP Lab webserver is still down. In the meantime, though, you can continue to access Jenkins through https://hadrian.ist.berkeley.edu/jenkins/ On Mon, Jan 5, 2015 at 10:37 AM, shane knapp skn...@berkeley.edu wrote: UC Berkeley had some major maintenance done this past weekend, and long story short, not everything came back. our primary webserver's NFS is down and that means we're not serving websites, meaning that the redirect to jenkins is failing. jenkins is still up, and building some jobs, but we will probably see pull request builder failures, and other transient issues. SCM-polling builds should be fine. there is no ETA on when this will be fixed, but once our amplab.cs.berkeley.edu/jenkins redir is working, i will let everyone know. i'm trying to get more status updates as they come. i'm really sorry about the inconvenience. shane
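A quick way to tell "jenkins is down" apart from "only the redirect is down" in an outage like this is to probe both the redirected URL and the direct one Josh mentioned. A hedged sketch (an http code of 000 means no response at all; the codes you see are obviously situation-dependent):

```shell
# probe both URLs and print the HTTP status for each. if the direct URL
# answers but the amplab redirect does not, the problem is the webserver
# redirect, not jenkins itself.
for url in https://amplab.cs.berkeley.edu/jenkins/ https://hadrian.ist.berkeley.edu/jenkins/; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$url -> $code"
done
```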
Re: extended jenkins downtime monday, march 16th, plus some hints at the future
ok, we're back up and building. upgrading the github plugin (and possibly EnvInject) caused the stacktraces, so i've kept those at the old versions that were working before. jenkins and the rest of the plugins are updated and we're g2g. i'll be, of course, keeping an eye on things today and will squash anything else that pops up. On Mon, Mar 16, 2015 at 9:06 AM, shane knapp skn...@berkeley.edu wrote: looks like we're having some issues w/the pull request builder and cron stacktraces in the logs. i'll be investigating further and will update when i figure out what's going on. On Mon, Mar 16, 2015 at 7:51 AM, shane knapp skn...@berkeley.edu wrote: this is starting now. On Fri, Mar 13, 2015 at 10:12 AM, shane knapp skn...@berkeley.edu wrote: i'll be taking jenkins down for some much-needed plugin updates, as well as potentially upgrading jenkins itself. this will start at 730am PDT, and i'm hoping to have everything up by noon. the move to the anaconda python will take place in the next couple of weeks as i'm in the process of rebuilding my staging environment (much needed) to better reflect production, and allow me to better test the change. and finally, some teasers for what's coming up in the next month or so: * move to a fully puppetized environment (yay no more shell script deployments!) * virtualized workers (including multiple OSes -- OS X, ubuntu, ..., profit?) more details as they come. happy friday! shane
Re: extended jenkins downtime monday, march 16th, plus some hints at the future
this is starting now. On Fri, Mar 13, 2015 at 10:12 AM, shane knapp skn...@berkeley.edu wrote: i'll be taking jenkins down for some much-needed plugin updates, as well as potentially upgrading jenkins itself. this will start at 730am PDT, and i'm hoping to have everything up by noon. the move to the anaconda python will take place in the next couple of weeks as i'm in the process of rebuilding my staging environment (much needed) to better reflect production, and allow me to better test the change. and finally, some teasers for what's coming up in the next month or so: * move to a fully puppetized environment (yay no more shell script deployments!) * virtualized workers (including multiple OSes -- OS X, ubuntu, ..., profit?) more details as they come. happy friday! shane
extended jenkins downtime monday, march 16th, plus some hints at the future
i'll be taking jenkins down for some much-needed plugin updates, as well as potentially upgrading jenkins itself. this will start at 730am PDT, and i'm hoping to have everything up by noon. the move to the anaconda python will take place in the next couple of weeks as i'm in the process of rebuilding my staging environment (much needed) to better reflect production, and allow me to better test the change. and finally, some teasers for what's coming up in the next month or so: * move to a fully puppetized environment (yay no more shell script deployments!) * virtualized workers (including multiple OSes -- OS X, ubuntu, ..., profit?) more details as they come. happy friday! shane
jenkins httpd being flaky
we just started having issues when visiting jenkins and getting 503 service unavailable errors. i'm on it and will report back with an all-clear.
Re: jenkins httpd being flaky
ok, things seem to have stabilized... httpd hasn't flaked since ~noon, the hanging PRB job on amp-jenkins-worker-06 was removed w/the restart and things are now building. i cancelled and retriggered a bunch of PRB builds, btw: 4848 (https://github.com/apache/spark/pull/3699) 5922 (https://github.com/apache/spark/pull/4733) 5987 (https://github.com/apache/spark/pull/4986) 6222 (https://github.com/apache/spark/pull/4964) 6325 (https://github.com/apache/spark/pull/5018) as well as: spark-master-maven-with-yarn sorry for the inconvenience... i'm still a little stumped as to what happened, but i think it was a confluence of events (httpd flaking, problems at github, mercury in retrograde, friday thinking it's monday). shane On Fri, Mar 13, 2015 at 1:08 PM, shane knapp skn...@berkeley.edu wrote: i tried a couple of things, but will also be doing a jenkins reboot as soon as the current batch of builds finish. On Fri, Mar 13, 2015 at 12:40 PM, shane knapp skn...@berkeley.edu wrote: ok we have a few different things happening: 1) httpd on the jenkins master is randomly (though not currently) flaking out and causing visits to the site to return a 503. nothing in the logs shows any problems. 2) there are some github timeouts, which i tracked down and think it's a problem with github themselves (see: https://status.github.com/ and scroll down to 'mean hook delivery time') 3) we have one spark job w/a strange ivy lock issue, that i just retriggered (https://github.com/apache/spark/pull/4964) 4) there's an errant, unkillable pull request builder job ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28574/console ) more updates forthcoming. On Fri, Mar 13, 2015 at 12:04 PM, shane knapp skn...@berkeley.edu wrote: we just started having issues when visiting jenkins and getting 503 service unavailable errors. i'm on it and will report back with an all-clear.
Re: jenkins httpd being flaky
i tried a couple of things, but will also be doing a jenkins reboot as soon as the current batch of builds finish. On Fri, Mar 13, 2015 at 12:40 PM, shane knapp skn...@berkeley.edu wrote: ok we have a few different things happening: 1) httpd on the jenkins master is randomly (though not currently) flaking out and causing visits to the site to return a 503. nothing in the logs shows any problems. 2) there are some github timeouts, which i tracked down and think it's a problem with github themselves (see: https://status.github.com/ and scroll down to 'mean hook delivery time') 3) we have one spark job w/a strange ivy lock issue, that i just retriggered (https://github.com/apache/spark/pull/4964) 4) there's an errant, unkillable pull request builder job ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28574/console ) more updates forthcoming. On Fri, Mar 13, 2015 at 12:04 PM, shane knapp skn...@berkeley.edu wrote: we just started having issues when visiting jenkins and getting 503 service unavailable errors. i'm on it and will report back with an all-clear.
Re: PR Builder timing out due to ivy cache lock
i'm thinking that this was something transient, and hopefully won't happen again. a ton of weird stuff happened around the time of this failure (see my flaky httpd email), and this was the only build exhibiting this behavior. i'll keep an eye out for this failure over the weekend... On Fri, Mar 13, 2015 at 12:03 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Here you are: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28571/consoleFull On Fri, Mar 13, 2015 at 11:58 AM, shane knapp skn...@berkeley.edu wrote: link to a build, please? On Fri, Mar 13, 2015 at 11:53 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: Looks like something is causing the PR Builder to timeout since this morning with the ivy cache being locked. Any idea what is happening?
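If the ivy cache lock failure does recur, a common manual remedy is to clear stale lock files from the cache once no build is using it. This is a general sketch, not the procedure used on the amplab workers; ~/.ivy2 is ivy's default cache location, and jenkins jobs may point ivy elsewhere:

```shell
# list leftover ivy lock files in the default cache location.
IVY_CACHE="${HOME}/.ivy2"   # ivy's default; jenkins workers may use another path
find "$IVY_CACHE" \( -name '*.lock' -o -name '*.lk' \) -print 2>/dev/null
# to actually remove them (only when no sbt/maven process is running
# against the cache):
#   find "$IVY_CACHE" \( -name '*.lock' -o -name '*.lk' \) -delete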
jenkins upgraded to 1.606....
...due to some big security fixes: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-03-23 :) shane
short jenkins 7am downtime tomorrow morning (3-5-15)
the master and workers need some system and package updates, and i'll also be rebooting the machines. this shouldn't take very long to perform, and i expect jenkins to be back up and building by 9am at the *latest*. important note: i will NOT be updating jenkins or any of the plugins during this maintenance! as always, please let me know if you have any questions or concerns. danke shane
[jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7
good morning, developers! TL;DR: i will be installing anaconda and setting it in the system PATH so that your python will default to 2.7, as well as having it take over management of all of the sci-py packages. this is potentially a big change, so i'll be testing locally on my staging instance before deployment to the wide world. deployment is *tentatively* next monday, march 2nd. a little background: the jenkins test infra is currently (and happily) managed by a set of tools that allow me to set up and deploy new workers, manage their packages and make sure that all spark and research projects can happily and successfully build. we're currently at the state where ~50 or so packages are installed and configured on each worker. this is getting a little cumbersome, as the package-to-build dep tree is getting pretty large. the biggest offender is the science-based python infrastructure. everything is blindly installed w/yum and pip, so it's hard to control *exactly* what version of any given library is installed as compared to what's on a dev's laptop. the solution: anaconda (https://store.continuum.io/cshop/anaconda/)! everything is centralized! i can manage specific versions much easier! what this means to you: * python 2.7 will be the default system python. * 2.6 will still be installed and available (/usr/bin/python or /usr/bin/python2.6) what you need to do: * install anaconda, have it update your PATH * build locally and try to fix any bugs (for spark, this should just work) * if you have problems, reach out to me and i'll see what i can do to help. if we can't get your stuff running under python2.7, we can default to 2.6 via a job config change. what i will be doing: * setting up anaconda on my staging instance and spot-testing a lot of builds before deployment please let me know if there are any issues/concerns... i'll be posting updates this week and will let everyone know if there are any changes to the Plan[tm]. your friendly devops engineer, shane
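The mechanics of the switch described above come down to PATH precedence: whichever python appears first in PATH wins. A minimal sketch, assuming an install prefix of ~/anaconda (the anaconda installer writes a similar export into ~/.bashrc when you let it update your PATH):

```shell
# put anaconda's bin directory first in PATH so its python 2.7 shadows the
# system /usr/bin/python (2.6). the install prefix is an assumption.
ANACONDA_HOME="$HOME/anaconda"
export PATH="$ANACONDA_HOME/bin:$PATH"
echo "$PATH" | cut -d: -f1   # first PATH entry is now the anaconda bin dir
```

With anaconda installed there, `python` resolves to 2.7 while /usr/bin/python2.6 stays available for jobs that need it.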
Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7
On Mon, Feb 23, 2015 at 11:36 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: The first concern for Spark will probably be to ensure that we still build and test against Python 2.6, since that's the minimum version of Python we support. sounds good... we can set up separate 2.6 builds on specific versions... this could allow you to easily differentiate between baseline and latest and greatest if you wanted. it'll have a little bit more administrative overhead, due to more jobs needing configs, but offers more flexibility. let me know what you think. Otherwise this seems OK. We use numpy and other Python packages in PySpark, but I don't think we're pinned to any particular version of those packages. cool. i'll start mucking about and let you guys know how it goes. shane
Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7
i'm going to punt on this until after the next spark 1.3 release (2-3 weeks?). since i'll be installing a bunch of other packages (including mongodb), i'd rather wait and be safe. :) the full install list is forthcoming, and i'll update the spark infra wiki w/what's installed on the workers. shane On Mon, Feb 23, 2015 at 11:13 AM, shane knapp skn...@berkeley.edu wrote: good morning, developers! TL;DR: i will be installing anaconda and setting it in the system PATH so that your python will default to 2.7, as well as it taking over management of all of the sci-py packages. this is potentially a big change, so i'll be testing locally on my staging instance before deployment to the wide world. deployment is *tentatively* next monday, march 2nd. a little background: the jenkins test infra is currently (and happily) managed by a set of tools that allow me to set up and deploy new workers, manage their packages and make sure that all spark and research projects can happily and successfully build. we're currently at the state where ~50 or so packages are installed and configured on each worker. this is getting a little cumbersome, as the package-to-build dep tree is getting pretty large. the biggest offender is the science-based python infrastructure. everything is blindly installed w/yum and pip, so it's hard to control *exactly* what version of any given library is as compared to what's on a dev's laptop. the solution: anaconda (https://store.continuum.io/cshop/anaconda/)! everything is centralized! i can manage specific versions much easier! what this means to you: * python 2.7 will be the default system python. * 2.6 will still be installed and available (/usr/bin/python or /usr/bin/python2.6) what you need to do: * install anaconda, have it update your PATH * build locally and try to fix any bugs (for spark, this should just work) * if you have problems, reach out to me and i'll see what i can do to help. if we can't get your stuff running under python2.7, we can default to 2.6 via a job config change. what i will be doing: * setting up anaconda on my staging instance and spot-testing a lot of builds before deployment please let me know if there are any issues/concerns... i'll be posting updates this week and will let everyone know if there are any changes to the Plan[tm]. your friendly devops engineer, shane
Re: [ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6
it's not downgraded, it's your /etc/alternatives setup that's causing this. you can update all of those entries by executing the following commands (as root):
update-alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1
update-alternatives --install /usr/bin/javah javah /usr/java/latest/bin/javah 1
update-alternatives --install /usr/bin/javac javac /usr/java/latest/bin/javac 1
update-alternatives --install /usr/bin/jar jar /usr/java/latest/bin/jar 1
(i have the latest jdk installed in /usr/java/ with a /usr/java/latest/ symlink pointing to said jdk's dir) On Tue, Feb 24, 2015 at 3:32 PM, Mike Hynes 91m...@gmail.com wrote: I don't see any version flag for /usr/bin/jar, but I think I see the problem now; the openjdk version is 7, but javac -version gives 1.6.0_34; so spark was compiled with java 6 despite the system using jre 1.7. Thanks for the sanity check! Now I just need to find out why javac is downgraded on the system.. On 2/24/15, Sean Owen so...@cloudera.com wrote: So you mean that the script is checking for this error, and takes it as a sign that you compiled with java 6. Your command seems to confirm that reading the assembly jar does fail on your system though. What version does the jar command show? are you sure you don't have JRE 7 but JDK 6 installed?
On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes 91m...@gmail.com wrote: ./bin/compute-classpath.sh fails with error: > jar -tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar nonexistent/class/path java.util.zip.ZipException: invalid CEN header (bad signature) at java.util.zip.ZipFile.open(Native Method) at java.util.zip.ZipFile.<init>(ZipFile.java:132) at java.util.zip.ZipFile.<init>(ZipFile.java:93) at sun.tools.jar.Main.list(Main.java:997) at sun.tools.jar.Main.run(Main.java:242) at sun.tools.jar.Main.main(Main.java:1167) However, I both compiled the distribution and am running spark with Java 1.7; $ java -version java version 1.7.0_75 OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) on a system running Ubuntu: $ uname -srpov Linux 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 GNU/Linux $ uname -srpo Linux 3.13.0-44-generic x86_64 GNU/Linux This problem was reproduced on Arch Linux: $ uname -srpo Linux 3.18.5-1-ARCH x86_64 GNU/Linux with $ java -version java version 1.7.0_75 OpenJDK Runtime Environment (IcedTea 2.5.4) (Arch Linux build 7.u75_2.5.4-1-x86_64) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) In both of these cases, the problem is not the java versioning; neither system even has a java 6 installation. This seems like a false positive to me in compute-classpath.sh. When I comment out the relevant lines in compute-classpath.sh, the scripts start-{master,slaves,...}.sh all run fine, and I have no problem launching applications. Could someone please offer some insight into this issue? Thanks, Mike
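The diagnosis in this thread was mismatched JDK tools: a java 6 `jar` binary failing to read a large assembly jar even though `java` was 1.7. A quick hedged sketch for checking that all three tools resolve to the same JDK (output paths are system-dependent, so none are shown):

```shell
# print which binary each JDK tool resolves to; if the paths point at
# different JDK installs, fix the /etc/alternatives entries as above.
for tool in java javac jar; do
  printf '%s -> %s\n' "$tool" "$(command -v "$tool" || echo 'not found')"
done
```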
Jenkins down
jenkins is currently unreachable. i'm not entirely sure why, as i can't ssh in to the box and see what's going on. i've filed a ticket and will let everyone know when i have more information. shane
Re: Jenkins down
looks like we had a power failure on campus, and our datacenter is working to bring things back up: http://systemstatus.berkeley.edu/ On Fri, Apr 24, 2015 at 11:24 AM, shane knapp skn...@berkeley.edu wrote: jenkins is currently unreachable. i'm not entirely sure why, as i can't ssh in to the box and see what's going on. i've filed a ticket and will let everyone know when i have more information. shane
Re: Jenkins down
thanks everyone! happy friday! :) On Fri, Apr 24, 2015 at 3:37 PM, York, Brennon brennon.y...@capitalone.com wrote: Ditto to Reynold. Thanks a bunch for all the updates and work Shane! On 4/24/15, 3:25 PM, Reynold Xin r...@databricks.com wrote: Thanks for looking into this, Shane. On Fri, Apr 24, 2015 at 3:18 PM, shane knapp skn...@berkeley.edu wrote: ok, jenkins is back up and building. we have a few things to mop up here (ganglia is sad), but i think we'll be good for the afternoon. shane On Fri, Apr 24, 2015 at 2:17 PM, shane knapp skn...@berkeley.edu wrote: ok, power has been restored and jenkins is back up. we might be taking things down again to fix up some power mis-cabling (jon and i are in the colo, and the jenkins master wasn't on the UPS and needs to be). more updates as they come. sorry for the inconvenience. On Fri, Apr 24, 2015 at 11:33 AM, shane knapp skn...@berkeley.edu wrote: looks like we had a power failure on campus, and our datacenter is working to bring things back up: http://systemstatus.berkeley.edu/ On Fri, Apr 24, 2015 at 11:24 AM, shane knapp skn...@berkeley.edu wrote: jenkins is currently unreachable. i'm not entirely sure why, as i can't ssh in to the box and see what's going on. i've filed a ticket and will let everyone know when i have more information. shane
Re: Jenkins down
ok, power has been restored and jenkins is back up. we might be taking things down again to fix up some power mis-cabling (jon and i are in the colo, and the jenkins master wasn't on the UPS and needs to be). more updates as they come. sorry for the inconvenience. On Fri, Apr 24, 2015 at 11:33 AM, shane knapp skn...@berkeley.edu wrote: looks like we had a power failure on campus, and our datacenter is working to bring things back up: http://systemstatus.berkeley.edu/ On Fri, Apr 24, 2015 at 11:24 AM, shane knapp skn...@berkeley.edu wrote: jenkins is currently unreachable. i'm not entirely sure why, as i can't ssh in to the box and see what's going on. i've filed a ticket and will let everyone know when i have more information. shane
Re: [discuss] ending support for Java 6?
something to keep in mind: we can easily support java 6 for the build environment, particularly if there's a definite EOL. i'd like to fix our java versioning 'problem', and this could be a big instigator... right now we're hackily setting java_home in test invocation on jenkins, which really isn't the best. if i decide, within jenkins, to reconfigure every build to 'do the right thing' WRT java version, then i will clean up the old mess and pay down on some technical debt. or i can just install java 6 and we use that as JAVA_HOME on a build-by-build basis. this will be a few days of prep and another morning-long downtime if i do the right thing (within jenkins), and only a couple of hours the hacky way (system level). either way, we can test on java 6. :) On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote: nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months. On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote: Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue. On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote: i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless you intention is to truly make something academic i dont think that is wise. On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that’s for another thread.) 
(To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/) On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
...and now the workers all have java6 installed. https://issues.apache.org/jira/browse/SPARK-1437 sadly, the built-in jenkins jdk management doesn't allow us to choose a JDK version within matrix projects... so we need to manage this stuff manually. On Sun, May 3, 2015 at 8:57 AM, shane knapp skn...@berkeley.edu wrote: that bug predates my time at the amplab... :) anyways, just to restate: jenkins currently only builds w/java 7. if you folks need 6, i can make it happen, but it will be a (smallish) bit of work. shane On Sun, May 3, 2015 at 2:14 AM, Sean Owen so...@cloudera.com wrote: Should be, but isn't what Jenkins does. https://issues.apache.org/jira/browse/SPARK-1437 At this point it might be simpler to just decide that 1.5 will require Java 7 and then the Jenkins setup is correct. (NB: you can also solve this by setting bootclasspath to JDK 6 libs even when using javac 7+ but I think this is overly complicated.) On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi Shane, Since we are still maintaining support for jdk6, jenkins should be using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher api which breaks source level compat. -source and -target is insufficient to ensure api usage is conformant with the minimum jdk version we are supporting. Regards, Mridul [1] Not jdk7 as you mentioned On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu wrote: that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for thomas and have a second jenkins instance... ;)
Re: [discuss] ending support for Java 6?
sgtm On Mon, May 4, 2015 at 11:23 AM, Patrick Wendell pwend...@gmail.com wrote: If we just set JAVA_HOME in dev/run-test-jenkins, I think it should work. On Mon, May 4, 2015 at 7:20 PM, shane knapp skn...@berkeley.edu wrote: ...and now the workers all have java6 installed. https://issues.apache.org/jira/browse/SPARK-1437 sadly, the built-in jenkins jdk management doesn't allow us to choose a JDK version within matrix projects... so we need to manage this stuff manually. On Sun, May 3, 2015 at 8:57 AM, shane knapp skn...@berkeley.edu wrote: that bug predates my time at the amplab... :) anyways, just to restate: jenkins currently only builds w/java 7. if you folks need 6, i can make it happen, but it will be a (smallish) bit of work. shane On Sun, May 3, 2015 at 2:14 AM, Sean Owen so...@cloudera.com wrote: Should be, but isn't what Jenkins does. https://issues.apache.org/jira/browse/SPARK-1437 At this point it might be simpler to just decide that 1.5 will require Java 7 and then the Jenkins setup is correct. (NB: you can also solve this by setting bootclasspath to JDK 6 libs even when using javac 7+ but I think this is overly complicated.) On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi Shane, Since we are still maintaining support for jdk6, jenkins should be using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher api which breaks source level compat. -source and -target is insufficient to ensure api usage is conformant with the minimum jdk version we are supporting. Regards, Mridul [1] Not jdk7 as you mentioned On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu wrote: that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for thomas and have a second jenkins instance... ;)
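The per-build approach Patrick describes can be sketched in a few lines at the top of the jenkins test script. The JDK path below is an assumption for illustration, not the actual install path on the amplab workers:

```shell
# pin this build to a specific JDK by exporting JAVA_HOME before invoking
# the tests; only jobs whose scripts do this are affected, so other builds
# keep the default java 7.
export JAVA_HOME=/usr/java/jdk1.6.0_45   # assumed java 6 install path
export PATH="$JAVA_HOME/bin:$PATH"
echo "building with JAVA_HOME=$JAVA_HOME"
```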
Re: github pull request builder FAIL, now WIN(-ish)
sure, i'll kill all of the current spark prb builds... On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin r...@databricks.com wrote: Shane - can we purge all the outstanding builds so we are not running stuff against stale PRs? On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: And unfortunately, many Jenkins executor slots are being taken by stale Spark PRs... On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote: anyways, the build queue is SLAMMED... we're going to need at least a day to catch up w/this. i'll be keeping an eye on system loads and whatnot all day today. whee! On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu wrote: somehow, the power outage on friday caused the pull request builder to lose its config entirely... i'm not sure why, but after i added the oauth token back, we're now catching up on the weekend's pull request builds. have i mentioned how much i hate this plugin? ;) sorry for the inconvenience... shane
Re: github pull request builder FAIL, now WIN(-ish)
never mind, looks like you guys are already on it. :) On Mon, Apr 27, 2015 at 11:35 AM, shane knapp skn...@berkeley.edu wrote: sure, i'll kill all of the current spark prb builds... On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin r...@databricks.com wrote: Shane - can we purge all the outstanding builds so we are not running stuff against stale PRs? On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: And unfortunately, many Jenkins executor slots are being taken by stale Spark PRs... On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote: anyways, the build queue is SLAMMED... we're going to need at least a day to catch up w/this. i'll be keeping an eye on system loads and whatnot all day today. whee! On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu wrote: somehow, the power outage on friday caused the pull request builder to lose its config entirely... i'm not sure why, but after i added the oauth token back, we're now catching up on the weekend's pull request builds. have i mentioned how much i hate this plugin? ;) sorry for the inconvenience... shane
github pull request builder FAIL, now WIN(-ish)
somehow, the power outage on friday caused the pull request builder to lose its config entirely... i'm not sure why, but after i added the oauth token back, we're now catching up on the weekend's pull request builds. have i mentioned how much i hate this plugin? ;) sorry for the inconvenience... shane
Re: github pull request builder FAIL, now WIN(-ish)
anyways, the build queue is SLAMMED... we're going to need at least a day to catch up w/this. i'll be keeping an eye on system loads and whatnot all day today. whee! On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu wrote: somehow, the power outage on friday caused the pull request builder to lose its config entirely... i'm not sure why, but after i added the oauth token back, we're now catching up on the weekend's pull request builds. have i mentioned how much i hate this plugin? ;) sorry for the inconvenience... shane