Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-28 Thread shane knapp
jenkins is now coming down.


On Thu, Aug 28, 2014 at 7:19 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  this is starting in 10 minutes


 On Wed, Aug 27, 2014 at 4:13 PM, shane knapp skn...@berkeley.edu wrote:

 tomorrow morning i will be upgrading jenkins to the latest/greatest
 (1.577).

 at 730am, i will put jenkins in to a quiet period, so no new builds will
 be accepted.  once any running builds are finished, i will be taking
 jenkins down for the upgrade.

 depending on what and how many jobs are running, i'm expecting this to
 take, at most, an hour.

 i'll send out an update tomorrow morning right before i begin, and will
 send out updates and an all-clear once we're up and running again.

 1.577 release notes:
 http://jenkins-ci.org/changelog

 please let me know if there are any questions/concerns.  thanks in
 advance!

 shane
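
for reference, jenkins' quiet mode can be toggled over http as well as from
the manage-jenkins UI.  a rough sketch, assuming an admin user/API token and
a $JENKINS_URL pointing at the master (the variable names here are
placeholders, not our actual setup, and depending on the security config you
may also need a CSRF crumb):

  # stop accepting new builds; anything already running finishes normally
  curl -X POST -u "$ADMIN_USER:$API_TOKEN" "$JENKINS_URL/quietDown"

  # once the upgrade is done, start accepting builds again
  curl -X POST -u "$ADMIN_USER:$API_TOKEN" "$JENKINS_URL/cancelQuietDown"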





Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-28 Thread shane knapp
jenkins is upgraded, but a few jobs sneaked in before i could do the plugin
updates.  i've put jenkins in quiet mode again, and once the spark builds
finish, i'll restart jenkins to enable the plugin updates and we'll be good
to go.

let's all take a moment to bask in the glory of the shiny new UI!  :)


On Thu, Aug 28, 2014 at 7:46 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is now coming down.


 On Thu, Aug 28, 2014 at 7:19 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  this is starting in 10 minutes


 On Wed, Aug 27, 2014 at 4:13 PM, shane knapp skn...@berkeley.edu wrote:

 tomorrow morning i will be upgrading jenkins to the latest/greatest
 (1.577).

 at 730am, i will put jenkins in to a quiet period, so no new builds will
 be accepted.  once any running builds are finished, i will be taking
 jenkins down for the upgrade.

 depending on what and how many jobs are running, i'm expecting this to
 take, at most, an hour.

 i'll send out an update tomorrow morning right before i begin, and will
 send out updates and an all-clear once we're up and running again.

 1.577 release notes:
 http://jenkins-ci.org/changelog

 please let me know if there are any questions/concerns.  thanks in
 advance!

 shane






Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-28 Thread shane knapp
this one job is blocking the jenkins restart:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19406/

i'm about to kill it so that i can get this done.  i'll restart the job
after jenkins is back up.


On Thu, Aug 28, 2014 at 7:51 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is upgraded, but a few jobs sneaked in before i could do the
 plugin updates.  i've put jenkins in quiet mode again, and once the spark
 builds finish, i'll restart jenkins to enable the plugin updates and we'll
 be good to go.

 let's all take a moment to bask in the glory of the shiny new UI!  :)


 On Thu, Aug 28, 2014 at 7:46 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is now coming down.


 On Thu, Aug 28, 2014 at 7:19 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  this is starting in 10 minutes


 On Wed, Aug 27, 2014 at 4:13 PM, shane knapp skn...@berkeley.edu
 wrote:

 tomorrow morning i will be upgrading jenkins to the latest/greatest
 (1.577).

 at 730am, i will put jenkins in to a quiet period, so no new builds
 will be accepted.  once any running builds are finished, i will be taking
 jenkins down for the upgrade.

 depending on what and how many jobs are running, i'm expecting this to
 take, at most, an hour.

 i'll send out an update tomorrow morning right before i begin, and will
 send out updates and an all-clear once we're up and running again.

 1.577 release notes:
 http://jenkins-ci.org/changelog

 please let me know if there are any questions/concerns.  thanks in
 advance!

 shane







Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-28 Thread shane knapp
all clear:  jenkins and all plugins have been updated!


On Thu, Aug 28, 2014 at 7:51 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is upgraded, but a few jobs sneaked in before i could do the
 plugin updates.  i've put jenkins in quiet mode again, and once the spark
 builds finish, i'll restart jenkins to enable the plugin updates and we'll
 be good to go.

 let's all take a moment to bask in the glory of the shiny new UI!  :)


 On Thu, Aug 28, 2014 at 7:46 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is now coming down.


 On Thu, Aug 28, 2014 at 7:19 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  this is starting in 10 minutes


 On Wed, Aug 27, 2014 at 4:13 PM, shane knapp skn...@berkeley.edu
 wrote:

 tomorrow morning i will be upgrading jenkins to the latest/greatest
 (1.577).

 at 730am, i will put jenkins in to a quiet period, so no new builds
 will be accepted.  once any running builds are finished, i will be taking
 jenkins down for the upgrade.

 depending on what and how many jobs are running, i'm expecting this to
 take, at most, an hour.

 i'll send out an update tomorrow morning right before i begin, and will
 send out updates and an all-clear once we're up and running again.

 1.577 release notes:
 http://jenkins-ci.org/changelog

 please let me know if there are any questions/concerns.  thanks in
 advance!

 shane







Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-28 Thread shane knapp
no problem!

also, i retriggered:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19406
it's currently:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19411


On Thu, Aug 28, 2014 at 9:46 AM, Reynold Xin r...@databricks.com wrote:

 Thanks for doing this, Shane.


 On Thursday, August 28, 2014, shane knapp skn...@berkeley.edu wrote:

 all clear:  jenkins and all plugins have been updated!


 On Thu, Aug 28, 2014 at 7:51 AM, shane knapp skn...@berkeley.edu wrote:

  jenkins is upgraded, but a few jobs sneaked in before i could do the
  plugin updates.  i've put jenkins in quiet mode again, and once the
 spark
  builds finish, i'll restart jenkins to enable the plugin updates and
 we'll
  be good to go.
 
  let's all take a moment to bask in the glory of the shiny new UI!  :)
 
 
  On Thu, Aug 28, 2014 at 7:46 AM, shane knapp skn...@berkeley.edu
 wrote:
 
  jenkins is now coming down.
 
 
  On Thu, Aug 28, 2014 at 7:19 AM, shane knapp skn...@berkeley.edu
 wrote:
 
  reminder:  this is starting in 10 minutes
 
 
  On Wed, Aug 27, 2014 at 4:13 PM, shane knapp skn...@berkeley.edu
  wrote:
 
  tomorrow morning i will be upgrading jenkins to the latest/greatest
  (1.577).
 
  at 730am, i will put jenkins in to a quiet period, so no new builds
  will be accepted.  once any running builds are finished, i will be
 taking
  jenkins down for the upgrade.
 
  depending on what and how many jobs are running, i'm expecting this
 to
  take, at most, an hour.
 
  i'll send out an update tomorrow morning right before i begin, and
 will
  send out updates and an all-clear once we're up and running again.
 
  1.577 release notes:
  http://jenkins-ci.org/changelog
 
  please let me know if there are any questions/concerns.  thanks in
  advance!
 
  shane
 
 
 
 
 




emergency jenkins restart, aug 29th, 730am-9am PDT -- plus a postmortem

2014-08-28 Thread shane knapp
as with all software upgrades, sometimes things don't work as expected.

a recent change to stapler[1], to verbosely
report NotExportableExceptions[2], is spamming our jenkins log file with
stack traces, and the log file is growing rather quickly (1.2G since 9am).
this has been reported to the jenkins jira[3], and a fix has been pushed and
will be rolled out soon[4].

this isn't affecting any builds, and jenkins is happily humming along.

in the interim, so that we don't run out of disk space, i will be
redirecting the jenkins logs tomorrow morning to /dev/null for the long
weekend.

once a real fix has been released, i will update any packages needed and
redirect the logging back to the log file.
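
for the curious, the interim redirect is nothing fancy -- roughly something
like the following, assuming a stock red hat-style install where the init
script reads /etc/sysconfig/jenkins (paths and commands here are the distro
defaults, not necessarily exactly what i'll run):

  # temporary: send the log spam to /dev/null instead of the log file
  sed -i 's|^JENKINS_LOG=.*|JENKINS_LOG="/dev/null"|' /etc/sysconfig/jenkins
  service jenkins restart

  # once the stapler fix lands: point JENKINS_LOG back at
  # /var/log/jenkins/jenkins.log and restart again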

other than a short downtime, this will have no user-facing impact.

please let me know if you have any questions/concerns.

thanks for your patience!

shane the new guy  :)

[1] -- https://wiki.jenkins-ci.org/display/JENKINS/Architecture
[2] --
https://github.com/stapler/stapler/commit/ed2cb8b04c1514377f3a8bfbd567f050a67c6e1c
[3] --
https://issues.jenkins-ci.org/browse/JENKINS-24458?focusedCommentId=209247
[4] --
https://github.com/stapler/stapler/commit/e2b39098ca1f61a58970b8a41a3ae79053cf30e3


Re: emergency jenkins restart, aug 29th, 730am-9am PDT -- plus a postmortem

2014-08-29 Thread shane knapp
reminder:   this is happening right now.  jenkins is currently in quiet
mode, and in ~30 minutes, will be briefly going down.


On Thu, Aug 28, 2014 at 1:03 PM, shane knapp skn...@berkeley.edu wrote:

 as with all software upgrades, sometimes things don't always work as
 expected.

 a recent change to stapler[1], to verbosely
 report NotExportableExceptions[2] is spamming our jenkins log file with
 stack traces, which is growing rather quickly (1.2G since 9am).  this has
 been reported to the jenkins jira[3], and a fix has been pushed and will be
 rolled out soon[4].

 this isn't affecting any builds, and jenkins is happily humming along.

 in the interim, so that we don't run out of disk space, i will be
 redirecting the jenkins logs tomorrow morning to /dev/null for the long
 weekend.

 once a real fix has been released, i will update any packages needed and
 redirect the logging back to the log file.

 other than a short downtime, this will have no user-facing impact.

 please let me know if you have any questions/concerns.

 thanks for your patience!

 shane the new guy  :)

 [1] -- https://wiki.jenkins-ci.org/display/JENKINS/Architecture
 [2] --
 https://github.com/stapler/stapler/commit/ed2cb8b04c1514377f3a8bfbd567f050a67c6e1c
 [3] --
 https://issues.jenkins-ci.org/browse/JENKINS-24458?focusedCommentId=209247
 [4] --
 https://github.com/stapler/stapler/commit/e2b39098ca1f61a58970b8a41a3ae79053cf30e3



Re: emergency jenkins restart, aug 29th, 730am-9am PDT -- plus a postmortem

2014-08-29 Thread shane knapp
this is done.


On Fri, Aug 29, 2014 at 7:32 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:   this is happening right now.  jenkins is currently in quiet
 mode, and in ~30 minutes, will be briefly going down.


 On Thu, Aug 28, 2014 at 1:03 PM, shane knapp skn...@berkeley.edu wrote:

 as with all software upgrades, sometimes things don't always work as
 expected.

 a recent change to stapler[1], to verbosely
 report NotExportableExceptions[2] is spamming our jenkins log file with
 stack traces, which is growing rather quickly (1.2G since 9am).  this has
 been reported to the jenkins jira[3], and a fix has been pushed and will be
 rolled out soon[4].

 this isn't affecting any builds, and jenkins is happily humming along.

 in the interim, so that we don't run out of disk space, i will be
 redirecting the jenkins logs tomorrow morning to /dev/null for the long
 weekend.

 once a real fix has been released, i will update any packages needed and
 redirect the logging back to the log file.

 other than a short downtime, this will have no user-facing impact.

 please let me know if you have any questions/concerns.

 thanks for your patience!

 shane the new guy  :)

 [1] -- https://wiki.jenkins-ci.org/display/JENKINS/Architecture
 [2] --
 https://github.com/stapler/stapler/commit/ed2cb8b04c1514377f3a8bfbd567f050a67c6e1c
 [3] --
 https://issues.jenkins-ci.org/browse/JENKINS-24458?focusedCommentId=209247
 [4] --
 https://github.com/stapler/stapler/commit/e2b39098ca1f61a58970b8a41a3ae79053cf30e3





new jenkins plugin installed and ready for use

2014-08-29 Thread shane knapp
i have always found the 'Rebuild' plugin super useful:
https://wiki.jenkins-ci.org/display/JENKINS/Rebuild+Plugin

this is installed and enabled.  enjoy!

shane


hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread shane knapp
so, i had a meeting w/the databricks guys on friday and they recommended i
send an email out to the list to say 'hi' and give you guys a quick intro.
 :)

hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
time getting the jenkins build infrastructure up to production quality.
 much of this will be 'under the covers' work, like better system level
auth, backups, etc, but some will definitely be user facing:  timely
jenkins updates, debugging broken build infrastructure and some plugin
support.

i've been working in the bay area now since 1997 at many different
companies, and my last 10 years has been split between google and palantir.
 i'm a huge proponent of OSS, and am really happy to be able to help with
the work you guys are doing!

if anyone has any requests/questions/comments, feel free to drop me a line!

shane


Re: quick jenkins restart

2014-09-02 Thread shane knapp
and we're back and building!


On Tue, Sep 2, 2014 at 5:07 PM, shane knapp skn...@berkeley.edu wrote:

 since our queue is really short, i'm waiting for a couple of builds to
 finish and will be restarting jenkins to install/update some plugins.  the
 github pull request builder looks like it has some fixes to reduce spammy
 github calls, and reduce any potential rate limiting.

 i'll let everyone know when it's back up...  this should be super quick
 (~15 mins for tests to finish, ~2 mins for jenkins to restart).

 thanks in advance!

 shane



amplab jenkins is down

2014-09-04 Thread shane knapp
i am trying to get things up and running, but it looks like either the
firewall gateway or jenkins server itself is down.  i'll update as soon as
i know more.


Re: amplab jenkins is down

2014-09-04 Thread shane knapp
looks like a power outage in soda hall.  more updates as they happen.


On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote:

 i am trying to get things up and running, but it looks like either the
 firewall gateway or jenkins server itself is down.  i'll update as soon as
 i know more.



Re: amplab jenkins is down

2014-09-04 Thread shane knapp
looks like some hardware failed, and we're swapping in a replacement.  i
don't have more specific information yet -- including *what* failed, as our
sysadmin is super busy ATM.  the root cause was an incorrect circuit being
switched off during building maintenance.

on a side note, this incident will be accelerating our plan to move the
entire jenkins infrastructure in to a managed datacenter environment.  this
will be our major push over the next couple of weeks.  more details about
this, also, as soon as i get them.

i'm very sorry about the downtime, we'll get everything up and running ASAP.


On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote:

 looks like a power outage in soda hall.  more updates as they happen.


 On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote:

 i am trying to get things up and running, but it looks like either the
 firewall gateway or jenkins server itself is down.  i'll update as soon as
 i know more.





Re: amplab jenkins is down

2014-09-04 Thread shane knapp
it's a faulty power switch on the firewall, which has been swapped out.
 we're about to reboot and be good to go.


On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote:

 looks like some hardware failed, and we're swapping in a replacement.  i
 don't have more specific information yet -- including *what* failed, as our
 sysadmin is super busy ATM.  the root cause was an incorrect circuit being
 switched off during building maintenance.

 on a side note, this incident will be accelerating our plan to move the
 entire jenkins infrastructure in to a managed datacenter environment.  this
 will be our major push over the next couple of weeks.  more details about
 this, also, as soon as i get them.

 i'm very sorry about the downtime, we'll get everything up and running
 ASAP.


 On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote:

 looks like a power outage in soda hall.  more updates as they happen.


 On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote:

 i am trying to get things up and running, but it looks like either the
 firewall gateway or jenkins server itself is down.  i'll update as soon as
 i know more.






Re: amplab jenkins is down

2014-09-04 Thread shane knapp
AND WE'RE UP!

sorry that this took so long...  i'll send out a more detailed explanation
of what happened soon.

now, off to back up jenkins.

shane


On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu wrote:

 it's a faulty power switch on the firewall, which has been swapped out.
  we're about to reboot and be good to go.


 On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote:

 looks like some hardware failed, and we're swapping in a replacement.  i
 don't have more specific information yet -- including *what* failed, as our
 sysadmin is super busy ATM.  the root cause was an incorrect circuit being
 switched off during building maintenance.

 on a side note, this incident will be accelerating our plan to move the
 entire jenkins infrastructure in to a managed datacenter environment.  this
 will be our major push over the next couple of weeks.  more details about
 this, also, as soon as i get them.

 i'm very sorry about the downtime, we'll get everything up and running
 ASAP.


 On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote:

 looks like a power outage in soda hall.  more updates as they happen.


 On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu
 wrote:

 i am trying to get things up and running, but it looks like either the
 firewall gateway or jenkins server itself is down.  i'll update as soon as
 i know more.







Re: amplab jenkins is down

2014-09-04 Thread shane knapp
looking


On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 It appears that our main man is having trouble
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
  hearing new requests
 https://github.com/apache/spark/pull/2277#issuecomment-54549106.

 Do we need some smelling salts?


 On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so any
 new
 jobs wouldn't have reached it.  any jobs that were queued when power was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up? Or do
 we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been swapped
 out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like some hardware failed, and we're swapping in a
 replacement.
  i
   don't have more specific information yet -- including *what* failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan to move
 the
   entire jenkins infrastructure in to a managed datacenter
 environment.
  this
   will be our major push over the next couple of weeks.  more details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up and
 running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like a power outage in soda hall.  more updates as they
 happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu
   wrote:
  
   i am trying to get things up and running, but it looks like either
  the
   firewall gateway or jenkins server itself is down.  i'll update as
  soon as
   i know more.
  
  
  
  
  
 
 
 





Re: amplab jenkins is down

2014-09-04 Thread shane knapp
i'm going to restart jenkins and see if that fixes things.


On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu wrote:

 looking


 On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It appears that our main man is having trouble
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
  hearing new requests
 https://github.com/apache/spark/pull/2277#issuecomment-54549106.

 Do we need some smelling salts?


 On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so any
 new
 jobs wouldn't have reached it.  any jobs that were queued when power was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up? Or do
 we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been swapped
 out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like some hardware failed, and we're swapping in a
 replacement.
  i
   don't have more specific information yet -- including *what*
 failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect
 circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan to
 move the
   entire jenkins infrastructure in to a managed datacenter
 environment.
  this
   will be our major push over the next couple of weeks.  more details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up and
 running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like a power outage in soda hall.  more updates as they
 happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu
 
   wrote:
  
   i am trying to get things up and running, but it looks like
 either
  the
   firewall gateway or jenkins server itself is down.  i'll update
 as
  soon as
   i know more.
  
  
  
  
  
 
 
 






Re: amplab jenkins is down

2014-09-04 Thread shane knapp
yep.  that's exactly the behavior i saw earlier, and i'll be figuring it out
first thing tomorrow morning.  i bet it's an environment issue on the
slaves.


On Thu, Sep 4, 2014 at 7:10 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Looks like during the last build
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console
 Jenkins was unable to execute a git fetch?


 On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu wrote:

 i'm going to restart jenkins and see if that fixes things.


 On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu wrote:

 looking


 On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It appears that our main man is having trouble
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
  hearing new requests
 https://github.com/apache/spark/pull/2277#issuecomment-54549106.

 Do we need some smelling salts?


 On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu
 wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so
 any new
 jobs wouldn't have reached it.  any jobs that were queued when power
 was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up? Or
 do we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been
 swapped out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like some hardware failed, and we're swapping in a
 replacement.
  i
   don't have more specific information yet -- including *what*
 failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect
 circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan to
 move the
   entire jenkins infrastructure in to a managed datacenter
 environment.
  this
   will be our major push over the next couple of weeks.  more
 details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up and
 running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp 
 skn...@berkeley.edu
  wrote:
  
   looks like a power outage in soda hall.  more updates as they
 happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp 
 skn...@berkeley.edu
   wrote:
  
   i am trying to get things up and running, but it looks like
 either
  the
   firewall gateway or jenkins server itself is down.  i'll
 update as
  soon as
   i know more.
  
  
  
  
  
 
 
 








Re: amplab jenkins is down

2014-09-05 Thread shane knapp
it's looking like everything except the pull request builders are working.
 i'm going to be working on getting this resolved today.


On Fri, Sep 5, 2014 at 8:18 AM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Hmm, looks like at least some builds
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19804/consoleFull
 are working now, though this last one was from ~5 hours ago.


 On Fri, Sep 5, 2014 at 1:02 AM, shane knapp skn...@berkeley.edu wrote:

 yep.  that's exactly the behavior i saw earlier, and i'll be figuring it out
 first thing tomorrow morning.  i bet it's an environment issue on the
 slaves.


 On Thu, Sep 4, 2014 at 7:10 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Looks like during the last build
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console
 Jenkins was unable to execute a git fetch?


 On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu wrote:

 i'm going to restart jenkins and see if that fixes things.


 On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu
 wrote:

 looking


 On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It appears that our main man is having trouble
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
  hearing new requests
 https://github.com/apache/spark/pull/2277#issuecomment-54549106.

 Do we need some smelling salts?


 On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu
 wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so
 any new
 jobs wouldn't have reached it.  any jobs that were queued when power
 was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up?
 Or do we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been
 swapped out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp 
 skn...@berkeley.edu
  wrote:
  
   looks like some hardware failed, and we're swapping in a
 replacement.
  i
   don't have more specific information yet -- including *what*
 failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect
 circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan to
 move the
   entire jenkins infrastructure in to a managed datacenter
 environment.
  this
   will be our major push over the next couple of weeks.  more
 details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up and
 running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp 
 skn...@berkeley.edu
  wrote:
  
   looks like a power outage in soda hall.  more updates as they
 happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp 
 skn...@berkeley.edu
   wrote:
  
   i am trying to get things up and running, but it looks like
 either
  the
   firewall gateway or jenkins server itself is down.  i'll
 update as
  soon as
   i know more.
  
  
  
  
  
 
 
 










yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-09 Thread shane knapp
since the power incident last thursday, the github pull request builder
plugin is still not really working 100%.  i found an open issue
w/jenkins[1] that could definitely be affecting us, so i will be pausing
builds early thursday morning and then restarting jenkins.
i'll send out a reminder tomorrow, and if this causes any problems for you,
please let me know and we can work out a better time.

but, now for some good news!  yesterday morning, we racked and stacked the
systems for the new jenkins instance in the berkeley datacenter.  tomorrow
i should be able to log in to them and start getting them set up and
configured.  this is a major step in getting us in to a much more
'production' style environment!

anyways:  thanks for your patience, and i think we've all learned that hard
powering down your build system is a definite recipe for disaster.  :)

shane

[1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509


Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-10 Thread shane knapp
that's kinda what we're hoping as well.  :)

On Wed, Sep 10, 2014 at 2:46 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I'm looking forward to this. :)

 Looks like Jenkins is having trouble triggering builds for new commits or
 after user requests (e.g.
 https://github.com/apache/spark/pull/2339#issuecomment-55165937).
 Hopefully that will be resolved tomorrow.

 Nick

 On Tue, Sep 9, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote:

 since the power incident last thursday, the github pull request builder
 plugin is still not really working 100%.  i found an open issue
 w/jenkins[1] that could definitely be affecting us, i will be pausing
 builds early thursday morning and then restarting jenkins.
 i'll send out a reminder tomorrow, and if this causes any problems for
 you,
 please let me know and we can work out a better time.

 but, now for some good news!  yesterday morning, we racked and stacked the
 systems for the new jenkins instance in the berkeley datacenter.  tomorrow
 i should be able to log in to them and start getting them set up and
 configured.  this is a major step in getting us in to a much more
 'production' style environment!

 anyways:  thanks for your patience, and i think we've all learned that
 hard
 powering down your build system is a definite recipe for disaster.  :)

 shane

 [1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509





Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-11 Thread shane knapp
jenkins is now in quiet mode, and a restart is happening soon.

On Wed, Sep 10, 2014 at 3:44 PM, shane knapp skn...@berkeley.edu wrote:

 that's kinda what we're hoping as well.  :)

 On Wed, Sep 10, 2014 at 2:46 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'm looking forward to this. :)

 Looks like Jenkins is having trouble triggering builds for new commits or
 after user requests (e.g.
 https://github.com/apache/spark/pull/2339#issuecomment-55165937).
 Hopefully that will be resolved tomorrow.

 Nick

 On Tue, Sep 9, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote:

 since the power incident last thursday, the github pull request builder
 plugin is still not really working 100%.  i found an open issue
 w/jenkins[1] that could definitely be affecting us, i will be pausing
 builds early thursday morning and then restarting jenkins.
 i'll send out a reminder tomorrow, and if this causes any problems for
 you,
 please let me know and we can work out a better time.

 but, now for some good news!  yesterday morning, we racked and stacked
 the
 systems for the new jenkins instance in the berkeley datacenter.
 tomorrow
 i should be able to log in to them and start getting them set up and
 configured.  this is a major step in getting us in to a much more
 'production' style environment!

 anyways:  thanks for your patience, and i think we've all learned that
 hard
 powering down your build system is a definite recipe for disaster.  :)

 shane

 [1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509






Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-11 Thread shane knapp
...and the restart is done.

On Thu, Sep 11, 2014 at 7:38 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is now in quiet mode, and a restart is happening soon.

 On Wed, Sep 10, 2014 at 3:44 PM, shane knapp skn...@berkeley.edu wrote:

 that's kinda what we're hoping as well.  :)

 On Wed, Sep 10, 2014 at 2:46 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'm looking forward to this. :)

 Looks like Jenkins is having trouble triggering builds for new commits
 or after user requests (e.g.
 https://github.com/apache/spark/pull/2339#issuecomment-55165937).
 Hopefully that will be resolved tomorrow.

 Nick

 On Tue, Sep 9, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote:

 since the power incident last thursday, the github pull request builder
 plugin is still not really working 100%.  i found an open issue
 w/jenkins[1] that could definitely be affecting us, i will be pausing
 builds early thursday morning and then restarting jenkins.
 i'll send out a reminder tomorrow, and if this causes any problems for
 you,
 please let me know and we can work out a better time.

 but, now for some good news!  yesterday morning, we racked and stacked
 the
 systems for the new jenkins instance in the berkeley datacenter.
 tomorrow
 i should be able to log in to them and start getting them set up and
 configured.  this is a major step in getting us in to a much more
 'production' style environment!

 anyways:  thanks for your patience, and i think we've all learned that
 hard
 powering down your build system is a definite recipe for disaster.  :)

 shane

 [1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509







Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-11 Thread shane knapp
you can just click on 'rebuild', if you'd like.  what project specifically?
 (i had forgotten that i'd killed
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/557/,
which i just started a rebuild on)

On Thu, Sep 11, 2014 at 9:15 AM, Matthew Farrellee m...@redhat.com wrote:

 shane,

 is there anything we should do for pull requests that failed, but for
 unrelated issues?

 best,


 matt

 On 09/11/2014 11:29 AM, shane knapp wrote:

 ...and the restart is done.

 On Thu, Sep 11, 2014 at 7:38 AM, shane knapp skn...@berkeley.edu wrote:

  jenkins is now in quiet mode, and a restart is happening soon.

 On Wed, Sep 10, 2014 at 3:44 PM, shane knapp skn...@berkeley.edu
 wrote:

  that's kinda what we're hoping as well.  :)

 On Wed, Sep 10, 2014 at 2:46 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  I'm looking forward to this. :)

 Looks like Jenkins is having trouble triggering builds for new commits
 or after user requests (e.g.
 https://github.com/apache/spark/pull/2339#issuecomment-55165937).
 Hopefully that will be resolved tomorrow.

 Nick

 On Tue, Sep 9, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu
 wrote:

  since the power incident last thursday, the github pull request
 builder
 plugin is still not really working 100%.  i found an open issue
 w/jenkins[1] that could definitely be affecting us, i will be pausing
 builds early thursday morning and then restarting jenkins.
 i'll send out a reminder tomorrow, and if this causes any problems for
 you,
 please let me know and we can work out a better time.

 but, now for some good news!  yesterday morning, we racked and stacked
 the
 systems for the new jenkins instance in the berkeley datacenter.
 tomorrow
 i should be able to log in to them and start getting them set up and
 configured.  this is a major step in getting us in to a much more
 'production' style environment!

 anyways:  thanks for your patience, and i think we've all learned that
 hard
 powering down your build system is a definite recipe for disaster.  :)

 shane

 [1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509










FYI: jenkins systems patched to fix bash exploit

2014-09-26 Thread shane knapp
all of our systems were affected by the shellshock bug, and i've just
patched everything w/the latest fix from redhat:

https://access.redhat.com/articles/1200223

we're not running bash.x86_64 0:4.1.2-15.el6_5.2 on all of our systems.
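
if you want to double-check a box yourself, the usual quick test for the
original CVE-2014-6271 looks like this (the exact fixed package version will
vary by distro and errata level):

  # an unpatched bash prints "vulnerable" before the echo; a patched one should not
  env x='() { :;}; echo vulnerable' bash -c 'echo shellshock test'

  # confirm which bash package is actually installed
  rpm -q bash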

shane


Re: FYI: jenkins systems patched to fix bash exploit

2014-09-26 Thread shane knapp


 we're not running bash.x86_64 0:4.1.2-15.el6_5.2 on all of our systems.

 s/not/now

:)


jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-29 Thread shane knapp
happy monday, everyone!

remember a few weeks back when i upgraded jenkins, and unwittingly began
DOSing our system due to massive log spam?

well, that bug has been fixed w/the current release and i'd like to get our
logging levels back to something more verbose than we have now.

downtime will be from 730am-1000am PDT (i do expect this to be done well
before 1000am)

the update will be from 1.578 to 1.582

changelog here:  http://jenkins-ci.org/changelog

please let me know if there are any questions or concerns.  thanks!

shane, your friendly devops engineer


Re: FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread shane knapp
yeah, this is why i'm gonna keep a close eye on things this week...

as for VMs vs containers, please do the latter more than the former.  one
of our longer-term plans here at the lab is to move most of our jenkins
infra to VMs, and running tests w/nested VMs is Bad[tm].

On Mon, Sep 29, 2014 at 2:25 PM, Reynold Xin r...@databricks.com wrote:

 Thanks. We might see more failures due to contention on resources. Fingers
 acrossed ... At some point it might make sense to run the tests in a VM or
 container.


 On Mon, Sep 29, 2014 at 2:20 PM, shane knapp skn...@berkeley.edu wrote:

 we were running at 8 executors per node, and BARELY even stressing the
 machines (32 cores, ~230G RAM).

 in the interest of actually using system resources, and giving ourselves
 some headroom, i upped the executors to 16 per node.  i'll be keeping an
 eye on ganglia for the rest of the week to make sure everything's cool.

 i hope you all enjoy your freshly allocated capacity!  :)

 shane





Re: jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-30 Thread shane knapp
https://issues.apache.org/jira/browse/SPARK-3745

On Tue, Sep 30, 2014 at 10:22 AM, shane knapp skn...@berkeley.edu wrote:

 (this time, reply to all)

 nice catch.  there's a bug in spark/dev/check-license, which i've
 confirmed from the CLI.  i'll open a bug and PR to fix it.

 On Mon, Sep 29, 2014 at 8:00 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  Just noticed these lines in the jenkins log

 =
 Running Apache RAT checks
 =
 Attempting to fetch rat
 Launching rat from 
 /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
 Error: Invalid or corrupt jarfile 
 /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
 RAT checks passed.


 Something wrong?


 Best,


 --
 Nan Zhu

 On Monday, September 29, 2014 at 4:43 PM, shane knapp wrote:

 happy monday, everyone!

 remember a few weeks back when i upgraded jenkins, and unwittingly began
 DOSing our system due to massive log spam?

 well, that bug has been fixed w/the current release and i'd like to get
 our
 logging levels back to something more verbose than we have now.

 downtime will be from 730am-1000am PDT (i do expect this to be done well
 before 1000am)

 the update will be from 1.578 - 1.582

 changelog here: http://jenkins-ci.org/changelog

 please let me know if there are any questions or concerns. thanks!

 shane, your friendly devops engineer
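
re: the check-license failure quoted above -- the script was trusting the
downloaded rat jar and reporting success even when the download produced a
corrupt file.  a hypothetical guard (a sketch of the idea, not the actual
SPARK-3745 patch) would look something like:

  # bail out if the fetched jar isn't actually a readable jar
  # ($rat_jar is illustrative -- whatever path the script downloads to)
  if ! jar tf "$rat_jar" > /dev/null 2>&1; then
    echo "unable to fetch a usable apache-rat jar at $rat_jar" >&2
    exit 1
  fi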






Re: jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-30 Thread shane knapp
reminder:  this is happening tomorrow morning.  i will be putting jenkins
in to quiet mode at ~7am, and then doing the upgrade once any stray builds
finish.

On Mon, Sep 29, 2014 at 1:43 PM, shane knapp skn...@berkeley.edu wrote:

 happy monday, everyone!

 remember a few weeks back when i upgraded jenkins, and unwittingly began
 DOSing our system due to massive log spam?

 well, that bug has been fixed w/the current release and i'd like to get
 our logging levels back to something more verbose than we have now.

 downtime will be from 730am-1000am PDT (i do expect this to be done well
 before 1000am)

 the update will be from 1.578 - 1.582

 changelog here:  http://jenkins-ci.org/changelog

 please let me know if there are any questions or concerns.  thanks!

 shane, your friendly devops engineer



Re: amplab jenkins is down

2014-10-01 Thread shane knapp
as of this morning, i've got the new jenkins up, with all of the current
builds set up (but failing).  i'm in the middle of playing setup/debug
whack-a-mole, but we're getting there.  my guess would be early next week
for the switchover.

On Wed, Oct 1, 2014 at 12:53 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 On Thu, Sep 4, 2014 at 4:19 PM, shane knapp skn...@berkeley.edu wrote:

 on a side note, this incident will be accelerating our plan to move the
 entire jenkins infrastructure in to a managed datacenter environment.
 this
 will be our major push over the next couple of weeks.  more details about
 this, also, as soon as i get them.


 Are there any updates on this move of the Jenkins infrastructure to a
 managed datacenter?

 I remember it being mentioned that another benefit of this move would be
 reduced flakiness when Jenkins tries to checkout patches for testing. For
 some reason, I'm getting a lot of those
 https://github.com/apache/spark/pull/2606#issuecomment-57514540 today.

 Nick



emergency jenkins restart -- massive security patch released

2014-10-03 Thread shane knapp
https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2014-10-01

there's some pretty big stuff that's been identified and we need to get
this upgraded asap.

i'll be killing off what's currently running, and will retrigger them all
once we're done.

sorry for the inconvenience.

shane


Re: emergency jenkins restart -- massive security patch released

2014-10-03 Thread shane knapp
update complete.  i'm retriggering builds now.

On Fri, Oct 3, 2014 at 10:51 AM, shane knapp skn...@berkeley.edu wrote:


 https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2014-10-01

 there's some pretty big stuff that's been identified and we need to get
 this upgraded asap.

 i'll be killing off what's currently running, and will retrigger them all
 once we're done.

 sorry for the inconvenience.

 shane



Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
AND WE ARE LIIIVE!

https://amplab.cs.berkeley.edu/jenkins/

have at it, folks!

On Mon, Oct 13, 2014 at 10:15 AM, shane knapp skn...@berkeley.edu wrote:

 quick update:  we should be back up and running in the next ~60mins.

 On Mon, Oct 13, 2014 at 7:54 AM, shane knapp skn...@berkeley.edu wrote:

 Jenkins is in quiet mode and the move will be starting after i have my
 coffee.  :)

 On Sun, Oct 12, 2014 at 11:26 PM, Josh Rosen rosenvi...@gmail.com
 wrote:

 Reminder: this Jenkins migration is happening tomorrow morning (Monday).

 On Fri, Oct 10, 2014 at 1:01 PM, shane knapp skn...@berkeley.edu
 wrote:

 reminder:  this IS happening, first thing monday morning PDT.  :)

 On Wed, Oct 8, 2014 at 3:01 PM, shane knapp skn...@berkeley.edu
 wrote:

  greetings!
 
  i've got some updates regarding our new jenkins infrastructure, as
 well as
  the initial date and plan for rolling things out:
 
  *** current testing/build break whack-a-mole:
  a lot of out of date artifacts are cached in the current jenkins,
 which
  has caused a few builds during my testing to break due to dependency
  resolution failure[1][2].
 
  bumping these versions can cause your builds to fail, due to public
 api
  changes and the like.  consider yourself warned that some projects
 might
  require some debugging...  :)
 
  tomorrow, i will be at databricks working w/@joshrosen to make sure
 that
  the spark builds have any bugs hammered out.
 
  ***  deployment plan:
  unless something completely horrible happens, THE NEW JENKINS WILL GO
 LIVE
  ON MONDAY (october 13th).
 
  all jenkins infrastructure will be DOWN for the entirety of the day
  (starting at ~8am).  this means no builds, period.  i'm hoping that
 the
  downtime will be much shorter than this, but we'll have to see how
  everything goes.
 
  all test/build history WILL BE PRESERVED.  i will be rsyncing the
 jenkins
  jobs/ directory over, complete w/history as part of the deployment.
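
the history copy itself is basically just an rsync of the jobs tree from the
old master to the new one -- roughly, assuming the default $JENKINS_HOME of
/var/lib/jenkins (the hostname below is a placeholder):

  # jenkins on the destination should be stopped while this runs
  rsync -avP /var/lib/jenkins/jobs/ new-jenkins-master:/var/lib/jenkins/jobs/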
 
  once i'm feeling good about the state of things, i'll point the
 original
  url to the new instances and send out an all clear.
 
  if you are a student at UC berkeley, you can log in to jenkins using
 your
  LDAP login, and (by default) view but not change plans.  if you do
 not have
  a UC berkeley LDAP login, you can still view plans anonymously.
 
  IF YOU ARE A PLAN ADMIN, THEN PLEASE REACH OUT, ASAP, PRIVATELY AND I
 WILL
  SET UP ADMIN ACCESS TO YOUR BUILDS.
 
  ***  post deployment plan:
  fix all of the things that break!
 
  i will be keeping a VERY close eye on the builds, checking for
 breaks, and
  helping out where i can.  if the situation is dire, i can always roll
 back
  to the old jenkins infra...  but i hope we never get to that point!
 :)
 
  i'm hoping that things will go smoothly, but please be patient as i'm
  certain we'll hit a few bumps in the road.
 
  please let me know if you guys have any
 comments/questions/concerns...  :)
 
  shane
 
  1 - https://github.com/bigdatagenomics/bdg-services/pull/18
  2 - https://github.com/bigdatagenomics/avocado/pull/111
 







Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Thanks for doing this work Shane.

 So is Jenkins in the new datacenter now? Do you know if the problems with
 checking out patches from GitHub should be resolved now? Here's an
 example from the past hour
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console
 .


 yeah, i just noticed that we're still having the checkout issues.  i was
really hoping that the better network would just make this go away...
 guess i'll be doing a deeper dive now.

i would just up the timeout, but that's not coming out for a little while
yet:
https://issues.jenkins-ci.org/browse/JENKINS-20387

(we are currently running the latest -- 2.2.7, and the timeout field is
coming in 2.3, whenever that is)

i'll try and strace/replicate it locally as well.


Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
ok, i found something that may help:
https://issues.jenkins-ci.org/browse/JENKINS-20445?focusedCommentId=195638&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-195638

i set this to 20 minutes...  let's see if that helps.
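
for the record, the knob referenced in that comment is a JVM system property
on the master rather than anything in the individual job configs.  roughly
how it gets set (treat the exact property name as my assumption from that
thread, and the file path as the red hat default):

  # in /etc/sysconfig/jenkins, append to whatever java options are already
  # there, raising the git-client timeout from the default 10 minutes to 20:
  JENKINS_JAVA_OPTIONS="-Djava.awt.headless=true -Dorg.jenkinsci.plugins.gitclient.Git.timeOut=20"

  # then restart jenkins so the new property is picked up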

On Mon, Oct 13, 2014 at 2:48 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Ah, that sucks. Thank you for looking into this.

 On Mon, Oct 13, 2014 at 5:43 PM, shane knapp skn...@berkeley.edu wrote:

 On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Thanks for doing this work Shane.

 So is Jenkins in the new datacenter now? Do you know if the problems
 with checking out patches from GitHub should be resolved now? Here's an
 example from the past hour
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console
 .


 yeah, i just noticed that we're still having the checkout issues.  i was
 really hoping that the better network would just make this go away...
  guess i'll be doing a deeper dive now.

 i would just up the timeout, but that's not coming out for a little while
 yet:
 https://issues.jenkins-ci.org/browse/JENKINS-20387

 (we are currently running the latest -- 2.2.7, and the timeout field is
 coming in 2.3, whenever that is)

 i'll try and strace/replicate it locally as well.






short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-15 Thread shane knapp
i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to see if
that helps w/the git fetch timeouts.

this will require a short downtime (~20 mins for builds to finish, ~20 mins
to downgrade), and will hopefully give us some insight in to wtf is going
on.

thanks for your patience...

shane


Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-15 Thread shane knapp
ok, we're up and building...  :crossesfingersfortheumpteenthtime:

On Wed, Oct 15, 2014 at 1:59 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I support this effort. :thumbsup:

 On Wed, Oct 15, 2014 at 4:52 PM, shane knapp skn...@berkeley.edu wrote:

 i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to see if
 that helps w/the git fetch timeouts.

 this will require a short downtime (~20 mins for builds to finish, ~20
 mins
 to downgrade), and will hopefully give us some insight in to wtf is going
 on.

 thanks for your patience...

 shane





Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-15 Thread shane knapp
four builds triggered  and no timeouts.  :crossestoes:  :)

On Wed, Oct 15, 2014 at 2:19 PM, shane knapp skn...@berkeley.edu wrote:

 ok, we're up and building...  :crossesfingersfortheumpteenthtime:

 On Wed, Oct 15, 2014 at 1:59 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I support this effort. :thumbsup:

 On Wed, Oct 15, 2014 at 4:52 PM, shane knapp skn...@berkeley.edu wrote:

 i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to see
 if
 that helps w/the git fetch timeouts.

 this will require a short downtime (~20 mins for builds to finish, ~20
 mins
 to downgrade), and will hopefully give us some insight in to wtf is going
 on.

 thanks for your patience...

 shane







Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-15 Thread shane knapp
ok, we've had about 10 spark pull request builds go through w/o any git
timeouts.  it seems that the git timeout issue might be licked.

i will be definitely be keeping an eye on this for the next few days.

thanks for being patient!

shane

On Wed, Oct 15, 2014 at 2:27 PM, shane knapp skn...@berkeley.edu wrote:

 four builds triggered  and no timeouts.  :crossestoes:  :)

 On Wed, Oct 15, 2014 at 2:19 PM, shane knapp skn...@berkeley.edu wrote:

 ok, we're up and building...  :crossesfingersfortheumpteenthtime:

 On Wed, Oct 15, 2014 at 1:59 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I support this effort. :thumbsup:

 On Wed, Oct 15, 2014 at 4:52 PM, shane knapp skn...@berkeley.edu
 wrote:

 i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to see
 if
 that helps w/the git fetch timeouts.

 this will require a short downtime (~20 mins for builds to finish, ~20
 mins
 to downgrade), and will hopefully give us some insight in to wtf is
 going
 on.

 thanks for your patience...

 shane








Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-16 Thread shane knapp
the bad news is that we've had a couple more failures due to timeouts, but
the good news is that the frequency of these failures has decreased
significantly (3 in the past ~18hr).

seems like the git plugin downgrade has helped relieve the problem, but
hasn't fixed it.  i'll be looking in to this more today.

On Wed, Oct 15, 2014 at 7:05 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 A quick scan through the Spark PR board https://spark-prs.appspot.com/ shows
 no recent failures related to this git checkout problem.

 Looks promising!

 Nick

 On Wed, Oct 15, 2014 at 6:10 PM, shane knapp skn...@berkeley.edu wrote:

 ok, we've had about 10 spark pull request builds go through w/o any git
 timeouts.  it seems that the git timeout issue might be licked.

 i will be definitely be keeping an eye on this for the next few days.

 thanks for being patient!

 shane

 On Wed, Oct 15, 2014 at 2:27 PM, shane knapp skn...@berkeley.edu wrote:

  four builds triggered  and no timeouts.  :crossestoes:  :)
 
  On Wed, Oct 15, 2014 at 2:19 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  ok, we're up and building...  :crossesfingersfortheumpteenthtime:
 
  On Wed, Oct 15, 2014 at 1:59 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  I support this effort. :thumbsup:
 
  On Wed, Oct 15, 2014 at 4:52 PM, shane knapp skn...@berkeley.edu
  wrote:
 
  i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to
 see
  if
  that helps w/the git fetch timeouts.
 
  this will require a short downtime (~20 mins for builds to finish,
 ~20
  mins
  to downgrade), and will hopefully give us some insight in to wtf is
  going
  on.
 
  thanks for your patience...
 
  shane
 
 
 
 
 
 





Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-16 Thread shane knapp
yeah, at this point it might be worth trying.  :)

the absolutely irritating thing is that i am not seeing this happen w/any
jobs other than the spark prb, nor does it seem to correlate w/time
of day, network or system load, or what slave it runs on.  nor are we
hitting our limit of connections on github.  i really, truly hate
non-deterministic failures.

i'm also going to write an email to support@github and see if they have any
insight in to this as well.
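
(if we do end up scripting the fetch ourselves in run-tests-jenkins, a
minimal retry wrapper could look something like this -- just a sketch,
nothing i've tested on the workers, and the timeout/retry values are
placeholders:)

snip
# sketch: fetch only the one PR's refs, with retries, instead of relying
# on the git plugin.  PR number comes in as $1; 900s/3 tries are guesses.
fetch_pr_with_retries() {
  local pr="$1" attempt
  for attempt in 1 2 3; do
    if timeout 900 git fetch --tags --progress \
        https://github.com/apache/spark.git \
        "+refs/pull/${pr}/*:refs/remotes/origin/pr/${pr}/*"; then
      return 0
    fi
    echo "git fetch for PR ${pr} timed out (attempt ${attempt}), retrying..."
    sleep 30
  done
  return 1
}
/snip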

On Thu, Oct 16, 2014 at 12:51 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Thanks for continuing to look into this, Shane.

 One suggestion that Patrick brought up, if we have trouble getting to the
 bottom of this, is doing the git checkout ourselves in the
 run-tests-jenkins script and cutting out the Jenkins git plugin entirely.
 That way we can script retries and post friendlier messages about timeouts
 if they still occur by ourselves.

 Do you think that’s worth trying at some point?

 Nick
 ​

 On Thu, Oct 16, 2014 at 2:04 PM, shane knapp skn...@berkeley.edu wrote:

 the bad news is that we've had a couple more failures due to timeouts,
 but the good news is that the frequency that these happen has decreased
 significantly (3 in the past ~18hr).

 seems like the git plugin downgrade has helped relieve the problem, but
 hasn't fixed it.  i'll be looking in to this more today.

 On Wed, Oct 15, 2014 at 7:05 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 A quick scan through the Spark PR board https://spark-prs.appspot.com/ 
 shows
 no recent failures related to this git checkout problem.

 Looks promising!

 Nick

 On Wed, Oct 15, 2014 at 6:10 PM, shane knapp skn...@berkeley.edu
 wrote:

 ok, we've had about 10 spark pull request builds go through w/o any git
 timeouts.  it seems that the git timeout issue might be licked.

 i will be definitely be keeping an eye on this for the next few days.

 thanks for being patient!

 shane

 On Wed, Oct 15, 2014 at 2:27 PM, shane knapp skn...@berkeley.edu
 wrote:

  four builds triggered  and no timeouts.  :crossestoes:  :)
 
  On Wed, Oct 15, 2014 at 2:19 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  ok, we're up and building...  :crossesfingersfortheumpteenthtime:
 
  On Wed, Oct 15, 2014 at 1:59 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  I support this effort. :thumbsup:
 
  On Wed, Oct 15, 2014 at 4:52 PM, shane knapp skn...@berkeley.edu
  wrote:
 
  i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2)
 to see
  if
  that helps w/the git fetch timeouts.
 
  this will require a short downtime (~20 mins for builds to finish,
 ~20
  mins
  to downgrade), and will hopefully give us some insight in to wtf is
  going
  on.
 
  thanks for your patience...
 
  shane
 
 
 
 
 
 







Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread shane knapp
ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
fixed the SparkR build but apparently made Spark itself quite unhappy.  i
removed that JDK, triggered a build (
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
and it compiled kinesis w/o dying a fiery death.

apparently 7u71 is stricter when compiling.  sad times.

sorry about that!

shane


On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com wrote:

 The failure is in the Kinesis component, can you reproduce this if you
 build with -Pkinesis-asl?

 - Patrick

 On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu wrote:
  hmm, strange.  i'll take a look.
 
  On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 
  yes, I can compile locally, too
 
  but it seems that Jenkins is not happy now...
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
 
  All failed to compile
 
  Best,
 
  --
  Nan Zhu
 
 
  On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
 
   I performed build on latest master branch but didn't get compilation
  error.
  
   FYI
  
   On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com
  (mailto:zhunanmcg...@gmail.com) wrote:
Hi,
   
I just submitted a patch
  https://github.com/apache/spark/pull/2864/files
with one line change
   
but the Jenkins told me it's failed to compile on the unrelated
 files?
   
   
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
   
   
Best,
   
Nan
  
 
 



Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread shane knapp
thanks, patrick!

:)

On Mon, Oct 20, 2014 at 5:35 PM, Patrick Wendell pwend...@gmail.com wrote:

 I created an issue to fix this:

 https://issues.apache.org/jira/browse/SPARK-4021

 On Mon, Oct 20, 2014 at 5:32 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Thanks Shane - we should fix the source code issues in the Kinesis
  code that made stricter Java compilers reject it.
 
  - Patrick
 
  On Mon, Oct 20, 2014 at 5:28 PM, shane knapp skn...@berkeley.edu
 wrote:
  ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
  fixed the SparkR build but apparently made Spark itself quite unhappy.
 i
  removed that JDK, triggered a build
  (
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console
 ),
  and it compiled kinesis w/o dying a fiery death.
 
  apparently 7u71 is stricter when compiling.  sad times.
 
  sorry about that!
 
  shane
 
 
  On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  The failure is in the Kinesis component, can you reproduce this if you
  build with -Pkinesis-asl?
 
  - Patrick
 
  On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu
 wrote:
   hmm, strange.  i'll take a look.
  
   On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com
 wrote:
  
   yes, I can compile locally, too
  
   but it seems that Jenkins is not happy now...
   https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
  
   All failed to compile
  
   Best,
  
   --
   Nan Zhu
  
  
   On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
  
I performed build on latest master branch but didn't get
 compilation
   error.
   
FYI
   
On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com
   (mailto:zhunanmcg...@gmail.com) wrote:
 Hi,

 I just submitted a patch
   https://github.com/apache/spark/pull/2864/files
 with one line change

 but the Jenkins told me it's failed to compile on the unrelated
 files?


  
  
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console


 Best,

 Nan
   
  
  
 
 



Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread shane knapp
i'm currently in a meeting and will be starting to do some tests in ~1 hour
or so.

On Tue, Oct 21, 2014 at 11:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 I agree with Sean

 I just compiled spark core successfully with 7u71 in Mac OS X

 On Tue, Oct 21, 2014 at 1:11 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Ah, that makes sense.  I had forgotten that there was a JIRA for this:

 https://issues.apache.org/jira/browse/SPARK-4021

 On October 21, 2014 at 10:08:58 AM, Patrick Wendell (pwend...@gmail.com)
 wrote:

 Josh - the errors that broke our build indicated that JDK5 was being
 used. Somehow the upgrade caused our build to use a much older Java
 version. See the JIRA for more details.

 On Tue, Oct 21, 2014 at 10:05 AM, Josh Rosen rosenvi...@gmail.com
 wrote:
  I find it concerning that there's a JDK version that breaks our build,
 since
  we're supposed to support Java 7. Is 7u71 an upgrade or downgrade from
 the
  JDK that we used before? Is there an easy way to fix our build so that
 it
  compiles with 7u71's stricter settings?
 
  I'm not sure why the New PRB is failing here. It was originally
 created
  as a clone of the main pull request builder job. I checked the
 configuration
  history and confirmed that there aren't any settings that we've
 forgotten to
  copy over (e.g. their configurations haven't diverged), so I'm not sure
  what's causing this.
 
  - Josh
 
  On October 21, 2014 at 6:35:39 AM, Nan Zhu (zhunanmcg...@gmail.com)
 wrote:
 
  weird.  two builds (one triggered by New, one triggered by Old)
 were
  executed in the same node, amp-jenkins-slave-01, one compiles, one
 not...
 
  Best,
 
  --
  Nan Zhu
 
 
  On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote:
 
  seems that all PRs built by NewSparkPRBuilder suffer from 7u71, while
  SparkPRBuilder is working fine
 
  Best,
 
  --
  Nan Zhu
 
 
  On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:
 
   It's a new pull request builder written by Josh, integrated into our
   state-of-the-art PR dashboard :)
  
   On 10/21/14 9:33 PM, Nan Zhu wrote:
just curious...what is this NewSparkPullRequestBuilder?
   
Best,
   
--
Nan Zhu
   
   
On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
   

 Hm, seems that 7u71 comes back again. Observed similar Kinesis
 compilation error just now:

 https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull


 Checked Jenkins slave nodes, saw /usr/java/latest points to
 jdk1.7.0_71. However, /usr/bin/javac -version says:

 
  Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM
  Corp 2000, 2008. All rights reserved.
 


 Which JDK is actually used by Jenkins?


 Cheng


 On 10/21/14 8:28 AM, shane knapp wrote:

  ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
  fixed the SparkR build but apparently made Spark itself quite unhappy.
  i removed that JDK, triggered a build (
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
  and it compiled kinesis w/o dying a fiery death.  apparently 7u71 is
  stricter when compiling.  sad times.  sorry about that!  shane

  On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com wrote:

   The failure is in the Kinesis component, can you reproduce this if you
   build with -Pkinesis-asl?  - Patrick

   On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu wrote:

    hmm, strange.  i'll take a look.

    On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

     yes, I can compile locally, too but it seems that Jenkins is not
     happy now...
     https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
     All failed to compile

     Best,
     -- Nan Zhu

     On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:

      I performed build on latest master branch but didn't get
      compilation error.

      FYI

      On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

       Hi, I just submitted a patch
       https://github.com/apache/spark/pull/2864/files
       with one line change

       but the Jenkins told me it's failed to compile on the unrelated
       files?

       https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console

       Best, Nan

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread shane knapp
ok, i did some testing and found out what's happening.

https://issues.apache.org/jira/browse/SPARK-4021

here's the TL;DR:
jenkins ignores what JDKs are installed via the web interface when there's
more than one defined, and falls back to whatever is default on the slave
the test is run on.  in this case, it's openjdk 7u65...  and spark
compilation fails.  i've removed the 2nd JDK (7u71) from jenkins, and
everything is back to normal.
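
(for future reference, a quick sanity check like this at the top of a
build would have caught the fallback immediately -- just a sketch,
assuming the standard centos java symlinks:)

snip
# sketch: print which jdk the build is actually using and warn loudly if
# it isn't the one we think we configured (version string is a placeholder)
echo "JAVA_HOME=${JAVA_HOME:-unset}"
which java javac
java -version 2>&1
readlink -f /usr/java/latest || true
java -version 2>&1 | grep -q "1\.7\.0_71" || echo "WARNING: unexpected JDK in use"
/snip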

On Tue, Oct 21, 2014 at 11:51 AM, shane knapp skn...@berkeley.edu wrote:

 i'm currently in a meeting and will be starting to do some tests in ~1
 hour or so.

 On Tue, Oct 21, 2014 at 11:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 I agree with Sean

 I just compiled spark core successfully with 7u71 in Mac OS X

 On Tue, Oct 21, 2014 at 1:11 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Ah, that makes sense.  I had forgotten that there was a JIRA for this:

 https://issues.apache.org/jira/browse/SPARK-4021

 On October 21, 2014 at 10:08:58 AM, Patrick Wendell (pwend...@gmail.com)
 wrote:

 Josh - the errors that broke our build indicated that JDK5 was being
 used. Somehow the upgrade caused our build to use a much older Java
 version. See the JIRA for more details.

 On Tue, Oct 21, 2014 at 10:05 AM, Josh Rosen rosenvi...@gmail.com
 wrote:
  I find it concerning that there's a JDK version that breaks our build,
 since
  we're supposed to support Java 7. Is 7u71 an upgrade or downgrade from
 the
  JDK that we used before? Is there an easy way to fix our build so that
 it
  compiles with 7u71's stricter settings?
 
  I'm not sure why the New PRB is failing here. It was originally
 created
  as a clone of the main pull request builder job. I checked the
 configuration
  history and confirmed that there aren't any settings that we've
 forgotten to
  copy over (e.g. their configurations haven't diverged), so I'm not
 sure
  what's causing this.
 
  - Josh
 
  On October 21, 2014 at 6:35:39 AM, Nan Zhu (zhunanmcg...@gmail.com)
 wrote:
 
  weird.  two builds (one triggered by New, one triggered by Old)
 were
  executed in the same node, amp-jenkins-slave-01, one compiles, one
 not...
 
  Best,
 
  --
  Nan Zhu
 
 
  On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote:
 
  seems that all PRs built by NewSparkPRBuilder suffer from 7u71,
 while
  SparkPRBuilder is working fine
 
  Best,
 
  --
  Nan Zhu
 
 
  On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:
 
   It's a new pull request builder written by Josh, integrated into
 our
   state-of-the-art PR dashboard :)
  
   On 10/21/14 9:33 PM, Nan Zhu wrote:
just curious...what is this NewSparkPullRequestBuilder?
   
Best,
   
--
Nan Zhu
   
   
On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
   

 Hm, seems that 7u71 comes back again. Observed similar Kinesis
 compilation error just now:

 https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull


 Checked Jenkins slave nodes, saw /usr/java/latest points to
 jdk1.7.0_71. However, /usr/bin/javac -version says:

 
  Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright
 IBM
  Corp 2000, 2008. All rights reserved.
 


 Which JDK is actually used by Jenkins?


 Cheng


 On 10/21/14 8:28 AM, shane knapp wrote:

   ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
   fixed the SparkR build but apparently made Spark itself quite unhappy.
   i removed that JDK, triggered a build (
   https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
   and it compiled kinesis w/o dying a fiery death.  apparently 7u71 is
   stricter when compiling.  sad times.  sorry about that!  shane

   On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com wrote:

    The failure is in the Kinesis component, can you reproduce this if
    you build with -Pkinesis-asl?  - Patrick

    On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu wrote:

     hmm, strange.  i'll take a look.

     On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

      yes, I can compile locally, too but it seems that Jenkins is not
      happy now...
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
      All failed to compile

      Best,
      -- Nan Zhu

      On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:

       I performed build on latest master branch but didn't get
       compilation error.

       FYI

       On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

        Hi, I just submitted a patch

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-21 Thread shane knapp
i've seen a few more builds fail w/timeouts and it appears that we're
definitely NOT hitting any rate limiting.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22005/console

[jenkins@amp-jenkins-slave-01 ~]$ curl -i -H "Authorization: token REDACTED" https://api.github.com | grep Rate
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4997
X-RateLimit-Reset: 1413929848
Access-Control-Expose-Headers: ETag, Link, X-GitHub-OTP, X-RateLimit-Limit,
X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes,
X-Accepted-OAuth-Scopes, X-Poll-Interval
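
(if we want a running history to line up against the timeouts, something
like this in cron would do it -- a sketch only; the token file path and
log location are made up:)

snip
# sketch: log the github rate limit headers once a minute so we can
# correlate them with build failures later
TOKEN=$(cat ~/.github_token)   # placeholder path
echo "$(date -u +%FT%TZ) $(curl -si -H "Authorization: token ${TOKEN}" \
  https://api.github.com/rate_limit \
  | grep -E '^X-RateLimit-(Limit|Remaining):' | tr -d '\r' | xargs)" \
  >> /var/log/github-ratelimit.log
/snip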

On Sat, Oct 18, 2014 at 12:44 AM, Davies Liu dav...@databricks.com wrote:

 Cool, the 4 recent builds used the new configs, thanks!

 Let's run more builds.

 Davies

 On Fri, Oct 17, 2014 at 11:06 PM, Josh Rosen rosenvi...@gmail.com wrote:
  I think that the fix was applied.  Take a look at
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull
 
  Here, I see a fetch command that mentions this specific PR branch rather
  than the wildcard that we had before:
 
git fetch --tags --progress https://github.com/apache/spark.git
  +refs/pull/2840/*:refs/remotes/origin/pr/2840/* # timeout=15
 
 
  Do you have an example of a Spark PRB build that’s still failing with the
  old fetch failure?
 
  - Josh
 
  On October 17, 2014 at 11:03:14 PM, Davies Liu (dav...@databricks.com)
  wrote:
 
  How can we know the changes have been applied? I had checked several
  recent builds, they all use the original configs.
 
  Davies
 
  On Fri, Oct 17, 2014 at 6:17 PM, Josh Rosen rosenvi...@gmail.com
 wrote:
  FYI, I edited the Spark Pull Request Builder job to try this out. Let’s
  see
  if it works (I’ll be around to revert if it doesn’t).
 
  On October 17, 2014 at 5:26:56 PM, Davies Liu (dav...@databricks.com)
  wrote:
 
  One finding is that all the timeout happened with this command:
 
  git fetch --tags --progress https://github.com/apache/spark.git
  +refs/pull/*:refs/remotes/origin/pr/*
 
  I'm thinking that this may be an expensive call; we could try to
  use a cheaper one:
 
  git fetch --tags --progress https://github.com/apache/spark.git
  +refs/pull/XXX/*:refs/remotes/origin/pr/XXX/*
 
  XXX is the PullRequestID,
 
  The configuration support parameters [1], so we could put this in :
 
  +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 
  I have not tested this yet, could you give this a try?
 
  Davies
 
 
  [1]
 
 
 https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin
 
  On Fri, Oct 17, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu
 wrote:
  actually, nvm, you have to run that command from our servers to
 affect
  our limit. run it all you want from your own machines! :P
 
  On Fri, Oct 17, 2014 at 4:59 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  yep, and i will tell you guys ONLY if you promise to NOT try this
  yourselves... checking the rate limit also counts as a hit and
  increments
  our numbers:
 
  # curl -i https://api.github.com/users/whatever 2>/dev/null | egrep ^X-Rate
  X-RateLimit-Limit: 60
  X-RateLimit-Remaining: 51
  X-RateLimit-Reset: 1413590269
 
  (yes, that is the exact url that they recommended on the github site
  lol)
 
  so, earlier today, we had a spark build fail w/a git timeout at
 10:57am,
  but there were only ~7 builds run that hour, so that points to us NOT
  hitting the rate limit... at least for this fail. whee!
 
  is it beer-thirty yet?
 
  shane
 
 
 
  On Fri, Oct 17, 2014 at 4:52 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  Wow, thanks for this deep dive Shane. Is there a way to check if we
 are
  getting hit by rate limiting directly, or do we need to contact
 GitHub
  for that?
 
  On Friday, October 17, 2014, shane knapp skn...@berkeley.edu wrote:
 
  quick update:
 
  here are some stats i scraped over the past week of ALL pull request
  builder projects and timeout failures. due to the large number of
  spark
  ghprb jobs, i don't have great records earlier than oct 7th. the
 data
  is
  current up until ~230pm today:
 
  spark and new spark ghprb total builds vs git fetch timeouts:
  $ for x in 10-{09..17}; do
      passed=$(grep $x SORTED.passed | grep -i spark | wc -l)
      failed=$(grep $x SORTED | grep -i spark | wc -l)
      let total=passed+failed
      fail_percent=$(echo "scale=2; $failed/$total" | bc | sed 's/^\.//g')
      line="$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%"
      echo -e "$line"
    done
  10-09 -- total builds: 140 p/f: 92/48 fail%: 34%
  10-10 -- total builds: 65 p/f: 59/6 fail%: 09%
  10-11 -- total builds: 29 p/f: 29/0 fail%: 0%
  10-12 -- total builds: 24 p/f: 21/3 fail%: 12%
  10-13 -- total builds: 39 p/f: 35/4 fail%: 10%
  10-14 -- total builds: 7 p/f: 5/2 fail%: 28%
  10-15 -- total builds: 37 p/f: 34/3 fail%: 08%
  10-16 -- total builds: 71 p/f: 59/12 fail%: 16%
  10-17 -- total builds: 26 p/f: 20/6 fail%: 23%
 
  all other ghprb builds vs git fetch timeouts

your weekly git timeout update! TL;DR: i'm now almost certain we're not hitting rate limits.

2014-10-24 Thread shane knapp
so, things look like they've stabilized significantly over the past 10
days, and without any changes on our end:
snip
$ /root/tools/get_timeouts.sh 10
timeouts by date:
2014-10-14 -- 2
2014-10-16 -- 1
2014-10-19 -- 1
2014-10-20 -- 2
2014-10-23 -- 5

timeouts by project:
  5 NewSparkPullRequestBuilder
  5 SparkPullRequestBuilder
  1 Tachyon-Pull-Request-Builder
total builds (excepting aborted by a user):
602

total percentage of builds timing out:
01
/snip

the NewSparkPullRequestBuilder failures are spread over five different days
(10-14 through 10-20), and the SparkPullRequestBuilder failures all
happened yesterday.  there were a LOT of SparkPullRequestBuilder builds
yesterday (60), and the failures happened during these hours (first number
== number of builds failed, second number == hour of the day):
snip
$ cat timeouts-102414-130817 | grep SparkPullRequestBuilder | grep
2014-10-23 | awk '{print $3}' | awk -F: '{print $1}' | sort | uniq -c
  1 03
  2 20
  1 22
  1 23
/snip

however, the number of total SparkPullRequestBuilder builds during these
times doesn't seem egregious:
snip
  4 03
  9 20
  4 22
  9 23
/snip

nor does the total for ALL builds at those times:
snip
  5 03
  9 20
  7 22
 11 23
/snip

9 builds was the largest number of SparkPullRequestBuilder builds per hour,
but there were other hours with 5, 6 or 7 builds/hour that didn't have a
timeout issue.

in fact, hour 16 (4pm) had the most builds running total yesterday, which
includes 7 SparkPullRequestBuilder builds, and nothing timed out.

most of the pull request builder hits on github are authenticated w/an
oauth token.  this gives us 5000 hits/hour, and unauthed gives us 60/hour.

in conclusion:  there is no way we are hitting github often enough to be
rate limited.  i think i've finally ruled that out completely.  :)
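
(for anyone who wants to poke at the raw data, the scan basically boils
down to grepping the console logs on the master -- a sketch; the jenkins
home path and the exact timeout string are the usual defaults and may
differ slightly on our master:)

snip
# sketch: count PRB builds whose console log hit the git fetch timeout,
# grouped by day.  /var/lib/jenkins is an assumption -- use $JENKINS_HOME.
cd /var/lib/jenkins/jobs/SparkPullRequestBuilder/builds
grep -l "ERROR: Timeout after" */log 2>/dev/null | while read f; do
  date -r "$f" +%F    # mtime of the failing build's console log
done | sort | uniq -c
/snip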


jenkins downtime tomorrow morning ~6am-8am PDT

2014-10-27 Thread shane knapp
i'll be bringing jenkins down tomorrow morning for some system maintenance
and to get our backups kicked off.

i do expect to have the system back up and running before 8am.

please let me know ASAP if i need to reschedule this.

thanks,

shane


jenkins emergency restart now, was Re: jenkins downtime tomorrow morning ~6am-8am PDT

2014-10-27 Thread shane knapp
so, i'm having a race condition between a plugin i installed putting
jenkins in to quiet mode and it failing to perform a backup from this past
weekend.  i'll need to restart the process and get it out of the
constantly-in-to-quiet-mode cycle it's in now.
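
(for the curious, knocking the master out of a stuck quiet-down is just a
POST to the master -- a sketch, assuming an admin user and api token:)

snip
# sketch: clear a stuck quiet-down state; USER:APITOKEN is a placeholder
curl -X POST -u USER:APITOKEN https://amplab.cs.berkeley.edu/jenkins/cancelQuietDown
# or, via the cli jar:
# java -jar jenkins-cli.jar -s https://amplab.cs.berkeley.edu/jenkins/ cancel-quiet-down
/snip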

this will be quick, and i'll restart the jobs i've killed.

this DOES NOT affect the restart/maintenance tomorrow morning.

sorry about the inconvenience,

shane

On Mon, Oct 27, 2014 at 10:46 AM, shane knapp skn...@berkeley.edu wrote:

 i'll be bringing jenkins down tomorrow morning for some system maintenance
 and to get our backups kicked off.

 i do expect to have the system back up and running before 8am.

 please let me know ASAP if i need to reschedule this.

 thanks,

 shane



Re: jenkins emergency restart now, was Re: jenkins downtime tomorrow morning ~6am-8am PDT

2014-10-27 Thread shane knapp
ok we're back up and building.  i've retriggered the jobs i killed.

On Mon, Oct 27, 2014 at 1:24 PM, shane knapp skn...@berkeley.edu wrote:

 so, i'm having a race condition between a plugin i installed putting
 jenkins in to quiet mode and it failing to perform a backup from this past
 weekend.  i'll need to restart the process and get it out of the
 constantly-in-to-quiet-mode cycle it's in now.

 this will be quick, and i'll restart the jobs i've killed.

 this DOES NOT affect the restart/maintenance tomorrow morning.

 sorry about the inconvenience,

 shane

 On Mon, Oct 27, 2014 at 10:46 AM, shane knapp skn...@berkeley.edu wrote:

 i'll be bringing jenkins down tomorrow morning for some system
 maintenance and to get our backups kicked off.

 i do expect to have the system back up and running before 8am.

 please let me know ASAP if i need to reschedule this.

 thanks,

 shane





Re: jenkins downtime tomorrow morning ~6am-8am PDT

2014-10-28 Thread shane knapp
this is done, and jenkins is up and building again.

On Mon, Oct 27, 2014 at 10:46 AM, shane knapp skn...@berkeley.edu wrote:

 i'll be bringing jenkins down tomorrow morning for some system maintenance
 and to get our backups kicked off.

 i do expect to have the system back up and running before 8am.

 please let me know ASAP if i need to reschedule this.

 thanks,

 shane



[important] jenkins down

2014-11-20 Thread shane knapp
i noticed that there were no builds, and noticed that it's throwing a bunch
of exceptions in the log file.

i'm looking in to this right now and will update when i get things rolling
again.

sorry for the inconvenience,

shane


Re: [important] jenkins down

2014-11-20 Thread shane knapp
ok, we're back up and building now...  looks like there was a seriously bad
git (or github) plugin update that caused all sorts of unintended
consequences, mostly with cron stacktracing.

i'll take a closer look and see if i can find out exactly what happened,
but suffice to say, we'll be really cautious when updating even recommended
plugins.

sorry for the disruption!

shane

On Thu, Nov 20, 2014 at 10:21 AM, shane knapp skn...@berkeley.edu wrote:

 i noticed that there were no builds, and noticed that it's throwing a
 bunch of exceptions in the log file.

 i'm looking in to this right now and will update when i get things rolling
 again.

 sorry for the inconvenience,

 shane



jenkins downtime: 730-930am, 12/12/14

2014-12-01 Thread shane knapp
i'll send out a reminder next week, but i wanted to give a heads up:  i'll
be bringing down the entire jenkins infrastructure for reboots and system
updates.

please let me know if there are any conflicts with this, thanks!

shane


adding new jenkins worker nodes to eventually replace existing ones

2014-12-09 Thread shane knapp
i just turned up a new jenkins slave (amp-jenkins-worker-01) to ensure it
builds properly.  these machines have half the ram, same number of
processors and more disk, which will hopefully help us achieve more than
the ~15-20% system utilization we're getting on the current
amp-jenkins-slave-{01..05} nodes.

instead of 5 super beefy slaves w/16 workers each, we're planning on 8 less
beefy slaves w/12 workers each.  this should definitely cut down on the
build queue, and not impact build times in a negative way at all.

i'll keep a close eye on amp-jenkins-worker-01 before i start releasing the
other seven in to the wild.

there should be a minimal user impact, but if i happen to miss something,
please don't hesitate to let me know!

thanks,

shane


Re: adding new jenkins worker nodes to eventually replace existing ones

2014-12-09 Thread shane knapp
forgot to install git on this node.  /headdesk

i retriggered the failed spark prb jobs.

On Tue, Dec 9, 2014 at 10:49 AM, shane knapp skn...@berkeley.edu wrote:

 i just turned up a new jenkins slave (amp-jenkins-worker-01) to ensure it
 builds properly.  these machines have half the ram, same number of
 processors and more disk, which will hopefully help us achieve more than
 the ~15-20% system utilization we're getting on the current
 amp-jenkins-slave-{01..05} nodes.

 instead of 5 super beefy slaves w/16 workers each, we're planning on 8
 less beefy slaves w/12 workers each.  this should definitely cut down on
 the build queue, and not impact build times in a negative way at all.

 i'll keep a close eye on amp-jenkins-worker-01 before i start releasing
 the other seven in to the wild.

 there should be a minimal user impact, but if i happen to miss something,
 please don't hesitate to let me know!

 thanks,

 shane



Re: jenkins downtime: 730-930am, 12/12/14

2014-12-10 Thread shane knapp
reminder -- this is happening friday morning @ 730am!

On Mon, Dec 1, 2014 at 5:10 PM, shane knapp skn...@berkeley.edu wrote:

 i'll send out a reminder next week, but i wanted to give a heads up:  i'll
 be bringing down the entire jenkins infrastructure for reboots and system
 updates.

 please let me know if there are any conflicts with this, thanks!

 shane



Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
reminder:  jenkins is going down NOW.

On Thu, Dec 11, 2014 at 3:08 PM, shane knapp skn...@berkeley.edu wrote:

 here's the plan...  reboots, of course, come last.  :)

 pause build queue at 7am, kill off (and eventually retrigger) any
 stragglers at 8am.  then begin maintenance:

 all systems:
 * yum update all servers (amp-jenkins-master, amp-jenkins-slave-{01..05},
 amp-jenkins-worker-{01..08})
 * reboots

 jenkins slaves:
 * install python2.7 (alongside 2.6, which would remain the default)
 * install numpy 1.9.1 (currently on 1.4, breaking some spark branch builds)
 * add new slaves to the master, remove old ones (keep them around just in
 case)

 there will be no jenkins system or plugin upgrades at this time.  things
 there seem to be working just fine!

 i'm expecting to be up and building by 9am at the latest.  i'll update
 this thread w/any new time estimates.

 word.

 shane, your rained-in devops guy :)

 On Wed, Dec 10, 2014 at 11:28 AM, shane knapp skn...@berkeley.edu wrote:

 reminder -- this is happening friday morning @ 730am!

 On Mon, Dec 1, 2014 at 5:10 PM, shane knapp skn...@berkeley.edu wrote:

 i'll send out a reminder next week, but i wanted to give a heads up:
  i'll be bringing down the entire jenkins infrastructure for reboots and
 system updates.

 please let me know if there are any conflicts with this, thanks!

 shane






Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
downtime is extended to 10am PST so that i can finish testing the numpy
upgrade...  besides that, everything looks good and the system updates and
reboots went off w/o a hitch.
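
(for reference, verifying the upgrade on a worker boils down to something
like this -- just a sketch:)

snip
# sketch: confirm both pythons are present and numpy is at the new version
/usr/bin/python2.6 -V
python2.7 -V
python2.7 -c "import numpy; print(numpy.__version__)"          # expect 1.9.1
/usr/bin/python2.6 -c "import numpy; print(numpy.__version__)"
/snip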

shane

On Fri, Dec 12, 2014 at 7:26 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  jenkins is going down NOW.

 On Thu, Dec 11, 2014 at 3:08 PM, shane knapp skn...@berkeley.edu wrote:

 here's the plan...  reboots, of course, come last.  :)

 pause build queue at 7am, kill off (and eventually retrigger) any
 stragglers at 8am.  then begin maintenance:

 all systems:
 * yum update all servers (amp-jenkins-master, amp-jenkins-slave-{01..05},
 amp-jenkins-worker-{01..08})
 * reboots

 jenkins slaves:
 * install python2.7 (alongside 2.6, which would remain the default)
 * install numpy 1.9.1 (currently on 1.4, breaking some spark branch
 builds)
 * add new slaves to the master, remove old ones (keep them around just in
 case)

 there will be no jenkins system or plugin upgrades at this time.  things
 there seem to be working just fine!

 i'm expecting to be up and building by 9am at the latest.  i'll update
 this thread w/any new time estimates.

 word.

 shane, your rained-in devops guy :)

 On Wed, Dec 10, 2014 at 11:28 AM, shane knapp skn...@berkeley.edu
 wrote:

 reminder -- this is happening friday morning @ 730am!

 On Mon, Dec 1, 2014 at 5:10 PM, shane knapp skn...@berkeley.edu wrote:

 i'll send out a reminder next week, but i wanted to give a heads up:
  i'll be bringing down the entire jenkins infrastructure for reboots and
 system updates.

 please let me know if there are any conflicts with this, thanks!

 shane







Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
ok, we're back up w/all new jenkins workers.  i'll be keeping an eye on
these pretty closely today for any build failures caused by the new
systems, and if things look bleak, i'll switch back to the original five.

thanks for your patience!

On Fri, Dec 12, 2014 at 8:47 AM, shane knapp skn...@berkeley.edu wrote:

 downtime is extended to 10am PST so that i can finish testing the numpy
 upgrade...  besides that, everything looks good and the system updates and
 reboots went off w/o a hitch.

 shane

 On Fri, Dec 12, 2014 at 7:26 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  jenkins is going down NOW.

 On Thu, Dec 11, 2014 at 3:08 PM, shane knapp skn...@berkeley.edu wrote:

 here's the plan...  reboots, of course, come last.  :)

 pause build queue at 7am, kill off (and eventually retrigger) any
 stragglers at 8am.  then begin maintenance:

 all systems:
 * yum update all servers (amp-jenkins-master, amp-jenkins-slave-{01..05},
 amp-jenkins-worker-{01..08})
 * reboots

 jenkins slaves:
 * install python2.7 (alongside 2.6, which would remain the default)
 * install numpy 1.9.1 (currently on 1.4, breaking some spark branch
 builds)
 * add new slaves to the master, remove old ones (keep them around just
 in case)

 there will be no jenkins system or plugin upgrades at this time.  things
 there seem to be working just fine!

 i'm expecting to be up and building by 9am at the latest.  i'll update
 this thread w/any new time estimates.

 word.

 shane, your rained-in devops guy :)

 On Wed, Dec 10, 2014 at 11:28 AM, shane knapp skn...@berkeley.edu
 wrote:

 reminder -- this is happening friday morning @ 730am!

 On Mon, Dec 1, 2014 at 5:10 PM, shane knapp skn...@berkeley.edu
 wrote:

 i'll send out a reminder next week, but i wanted to give a heads up:
  i'll be bringing down the entire jenkins infrastructure for reboots and
 system updates.

 please let me know if there are any conflicts with this, thanks!

 shane








Re: jenkins downtime: 730-930am, 12/12/14

2014-12-14 Thread shane knapp
josh rosen has this PR open to address the streaming test failures:

https://github.com/apache/spark/pull/3687

On Sun, Dec 14, 2014 at 8:21 AM, WangTaoTheTonic barneystin...@aliyun.com
wrote:

 Jenkins is still not available now as some unit tests(about streaming)
 failed
 all the time. Does it have something to do with this update?







Re: Archiving XML test reports for analysis

2014-12-15 Thread shane knapp
right now, the following logs are archived on to the master:

  local log_files=$(
find .\
  -name unit-tests.log -o\
  -path ./sql/hive/target/HiveCompatibilitySuite.failed -o\
  -path ./sql/hive/target/HiveCompatibilitySuite.hiveFailed -o\
  -path ./sql/hive/target/HiveCompatibilitySuite.wrong
  )

regarding dumping stuff to S3 -- thankfully, since we're not looking at a
lot of disk usage, i don't see a problem w/this.  we could tar/zip up the
XML for each build and just dump it there.
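
(something along these lines as a post-build step would probably do it --
just a sketch; the bucket name is made up and it assumes the aws cli is
available on the workers:)

snip
# sketch: bundle a build's junit xml and push it to a shared bucket.
# "spark-test-reports" is a placeholder bucket name.
tarball="${JOB_NAME}-${BUILD_NUMBER}-test-reports.tar.gz"
find . -path "*/test-reports/*.xml" -print0 | tar czf "$tarball" --null -T -
aws s3 cp "$tarball" "s3://spark-test-reports/${JOB_NAME}/${BUILD_NUMBER}/"
/snip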

what builds are we thinking about?  spark pull request builder?  what
others?

On Mon, Dec 15, 2014 at 1:33 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Every time we run a test cycle on our Jenkins cluster, we generate hundreds
 of XML reports covering all the tests we have (e.g.

 `streaming/target/test-reports/org.apache.spark.streaming.util.WriteAheadLogSuite.xml`).

 These reports contain interesting information about whether tests succeeded
 or failed, and how long they took to complete. There is also detailed
 information about the environment they ran in.

 It might be valuable to have a window into all these reports across all
 Jenkins builds and across all time, and use that to track basic statistics
 about our tests. That could give us basic insight into what tests are flaky
 or slow, and perhaps drive other improvements to our testing infrastructure
 that we can't see just yet.

 Do people think that would be valuable? Do we already have something like
 this?

 I'm thinking for starters it might be cool if we automatically uploaded all
 the XML test reports from the Master and the Pull Request builders to an S3
 bucket and just opened it up for the dev community to analyze.

 Nick



Re: Archiving XML test reports for analysis

2014-12-15 Thread shane knapp
i have no problem w/storing all of the logs.  :)

i also have no problem w/donated S3 buckets.  :)

On Mon, Dec 15, 2014 at 2:39 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 How about all of them https://amplab.cs.berkeley.edu/jenkins/view/Spark/? 
 How
 much data per day would it roughly be if we uploaded all the logs for all
 these builds?

 Also, would Databricks be willing to offer up an S3 bucket for this
 purpose?

 Nick

 On Mon Dec 15 2014 at 11:48:44 AM shane knapp skn...@berkeley.edu wrote:

 right now, the following logs are archived on to the master:

   local log_files=$(
 find .\
   -name unit-tests.log -o\
   -path ./sql/hive/target/HiveCompatibilitySuite.failed -o\
   -path ./sql/hive/target/HiveCompatibilitySuite.hiveFailed -o\
   -path ./sql/hive/target/HiveCompatibilitySuite.wrong
   )

 regarding dumping stuff to S3 -- thankfully, since we're not looking at a
 lot of disk usage, i don't see a problem w/this.  we could tar/zip up the
 XML for each build and just dump it there.

 what builds are we thinking about?  spark pull request builder?  what
 others?

 On Mon, Dec 15, 2014 at 1:33 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Every time we run a test cycle on our Jenkins cluster, we generate
 hundreds
 of XML reports covering all the tests we have (e.g.

 `streaming/target/test-reports/org.apache.spark.streaming.util.WriteAheadLogSuite.xml`).

 These reports contain interesting information about whether tests
 succeeded
 or failed, and how long they took to complete. There is also detailed
 information about the environment they ran in.

 It might be valuable to have a window into all these reports across all
 Jenkins builds and across all time, and use that to track basic
 statistics
 about our tests. That could give us basic insight into what tests are
 flaky
 or slow, and perhaps drive other improvements to our testing
 infrastructure
 that we can't see just yet.

 Do people think that would be valuable? Do we already have something like
 this?

 I'm thinking for starters it might be cool if we automatically uploaded
 all
 the XML test reports from the Master and the Pull Request builders to an
 S3
 bucket and just opened it up for the dev community to analyze.

 Nick




Re: Jenkins install reference

2015-02-03 Thread shane knapp
here's the wiki describing the system setup:
https://cwiki.apache.org/confluence/display/SPARK/Spark+QA+Infrastructure

we have 1 master and 8 worker nodes, 12 executors per worker (we'd be
better off w/more and smaller worker nodes however).

you don't need to install sbt -- it's in the build/ directory.

the pull request builder builds in parallel, but the master builds require
specific ports to be reserved and each build effectively locks down a
worker until it's done.  since we have 8 worker nodes, it's not *that* big
of a deal...
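
(concretely, on a bare machine all you need is a jdk and git -- the whole
cycle is roughly this, as a sketch:)

snip
# sketch: what a worker effectively runs; the sbt launcher is fetched
# automatically by the wrapper in build/, nothing else to install
git clone https://github.com/apache/spark.git && cd spark
./build/sbt clean package    # plain compile/package
./dev/run-tests              # the full test cycle (the PRB wraps this)
/snip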

shane

On Tue, Feb 3, 2015 at 4:36 AM, scwf wangf...@huawei.com wrote:

 Here are my questions:
 1. how do we set up jenkins so that it builds multiple PRs in parallel? or
 can one machine only support one PR build at a time?
 2. do we need to install sbt on the CI machine, since the script
 dev/run-tests will auto-fetch the sbt jar?

 - Fei



 On 2015/2/3 15:53, scwf wrote:

 Hi, all
we want to set up a CI env for spark in our team, is there any
 reference of how to install jenkins over spark?
Thanks

 Fei












Re: spark 1.3 sbt build seems to be broken

2015-02-05 Thread shane knapp
here's the hash of the breaking commit:

Started on Feb 5, 2015 12:01:01 PM
Using strategy: Default
[poll] Last Built Revision: Revision
de112a2096a2b84ce2cac112f12b50b5068d6c35
(refs/remotes/origin/branch-1.3)
  git ls-remote -h https://github.com/apache/spark.git branch-1.3 # timeout=10
[poll] Latest remote head revision is: fba2dc663a644cfe76a744b5cace93e9d6646a25
Done. Took 2.5 sec
Changes found


from:  https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/18/pollingLog/


On Thu, Feb 5, 2015 at 5:01 PM, shane knapp skn...@berkeley.edu wrote:

 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/

 we're seeing java OOMs and heap space errors:

 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/19/console

 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/18/console

 memory leak?  i checked the systems (ganglia + logging in and 'free -g')
 and there's nothing going on there.

 20 is building right now:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/20/console



spark 1.3 sbt build seems to be broken

2015-02-05 Thread shane knapp
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/

we're seeing java OOMs and heap space errors:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/19/console
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/18/console

memory leak?  i checked the systems (ganglia + logging in and 'free -g')
and there's nothing going on there.

20 is building right now:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-SBT/20/console
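
(if this ends up being plain jvm heap exhaustion rather than an actual
leak on the boxes, the usual knob is the sbt jvm options -- a sketch, the
values are guesses and this assumes the job launches sbt through a
wrapper that honors SBT_OPTS:)

snip
# sketch: give the sbt jvm more headroom (java 7 era flags, values TBD)
export SBT_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=256m"
/snip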


Re: quick jenkins restart tomorrow morning, ~7am PST

2015-02-18 Thread shane knapp
i'm actually going to do this now -- it's really quiet today.

there are two spark pull request builds running, which i will kill and
retrigger once jenkins is back up:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27689/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27690/

On Wed, Feb 18, 2015 at 12:55 PM, shane knapp skn...@berkeley.edu wrote:

 i'll be kicking jenkins to up the open file limits on the workers.  it
 should be a very short downtime, and i'll post updates on my progress
 tomorrow.

 shane



quick jenkins restart tomorrow morning, ~7am PST

2015-02-18 Thread shane knapp
i'll be kicking jenkins to up the open file limits on the workers.  it
should be a very short downtime, and i'll post updates on my progress
tomorrow.
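
(concretely, this is just bumping nofile for the jenkins user on each
worker -- a sketch, the actual number we settle on may differ:)

snip
# sketch: /etc/security/limits.d/99-jenkins.conf on each worker
# (takes effect on the next login / agent restart, hence the jenkins kick)
jenkins  soft  nofile  65536
jenkins  hard  nofile  65536

# verify from inside a build:
ulimit -n
/snip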

shane


Re: emergency jenkins restart soon

2015-01-29 Thread shane knapp
the master builds triggered around ~1am last night (according to the logs),
so it looks like we're back in business.

On Wed, Jan 28, 2015 at 10:32 PM, shane knapp skn...@berkeley.edu wrote:

 np!  the master builds haven't triggered yet, but let's give the rube
 goldberg machine a minute to get its bearings.

 On Wed, Jan 28, 2015 at 10:31 PM, Reynold Xin r...@databricks.com wrote:

 Thanks for doing that, Shane!


 On Wed, Jan 28, 2015 at 10:29 PM, shane knapp skn...@berkeley.edu
 wrote:

 jenkins is back up and all builds have been retriggered...  things are
 building and looking good, and i'll keep an eye on the spark master
 builds
 tonite and tomorrow.

 On Wed, Jan 28, 2015 at 9:56 PM, shane knapp skn...@berkeley.edu
 wrote:

  the spark master builds stopped triggering ~yesterday and the logs
 don't
  show anything.  i'm going to give the current batch of spark pull
 request
  builder jobs a little more time (~30 mins) to finish, then kill
 whatever is
  left and restart jenkins.  anything that was queued or killed will be
  retriggered once jenkins is back up.
 
  sorry for the inconvenience, we'll get this sorted asap.
 
  thanks,
 
  shane
 






Re: emergency jenkins restart soon

2015-01-28 Thread shane knapp
jenkins is back up and all builds have been retriggered...  things are
building and looking good, and i'll keep an eye on the spark master builds
tonite and tomorrow.

On Wed, Jan 28, 2015 at 9:56 PM, shane knapp skn...@berkeley.edu wrote:

 the spark master builds stopped triggering ~yesterday and the logs don't
 show anything.  i'm going to give the current batch of spark pull request
 builder jobs a little more time (~30 mins) to finish, then kill whatever is
 left and restart jenkins.  anything that was queued or killed will be
 retriggered once jenkins is back up.

 sorry for the inconvenience, we'll get this sorted asap.

 thanks,

 shane



Re: emergency jenkins restart soon

2015-01-28 Thread shane knapp
np!  the master builds haven't triggered yet, but let's give the rube
goldberg machine a minute to get its bearings.

On Wed, Jan 28, 2015 at 10:31 PM, Reynold Xin r...@databricks.com wrote:

 Thanks for doing that, Shane!


 On Wed, Jan 28, 2015 at 10:29 PM, shane knapp skn...@berkeley.edu wrote:

 jenkins is back up and all builds have been retriggered...  things are
 building and looking good, and i'll keep an eye on the spark master builds
 tonite and tomorrow.

 On Wed, Jan 28, 2015 at 9:56 PM, shane knapp skn...@berkeley.edu wrote:

  the spark master builds stopped triggering ~yesterday and the logs don't
  show anything.  i'm going to give the current batch of spark pull
 request
  builder jobs a little more time (~30 mins) to finish, then kill
 whatever is
  left and restart jenkins.  anything that was queued or killed will be
  retriggered once jenkins is back up.
 
  sorry for the inconvenience, we'll get this sorted asap.
 
  thanks,
 
  shane
 





adding some temporary jenkins worker nodes...

2015-02-09 Thread shane knapp
...to help w/the build backlog.  let's all welcome
amp-jenkins-slave-{01..03} back to the fray!


jenkins redirect down (but jenkins is up!), lots of potential

2015-01-05 Thread shane knapp
UC Berkeley had some major maintenance done this past weekend, and long
story short, not everything came back.  our primary webserver's NFS is down
and that means we're not serving websites, meaning that the redirect to
jenkins is failing.

jenkins is still up, and building some jobs, but we will probably see pull
request builder failures, and other transient issues.  SCM-polling builds
should be fine.

there is no ETA on when this will be fixed, but once our
amplab.cs.berkeley.edu/jenkins redir is working, i will let everyone know.
 i'm trying to get more status updates as they come.

i'm really sorry about the inconvenience.

shane


Re: jenkins redirect down (but jenkins is up!), lots of potential

2015-01-06 Thread shane knapp
the regular url is working now, thanks for your patience.

On Mon, Jan 5, 2015 at 2:25 PM, Josh Rosen rosenvi...@gmail.com wrote:

 The pull request builder and SCM-polling builds appear to be working fine,
 but the links in pull request comments won't work because the AMP Lab
 webserver is still down.  In the meantime, though, you can continue to
 access Jenkins through https://hadrian.ist.berkeley.edu/jenkins/

 On Mon, Jan 5, 2015 at 10:37 AM, shane knapp skn...@berkeley.edu wrote:

 UC Berkeley had some major maintenance done this past weekend, and long
 story short, not everything came back.  our primary webserver's NFS is
 down
 and that means we're not serving websites, meaning that the redirect to
 jenkins is failing.

 jenkins is still up, and building some jobs, but we will probably see pull
 request builder failures, and other transient issues.  SCM-polling builds
 should be fine.

 there is no ETA on when this will be fixed, but once our
 amplab.cs.berkeley.edu/jenkins redir is working, i will let everyone
 know.
  i'm trying to get more status updates as they come.

 i'm really sorry about the inconvenience.

 shane





Re: extended jenkins downtime monday, march 16th, plus some hints at the future

2015-03-16 Thread shane knapp
ok, we're back up and building.  upgrading the github plugin (and possibly
EnvInject) caused the stacktraces, so i've kept those at the old versions
that were working before.  jenkins and the rest of the plugins are updated
and we're g2g.

i'll be, of course, keeping an eye on things today and will squash anything
else that pops up.

On Mon, Mar 16, 2015 at 9:06 AM, shane knapp skn...@berkeley.edu wrote:

 looks like we're having some issues w/the pull request builder and cron
 stacktraces in the logs.  i'll be investigating further and will update
 when i figure out what's going on.

 On Mon, Mar 16, 2015 at 7:51 AM, shane knapp skn...@berkeley.edu wrote:

 this is starting now.

 On Fri, Mar 13, 2015 at 10:12 AM, shane knapp skn...@berkeley.edu
 wrote:

 i'll be taking jenkins down for some much-needed plugin updates, as well
 as potentially upgrading jenkins itself.

 this will start at 730am PDT, and i'm hoping to have everything up by
 noon.

 the move to the anaconda python will take place in the next couple of
 weeks as i'm in the process of rebuilding my staging environment (much
 needed) to better reflect production, and allow me to better test the
 change.

 and finally, some teasers for what's coming up in the next month or so:

 * move to a fully puppetized environment (yay no more shell script
 deployments!)
 * virtualized workers (including multiple OSes -- OS X, ubuntu, ...,
 profit?)

 more details as they come.

 happy friday!

 shane






Re: extended jenkins downtime monday, march 16th, plus some hints at the future

2015-03-16 Thread shane knapp
this is starting now.

On Fri, Mar 13, 2015 at 10:12 AM, shane knapp skn...@berkeley.edu wrote:

 i'll be taking jenkins down for some much-needed plugin updates, as well
 as potentially upgrading jenkins itself.

 this will start at 730am PDT, and i'm hoping to have everything up by noon.

 the move to the anaconda python will take place in the next couple of
 weeks as i'm in the process of rebuilding my staging environment (much
 needed) to better reflect production, and allow me to better test the
 change.

 and finally, some teasers for what's coming up in the next month or so:

 * move to a fully puppetized environment (yay no more shell script
 deployments!)
 * virtualized workers (including multiple OSes -- OS X, ubuntu, ...,
 profit?)

 more details as they come.

 happy friday!

 shane



extended jenkins downtime monday, march 16th, plus some hints at the future

2015-03-13 Thread shane knapp
i'll be taking jenkins down for some much-needed plugin updates, as well as
potentially upgrading jenkins itself.

this will start at 730am PDT, and i'm hoping to have everything up by noon.

the move to the anaconda python will take place in the next couple of weeks
as i'm in the process of rebuilding my staging environment (much needed) to
better reflect production, and allow me to better test the change.

and finally, some teasers for what's coming up in the next month or so:

* move to a fully puppetized environment (yay no more shell script
deployments!)
* virtualized workers (including multiple OSes -- OS X, ubuntu, ...,
profit?)

more details as they come.

happy friday!

shane


jenkins httpd being flaky

2015-03-13 Thread shane knapp
we just started having issues when visiting jenkins and getting 503 service
unavailable errors.

i'm on it and will report back with an all-clear.


Re: jenkins httpd being flaky

2015-03-13 Thread shane knapp
ok, things seem to have stabilized...  httpd hasn't flaked since ~noon, the
hanging PRB job on amp-jenkins-worker-06 was removed w/the restart and
things are now building.

i cancelled and retriggered a bunch of PRB builds, btw:
4848 (https://github.com/apache/spark/pull/3699)
5922 (https://github.com/apache/spark/pull/4733)
5987 (https://github.com/apache/spark/pull/4986)
6222 (https://github.com/apache/spark/pull/4964)
6325 (https://github.com/apache/spark/pull/5018)

as well as:
spark-master-maven-with-yarn

sorry for the inconvenience...  i'm still a little stumped as to what
happened, but i think it was a confluence of events (httpd flaking,
problems at github, mercury in retrograde, friday thinking it's monday).

shane

On Fri, Mar 13, 2015 at 1:08 PM, shane knapp skn...@berkeley.edu wrote:

 i tried a couple of things, but will also be doing a jenkins reboot as
 soon as the current batch of builds finish.



 On Fri, Mar 13, 2015 at 12:40 PM, shane knapp skn...@berkeley.edu wrote:

 ok we have a few different things happening:

 1) httpd on the jenkins master is randomly (though not currently) flaking
 out and causing visits to the site to return a 503.  nothing in the logs
 shows any problems.

 2) there are some github timeouts, which i tracked down and think it's a
 problem with github themselves (see:  https://status.github.com/ and
 scroll down to 'mean hook delivery time')

 3) we have one spark job w/a strange ivy lock issue, that i just
 retriggered (https://github.com/apache/spark/pull/4964)

 4) there's an errant, unkillable pull request builder job (
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28574/console
 )

 more updates forthcoming.

 On Fri, Mar 13, 2015 at 12:04 PM, shane knapp skn...@berkeley.edu
 wrote:

 we just started having issues when visiting jenkins and getting 503
 service unavailable errors.

 i'm on it and will report back with an all-clear.






Re: jenkins httpd being flaky

2015-03-13 Thread shane knapp
i tried a couple of things, but will also be doing a jenkins reboot as soon
as the current batch of builds finish.



On Fri, Mar 13, 2015 at 12:40 PM, shane knapp skn...@berkeley.edu wrote:

 ok we have a few different things happening:

 1) httpd on the jenkins master is randomly (though not currently) flaking
 out and causing visits to the site to return a 503.  nothing in the logs
 shows any problems.

 2) there are some github timeouts, which i tracked down and think it's a
 problem with github themselves (see:  https://status.github.com/ and
 scroll down to 'mean hook delivery time')

 3) we have one spark job w/a strange ivy lock issue, that i just
 retriggered (https://github.com/apache/spark/pull/4964)

 4) there's an errant, unkillable pull request builder job (
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28574/console
 )

 more updates forthcoming.

 On Fri, Mar 13, 2015 at 12:04 PM, shane knapp skn...@berkeley.edu wrote:

 we just started having issues when visiting jenkins and getting 503
 service unavailable errors.

 i'm on it and will report back with an all-clear.





Re: PR Builder timing out due to ivy cache lock

2015-03-13 Thread shane knapp
i'm thinking that this was something transient, and hopefully won't happen
again.  a ton of weird stuff happened around the time of this failure (see
my flaky httpd email), and this was the only build exhibiting this behavior.

i'll keep an eye out for this failure over the weekend...



On Fri, Mar 13, 2015 at 12:03 PM, Hari Shreedharan 
hshreedha...@cloudera.com wrote:

 Here you are:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28571/consoleFull

 On Fri, Mar 13, 2015 at 11:58 AM, shane knapp skn...@berkeley.edu wrote:

 link to a build, please?

 On Fri, Mar 13, 2015 at 11:53 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 Looks like something is causing the PR Builder to timeout since this
 morning with the ivy cache being locked.

 Any idea what is happening?






jenkins upgraded to 1.606....

2015-03-25 Thread shane knapp
...due to some big security fixes:

https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-03-23

:)

shane


short jenkins 7am downtime tomorrow morning (3-5-15)

2015-03-04 Thread shane knapp
the master and workers need some system and package updates, and i'll also
be rebooting the machines as well.

this shouldn't take very long to perform, and i expect jenkins to be back
up and building by 9am at the *latest*.

important note:  i will NOT be updating jenkins or any of the plugins
during this maintenance!

as always, please let me know if you have any questions or concerns.

danke shane


[jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread shane knapp
good morning, developers!

TL;DR:

i will be installing anaconda and setting it in the system PATH so that
your python will default to 2.7; anaconda will also take over management of
all of the sci-py packages.  this is potentially a big change, so i'll be
testing locally on my staging instance before deployment to the wide world.

deployment is *tentatively* next monday, march 2nd.

a little background:

the jenkins test infra is currently (and happily) managed by a set of tools
that allow me to set up and deploy new workers, manage their packages and
make sure that all spark and research projects can happily and successfully
build.

we're currently at the state where ~50 or so packages are installed and
configured on each worker.  this is getting a little cumbersome, as the
package-to-build dep tree is getting pretty large.

the biggest offender is the science-based python infrastructure.
 everything is blindly installed w/yum and pip, so it's hard to control
*exactly* what version of any given library is installed, as compared to what's on a
dev's laptop.

the solution:

anaconda (https://store.continuum.io/cshop/anaconda/)!  everything is
centralized!  i can manage specific versions much easier!

what this means to you:

* python 2.7 will be the default system python.
* 2.6 will still be installed and available (/usr/bin/python or
/usr/bin/python2.6)

what you need to do:
* install anaconda, have it update your PATH
* build locally and try to fix any bugs (for spark, this should just work)
* if you have problems, reach out to me and i'll see what i can do to help.
 if we can't get your stuff running under python2.7, we can default to 2.6
via a job config change.

what i will be doing:
* setting up anaconda on my staging instance and spot-testing a lot of
builds before deployment
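
for the curious, the switch should look roughly like this from a shell on a
worker or your laptop (the anaconda install location below is just an example
-- use wherever your installer actually put it):

# put anaconda first on your PATH (example location; adjust as needed)
export PATH="$HOME/anaconda/bin:$PATH"

# confirm which interpreter you now pick up -- it should report 2.7.x
which python
python -c 'import sys; print(sys.version)'

# the system 2.6 will still be around if a job really needs it
/usr/bin/python2.6 -V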

please let me know if there are any issues/concerns...  i'll be posting
updates this week and will let everyone know if there are any changes to
the Plan[tm].

your friendly devops engineer,

shane


Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread shane knapp
On Mon, Feb 23, 2015 at 11:36 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 The first concern for Spark will probably be to ensure that we still build
 and test against Python 2.6, since that's the minimum version of Python we
 support.

sounds good...  we can set up separate 2.6 builds on specific versions...
this could allow you to easily differentiate between "baseline" and "latest
and greatest" if you wanted.  it'll have a little bit more administrative
overhead, due to more jobs needing configs, but offers more flexibility.

let me know what you think.


 Otherwise this seems OK. We use numpy and other Python packages in
 PySpark, but I don't think we're pinned to any particular version of those
 packages.

cool.  i'll start mucking about and let you guys know how it goes.

shane


Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-25 Thread shane knapp
i'm going to punt on this until after the next spark 1.3 release (2-3
weeks?).  since i'll be installing a bunch of other packages (including
mongodb), i'd rather wait and be safe.  :)

the full install list is forthcoming, and i'll update the spark infra wiki
w/what's installed on the workers.

shane

On Mon, Feb 23, 2015 at 11:13 AM, shane knapp skn...@berkeley.edu wrote:

 good morning, developers!

 TL;DR:

 i will be installing anaconda and setting it in the system PATH so that
 your python will default to 2.7, as well as it taking over management of
 all of the sci-py packages.  this is potentially a big change, so i'll be
 testing locally on my staging instance before deployment to the wide world.

 deployment is *tentatively* next monday, march 2nd.

 a little background:

 the jenkins test infra is currently (and happily) managed by a set of
 tools that allow me to set up and deploy new workers, manage their packages
 and make sure that all spark and research projects can happily and
 successfully build.

 we're currently at the state where ~50 or so packages are installed and
 configured on each worker.  this is getting a little cumbersome, as the
 package-to-build dep tree is getting pretty large.

 the biggest offender is the science-based python infrastructure.
  everything is blindly installed w/yum and pip, so it's hard to control
 *exactly* what version of any given library is as compared to what's on a
 dev's laptop.

 the solution:

 anaconda (https://store.continuum.io/cshop/anaconda/)!  everything is
 centralized!  i can manage specific versions much easier!

 what this means to you:

 * python 2.7 will be the default system python.
 * 2.6 will still be installed and available (/usr/bin/python or
 /usr/bin/python2.6)

 what you need to do:
 * install anaconda, have it update your PATH
 * build locally and try to fix any bugs (for spark, this should just
 work)
 * if you have problems, reach out to me and i'll see what i can do to
 help.  if we can't get your stuff running under python2.7, we can default
 to 2.6 via a job config change.

 what i will be doing:
 * setting up anaconda on my staging instance and spot-testing a lot of
 builds before deployment

 please let me know if there are any issues/concerns...  i'll be posting
 updates this week and will let everyone know if there are any changes to
 the Plan[tm].

 your friendly devops engineer,

 shane



Re: [ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread shane knapp
it's not downgraded, it's your /etc/alternatives setup that's causing this.

you can update all of those entries by executing the following commands (as
root):

update-alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1
update-alternatives --install /usr/bin/javah javah /usr/java/latest/bin/javah 1
update-alternatives --install /usr/bin/javac javac /usr/java/latest/bin/javac 1
update-alternatives --install /usr/bin/jar jar /usr/java/latest/bin/jar 1

(i have the latest jdk installed in /usr/java/ with a /usr/java/latest/
symlink pointing to said jdk's dir)
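
once those are in place, a quick sanity check should show java and javac
agreeing on the major version (exact output will vary by jdk, but something
like):

java -version
javac -version

# shows which binary the javac alternative currently resolves to
update-alternatives --display javac

if javac still reports 1.6.x after this, some other entry in /etc/alternatives
is winning and needs the same treatment.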

On Tue, Feb 24, 2015 at 3:32 PM, Mike Hynes 91m...@gmail.com wrote:

 I don't see any version flag for /usr/bin/jar, but I think I see the
 problem now; the openjdk version is 7, but javac -version gives
 1.6.0_34; so spark was compiled with java 6 despite the system using
 jre 1.7.
 Thanks for the sanity check! Now I just need to find out why javac is
 downgraded on the system..

 On 2/24/15, Sean Owen so...@cloudera.com wrote:
  So you mean that the script is checking for this error, and takes it
  as a sign that you compiled with java 6.
 
  Your command seems to confirm that reading the assembly jar does fail
  on your system though. What version does the jar command show? are you
  sure you don't have JRE 7 but JDK 6 installed?
 
  On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes 91m...@gmail.com wrote:
  ./bin/compute-classpath.sh fails with error:
 
  > jar -tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar nonexistent/class/path
  java.util.zip.ZipException: invalid CEN header (bad signature)
  at java.util.zip.ZipFile.open(Native Method)
  at java.util.zip.ZipFile.<init>(ZipFile.java:132)
  at java.util.zip.ZipFile.<init>(ZipFile.java:93)
  at sun.tools.jar.Main.list(Main.java:997)
  at sun.tools.jar.Main.run(Main.java:242)
  at sun.tools.jar.Main.main(Main.java:1167)
 
  However, I both compiled the distribution and am running spark with
Java
  1.7;
  $ java -version
  java version 1.7.0_75
  OpenJDK Runtime Environment (IcedTea 2.5.4)
  (7u75-2.5.4-1~trusty1)
  OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
  on a system running Ubuntu:
  $ uname -srpov
  Linux 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
  x86_64 GNU/Linux
  $ uname -srpo
  Linux 3.13.0-44-generic x86_64 GNU/Linux
 
  This problem was reproduced on Arch Linux:
 
  $ uname -srpo
  Linux 3.18.5-1-ARCH x86_64 GNU/Linux
  with
  $ java -version
  java version 1.7.0_75
  OpenJDK Runtime Environment (IcedTea 2.5.4) (Arch Linux build
  7.u75_2.5.4-1-x86_64)
  OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
 
  In both of these cases, the problem is not the java versioning;
  neither system even has a java 6 installation. This seems like a false
  positive to me in compute-classpath.sh.
 
  When I comment out the relevant lines in compute-classpath.sh, the
  scripts start-{master,slaves,...}.sh all run fine, and I have no
  problem launching applications.
 
  Could someone please offer some insight into this issue?
 
  Thanks,
  Mike
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 


 --
 Thanks,
 Mike

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



Jenkins down

2015-04-24 Thread shane knapp
jenkins is currently unreachable.  i'm not entirely sure why, as i can't
ssh in to the box and see what's going on.  i've filed a ticket and will
let everyone know when i have more information.

shane


Re: Jenkins down

2015-04-24 Thread shane knapp
looks like we had a power failure on campus, and our datacenter is working
to bring things back up:

http://systemstatus.berkeley.edu/

On Fri, Apr 24, 2015 at 11:24 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is currently unreachable.  i'm not entirely sure why, as i can't
 ssh in to the box and see what's going on.  i've filed a ticket and will
 let everyone know when i have more information.

 shane



Re: Jenkins down

2015-04-24 Thread shane knapp
thanks everyone!  happy friday!  :)

On Fri, Apr 24, 2015 at 3:37 PM, York, Brennon brennon.y...@capitalone.com
wrote:

 Ditto to Reynold. Thanks a bunch for all the updates and work Shane!

 On 4/24/15, 3:25 PM, Reynold Xin r...@databricks.com wrote:

 Thanks for looking into this, Shane.
 
 On Fri, Apr 24, 2015 at 3:18 PM, shane knapp skn...@berkeley.edu wrote:
 
  ok, jenkins is back up and building.  we have a few things to mop up
 here
  (ganglia is sad), but i think we'll be good for the afternoon.
 
  shane
 
  On Fri, Apr 24, 2015 at 2:17 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   ok, power has been restored and jenkins is back up.  we might be
 taking
   things down again to fix up some power mis-cabling (jon and i are in
 the
   colo, and the jenkins master wasn't on the UPS and needs to be).
  
   more updates as they come.  sorry for the inconvenience.
  
   On Fri, Apr 24, 2015 at 11:33 AM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like we had a power failure on campus, and our datacenter is
   working to bring things back up:
  
   http://systemstatus.berkeley.edu/
  
   On Fri, Apr 24, 2015 at 11:24 AM, shane knapp skn...@berkeley.edu
   wrote:
  
   jenkins is currently unreachable.  i'm not entirely sure why, as i
  can't
   ssh in to the box and see what's going on.  i've filed a ticket and
  will
   let everyone know when i have more information.
  
   shane
  
  
  
  
 

 




Re: Jenkins down

2015-04-24 Thread shane knapp
ok, power has been restored and jenkins is back up.  we might be taking
things down again to fix up some power mis-cabling (jon and i are in the
colo, and the jenkins master wasn't on the UPS and needs to be).

more updates as they come.  sorry for the inconvenience.

On Fri, Apr 24, 2015 at 11:33 AM, shane knapp skn...@berkeley.edu wrote:

 looks like we had a power failure on campus, and our datacenter is working
 to bring things back up:

 http://systemstatus.berkeley.edu/

 On Fri, Apr 24, 2015 at 11:24 AM, shane knapp skn...@berkeley.edu wrote:

 jenkins is currently unreachable.  i'm not entirely sure why, as i can't
 ssh in to the box and see what's going on.  i've filed a ticket and will
 let everyone know when i have more information.

 shane





Re: [discuss] ending support for Java 6?

2015-04-30 Thread shane knapp
something to keep in mind:  we can easily support java 6 for the build
environment, particularly if there's a definite EOL.

i'd like to fix our java versioning 'problem', and this could be a big
instigator...  right now we're hackily setting java_home in test invocation
on jenkins, which really isn't the best.  if i decide, within jenkins, to
reconfigure every build to 'do the right thing' WRT java version, then i
will clean up the old mess and pay down on some technical debt.

or i can just install java 6 and we use that as JAVA_HOME on a
build-by-build basis.

this will be a few days of prep and another morning-long downtime if i do
the right thing (within jenkins), and only a couple of hours the hacky way
(system level).

either way, we can test on java 6.  :)
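
for the hacky (system level) version, a build's shell step would just pick its
own jdk before doing anything else -- a rough sketch, with a hypothetical jdk 6
install path (the real location would be documented per worker):

# hypothetical java 6 location; adjust to wherever it actually lands
export JAVA_HOME=/usr/java/jdk1.6.0_45
export PATH="$JAVA_HOME/bin:$PATH"

java -version    # should now report 1.6.x for this build only
# ...then kick off the usual test entry point, e.g. ./dev/run-tests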

On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:

 nicholas started it! :)

 for java 6 i would have said the same thing about 1 year ago: it is foolish
 to drop it. but i think the time is right about now.
 about half our clients are on java 7 and the other half have active plans
 to migrate to it within 6 months.

 On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

  Guys thanks for chiming in, but please focus on Java here. Python is an
  entirely separate issue.
 
 
  On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
  i am not sure eol means much if it is still actively used. we have a lot
  of clients with centos 5 (for which we still support python 2.4 in some
  form or another, fun!). most of them are on centos 6, which means python
  2.6. by cutting out python 2.6 you would cut out the majority of the
 actual
  clusters i am aware of. unless your intention is to truly make something
  academic i don't think that is wise.
 
  On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  (On that note, I think Python 2.6 should be next on the chopping block
  sometime later this year, but that’s for another thread.)
 
  (To continue the parenthetical, Python 2.6 was in fact EOL-ed in
 October
  of
  2013. https://www.python.org/download/releases/2.6.9/)
  ​
 
  On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
  nicholas.cham...@gmail.com
  wrote:
 
   I understand the concern about cutting out users who still use Java
 6,
  and
   I don't have numbers about how many people are still using Java 6.
  
   But I want to say at a high level that I support deprecating older
   versions of stuff to reduce our maintenance burden and let us use
 more
   modern patterns in our code.
  
   Maintenance always costs way more than initial development over the
   lifetime of a project, and for that reason anti-support is just as
   important as support.
  
   (On that note, I think Python 2.6 should be next on the chopping
 block
   sometime later this year, but that's for another thread.)
  
   Nick
  
  
   On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
  wrote:
  
   This has been discussed a few times in the past, but now Oracle has
  ended
   support for Java 6 for over a year, I wonder if we should just drop
  Java 6
   support.
  
   There is one outstanding issue Tom has brought to my attention:
  PySpark on
   YARN doesn't work well with Java 7/8, but we have an outstanding
 pull
   request to fix that.
  
   https://issues.apache.org/jira/browse/SPARK-6869
   https://issues.apache.org/jira/browse/SPARK-1920
  
  
 
 
 
 



Re: [discuss] ending support for Java 6?

2015-05-04 Thread shane knapp
...and now the workers all have java6 installed.

https://issues.apache.org/jira/browse/SPARK-1437

sadly, the built-in jenkins jdk management doesn't allow us to choose a JDK
version within matrix projects...  so we need to manage this stuff
manually.

On Sun, May 3, 2015 at 8:57 AM, shane knapp skn...@berkeley.edu wrote:

 that bug predates my time at the amplab...  :)

 anyways, just to restate: jenkins currently only builds w/java 7.  if you
 folks need 6, i can make it happen, but it will be a (smallish) bit of work.

 shane

 On Sun, May 3, 2015 at 2:14 AM, Sean Owen so...@cloudera.com wrote:

 Should be, but isn't what Jenkins does.
 https://issues.apache.org/jira/browse/SPARK-1437

 At this point it might be simpler to just decide that 1.5 will require
 Java 7 and then the Jenkins setup is correct.

 (NB: you can also solve this by setting bootclasspath to JDK 6 libs
 even when using javac 7+ but I think this is overly complicated.)

 On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
  Hi Shane,
 
Since we are still maintaining support for jdk6, jenkins should be
  using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher
  api which breaks source level compat.
  -source and -target is insufficient to ensure api usage is conformant
  with the minimum jdk version we are supporting.
 
  Regards,
  Mridul
 
  [1] Not jdk7 as you mentioned
 
  On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu
 wrote:
  that's kinda what we're doing right now, java 7 is the
 default/standard on
  our jenkins.
 
  or, i vote we buy a butler's outfit for thomas and have a second
 jenkins
  instance...  ;)
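
(ps, on the -source/-target point above: those flags only pin the language
level and bytecode version.  without a jdk 6 rt.jar on the bootclasspath, code
that calls jdk 7-only library APIs still compiles cleanly and only fails at
runtime on java 6.  the bootclasspath trick sean mentions looks roughly like
this, with a hypothetical jdk 6 path and Foo.java standing in for whatever is
being compiled:

javac -source 1.6 -target 1.6 \
      -bootclasspath /usr/java/jdk1.6.0_45/jre/lib/rt.jar \
      Foo.java
)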





Re: [discuss] ending support for Java 6?

2015-05-04 Thread shane knapp
sgtm

On Mon, May 4, 2015 at 11:23 AM, Patrick Wendell pwend...@gmail.com wrote:

 If we just set JAVA_HOME in dev/run-tests-jenkins, I think it should work.

 On Mon, May 4, 2015 at 7:20 PM, shane knapp skn...@berkeley.edu wrote:
  ...and now the workers all have java6 installed.
 
  https://issues.apache.org/jira/browse/SPARK-1437
 
  sadly, the built-in jenkins jdk management doesn't allow us to choose a
 JDK
  version within matrix projects...  so we need to manage this stuff
  manually.
 
  On Sun, May 3, 2015 at 8:57 AM, shane knapp skn...@berkeley.edu wrote:
 
  that bug predates my time at the amplab...  :)
 
  anyways, just to restate: jenkins currently only builds w/java 7.  if
 you
  folks need 6, i can make it happen, but it will be a (smallish) bit of
 work.
 
  shane
 
  On Sun, May 3, 2015 at 2:14 AM, Sean Owen so...@cloudera.com wrote:
 
  Should be, but isn't what Jenkins does.
  https://issues.apache.org/jira/browse/SPARK-1437
 
  At this point it might be simpler to just decide that 1.5 will require
  Java 7 and then the Jenkins setup is correct.
 
  (NB: you can also solve this by setting bootclasspath to JDK 6 libs
  even when using javac 7+ but I think this is overly complicated.)
 
  On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com
  wrote:
   Hi Shane,
  
 Since we are still maintaining support for jdk6, jenkins should be
   using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher
   api which breaks source level compat.
   -source and -target is insufficient to ensure api usage is conformant
   with the minimum jdk version we are supporting.
  
   Regards,
   Mridul
  
   [1] Not jdk7 as you mentioned
  
   On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu
  wrote:
   that's kinda what we're doing right now, java 7 is the
  default/standard on
   our jenkins.
  
   or, i vote we buy a butler's outfit for thomas and have a second
  jenkins
   instance...  ;)
 
 
 



Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
sure, i'll kill all of the current spark prb builds...

On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin r...@databricks.com wrote:

 Shane - can we purge all the outstanding builds so we are not running
 stuff against stale PRs?


 On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 And unfortunately, many Jenkins executor slots are being taken by stale
 Spark PRs...

 On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote:

  anyways, the build queue is SLAMMED...  we're going to need at least a
 day
  to catch up w/this.  i'll be keeping an eye on system loads and whatnot
 all
  day today.
 
  whee!
 
  On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu
 wrote:
 
   somehow, the power outage on friday caused the pull request builder to
    lose its config entirely...  i'm not sure why, but after i added the
  oauth
   token back, we're now catching up on the weekend's pull request
 builds.
  
   have i mentioned how much i hate this plugin?  ;)
  
   sorry for the inconvenience...
  
   shane
  
 





Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
never mind, looks like you guys are already on it.  :)

On Mon, Apr 27, 2015 at 11:35 AM, shane knapp skn...@berkeley.edu wrote:

 sure, i'll kill all of the current spark prb build...

 On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin r...@databricks.com wrote:

 Shane - can we purge all the outstanding builds so we are not running
 stuff against stale PRs?


 On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 And unfortunately, many Jenkins executor slots are being taken by stale
 Spark PRs...

 On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote:

  anyways, the build queue is SLAMMED...  we're going to need at least a
 day
  to catch up w/this.  i'll be keeping an eye on system loads and
 whatnot all
  day today.
 
  whee!
 
  On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu
 wrote:
 
   somehow, the power outage on friday caused the pull request builder
 to
    lose its config entirely...  i'm not sure why, but after i added the
  oauth
   token back, we're now catching up on the weekend's pull request
 builds.
  
   have i mentioned how much i hate this plugin?  ;)
  
   sorry for the inconvenience...
  
   shane
  
 






github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
somehow, the power outage on friday caused the pull request builder to lose
its config entirely...  i'm not sure why, but after i added the oauth
token back, we're now catching up on the weekend's pull request builds.

have i mentioned how much i hate this plugin?  ;)

sorry for the inconvenience...

shane


Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
anyways, the build queue is SLAMMED...  we're going to need at least a day
to catch up w/this.  i'll be keeping an eye on system loads and whatnot all
day today.

whee!

On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu wrote:

 somehow, the power outage on friday caused the pull request builder to
 lose its config entirely...  i'm not sure why, but after i added the oauth
 token back, we're now catching up on the weekend's pull request builds.

 have i mentioned how much i hate this plugin?  ;)

 sorry for the inconvenience...

 shane


