Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
i've seen a few more builds fail w/timeouts, and it appears that we're definitely NOT hitting any rate limiting. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22005/console

[jenkins@amp-jenkins-slave-01 ~]$ curl -i -H "Authorization: token REDACTED" https://api.github.com | grep Rate
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4997
X-RateLimit-Reset: 1413929848
Access-Control-Expose-Headers: ETag, Link, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval

On Sat, Oct 18, 2014 at 12:44 AM, Davies Liu dav...@databricks.com wrote:
Cool, the recent 4 builds have used the new configs, thanks! Let's run more builds.
Davies
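As a side note, the authenticated check above can be wrapped in a small helper so a job could log how close it is to the limit. This is only a sketch, not part of any existing Jenkins job; `rate_remaining` is an illustrative name, and it simply parses a saved `curl -i` response rather than hitting the API itself:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a `curl -i` response on stdin and print the
# value of the X-RateLimit-Remaining header. Name and usage are made up
# for illustration; nothing like this exists in the current jobs.
rate_remaining() {
  awk -F': ' '/^X-RateLimit-Remaining/ { gsub(/\r/, "", $2); print $2 }'
}

# usage: curl -si -H "Authorization: token $GITHUB_TOKEN" https://api.github.com | rate_remaining
```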
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
How can we know the changes have been applied? I checked several recent builds, and they all use the original configs.

Davies

On Fri, Oct 17, 2014 at 6:17 PM, Josh Rosen rosenvi...@gmail.com wrote:
FYI, I edited the Spark Pull Request Builder job to try this out. Let’s see if it works (I’ll be around to revert if it doesn’t).
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
I think that the fix was applied. Take a look at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull

Here, I see a fetch command that mentions this specific PR branch rather than the wildcard that we had before:

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/2840/*:refs/remotes/origin/pr/2840/* # timeout=15

Do you have an example of a Spark PRB build that’s still failing with the old fetch failure?

- Josh

On October 17, 2014 at 11:03:14 PM, Davies Liu (dav...@databricks.com) wrote:
How can we know the changes have been applied? I checked several recent builds, and they all use the original configs.
Davies
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
Cool, the recent 4 builds have used the new configs, thanks! Let's run more builds.

Davies

On Fri, Oct 17, 2014 at 11:06 PM, Josh Rosen rosenvi...@gmail.com wrote:
I think that the fix was applied. Take a look at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull
Here, I see a fetch command that mentions this specific PR branch rather than the wildcard that we had before:
git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/2840/*:refs/remotes/origin/pr/2840/* # timeout=15
Do you have an example of a Spark PRB build that’s still failing with the old fetch failure?
- Josh
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
One finding is that all the timeouts happened with this command:

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/*:refs/remotes/origin/pr/*

I'm thinking that this may be an expensive call; we could try a cheaper one:

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/XXX/*:refs/remotes/origin/pr/XXX/*

XXX is the pull request ID. The configuration supports parameters [1], so we could put this in:

+refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*

I have not tested this yet -- could you give it a try?

Davies

[1] https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin

On Fri, Oct 17, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote:
actually, nvm, you have to run that command from our servers to affect our limit. run it all you want from your own machines! :P

On Fri, Oct 17, 2014 at 4:59 PM, shane knapp skn...@berkeley.edu wrote:
yep, and i will tell you guys ONLY if you promise to NOT try this yourselves... checking the rate limit also counts as a hit and increments our numbers:

# curl -i https://api.github.com/users/whatever 2>/dev/null | egrep '^X-Rate'
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 51
X-RateLimit-Reset: 1413590269

(yes, that is the exact url that they recommended on the github site lol)

so, earlier today, we had a spark build fail w/a git timeout at 10:57am, but there were only ~7 builds run that hour, so that points to us NOT hitting the rate limit... at least for this fail. whee! is it beer-thirty yet?

shane

On Fri, Oct 17, 2014 at 4:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Wow, thanks for this deep dive Shane. Is there a way to check if we are getting hit by rate limiting directly, or do we need to contact GitHub for that?

On Friday, October 17, 2014, shane knapp skn...@berkeley.edu wrote:
quick update: here are some stats i scraped over the past week of ALL pull request builder projects and timeout failures. due to the large number of spark ghprb jobs, i don't have great records earlier than oct 7th. the data is current up until ~2:30pm today.

spark and new spark ghprb total builds vs git fetch timeouts:

$ for x in 10-{09..17}; do
    passed=$(grep $x SORTED.passed | grep -i spark | wc -l)
    failed=$(grep $x SORTED | grep -i spark | wc -l)
    let total=passed+failed
    fail_percent=$(echo "scale=2; $failed/$total" | bc | sed 's/^\.//g')
    echo -e "$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%"
  done
10-09 -- total builds: 140  p/f: 92/48  fail%: 34%
10-10 -- total builds: 65   p/f: 59/6   fail%: 09%
10-11 -- total builds: 29   p/f: 29/0   fail%: 0%
10-12 -- total builds: 24   p/f: 21/3   fail%: 12%
10-13 -- total builds: 39   p/f: 35/4   fail%: 10%
10-14 -- total builds: 7    p/f: 5/2    fail%: 28%
10-15 -- total builds: 37   p/f: 34/3   fail%: 08%
10-16 -- total builds: 71   p/f: 59/12  fail%: 16%
10-17 -- total builds: 26   p/f: 20/6   fail%: 23%

all other ghprb builds vs git fetch timeouts:

$ for x in 10-{09..17}; do
    passed=$(grep $x SORTED.passed | grep -vi spark | wc -l)
    failed=$(grep $x SORTED | grep -vi spark | wc -l)
    let total=passed+failed
    fail_percent=$(echo "scale=2; $failed/$total" | bc | sed 's/^\.//g')
    echo -e "$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%"
  done
10-09 -- total builds: 16   p/f: 16/0   fail%: 0%
10-10 -- total builds: 46   p/f: 40/6   fail%: 13%
10-11 -- total builds: 4    p/f: 4/0    fail%: 0%
10-12 -- total builds: 2    p/f: 2/0    fail%: 0%
10-13 -- total builds: 2    p/f: 2/0    fail%: 0%
10-14 -- total builds: 10   p/f: 10/0   fail%: 0%
10-15 -- total builds: 5    p/f: 5/0    fail%: 0%
10-16 -- total builds: 5    p/f: 5/0    fail%: 0%
10-17 -- total builds: 0    p/f: 0/0    fail%: 0%

note: the 15th was the day i rolled back to the earlier version of the git plugin. it doesn't seem to have helped much, so i'll probably bring us back up to the latest version soon. also note: rocking some floating point math on the CLI! ;)

i also compared the distribution of git timeout failures vs time of day, and there appears to be no correlation: the failures are pretty evenly distributed over each hour of the day. we could be hitting the rate limit due to the ghprb hitting github a couple of times for each build, but we're averaging ~10-20 builds per hour (a build hits github 2-4 times, from what i can tell). i'll have to look more in to this on monday, but suffice to say we may need to move from unauthorized https fetches to authorized requests. this means retrofitting all of our jobs. yay! fun! :)

another option is to have local mirrors of all of the repos. the problem w/this is that there might be a window where changes haven't made it to the local mirror and tests run against it. more fun stuff to think about...

now that i have some stats, and a list of all of the times/dates of the failures, i will be drafting my email to github.
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
FYI, I edited the Spark Pull Request Builder job to try this out. Let’s see if it works (I’ll be around to revert if it doesn’t).

On October 17, 2014 at 5:26:56 PM, Davies Liu (dav...@databricks.com) wrote:
One finding is that all the timeouts happened with this command:

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/*:refs/remotes/origin/pr/*

This may be an expensive call; we could try a cheaper one:

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/XXX/*:refs/remotes/origin/pr/XXX/*

XXX is the pull request ID. The configuration supports parameters [1], so we could put this in:

+refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*

[1] https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin
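The per-PR refspec substitution Davies describes can be sketched as a one-liner helper; `pr_refspec` is a hypothetical name introduced for illustration, standing in for the `${ghprbPullId}` substitution the plugin would perform:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the narrow per-PR refspec from a pull request
# id, instead of the +refs/pull/* wildcard that fetches every PR ref.
pr_refspec() {
  local pr_id=$1
  echo "+refs/pull/${pr_id}/*:refs/remotes/origin/pr/${pr_id}/*"
}

# e.g. git fetch --tags --progress https://github.com/apache/spark.git "$(pr_refspec 2840)"
```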
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
the bad news is that we've had a couple more failures due to timeouts, but the good news is that the frequency at which they happen has decreased significantly (3 in the past ~18hr). it seems like the git plugin downgrade has relieved the problem but hasn't fixed it. i'll be looking in to this more today.

On Wed, Oct 15, 2014 at 7:05 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
A quick scan through the Spark PR board https://spark-prs.appspot.com/ shows no recent failures related to this git checkout problem. Looks promising!
Nick

On Wed, Oct 15, 2014 at 6:10 PM, shane knapp skn...@berkeley.edu wrote:
ok, we've had about 10 spark pull request builds go through w/o any git timeouts. it seems that the git timeout issue might be licked. i will definitely be keeping an eye on this for the next few days. thanks for being patient!
shane

On Wed, Oct 15, 2014 at 2:27 PM, shane knapp skn...@berkeley.edu wrote:
four builds triggered and no timeouts. :crossestoes: :)

On Wed, Oct 15, 2014 at 2:19 PM, shane knapp skn...@berkeley.edu wrote:
ok, we're up and building... :crossesfingersfortheumpteenthtime:

On Wed, Oct 15, 2014 at 1:59 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
I support this effort. :thumbsup:

On Wed, Oct 15, 2014 at 4:52 PM, shane knapp skn...@berkeley.edu wrote:
i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to see if that helps w/the git fetch timeouts. this will require a short downtime (~20 mins for builds to finish, ~20 mins to downgrade), and will hopefully give us some insight in to wtf is going on. thanks for your patience...
shane

--
You received this message because you are subscribed to the Google Groups amp-infra group. To unsubscribe from this group and stop receiving emails from it, send an email to amp-infra+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
Thanks for continuing to look into this, Shane. One suggestion that Patrick brought up, if we have trouble getting to the bottom of this, is doing the git checkout ourselves in the run-tests-jenkins script and cutting out the Jenkins git plugin entirely. That way we can script retries ourselves and post friendlier messages about timeouts if they still occur. Do you think that's worth trying at some point? Nick
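For the sake of discussion, a hand-rolled checkout with scripted retries could look roughly like the sketch below. It combines Nick's retry idea with the PR-specific refspec Davies proposed elsewhere in this thread. The retry counts, timeouts, and the guard around ${ghprbPullId} (the pull request builder plugin's build parameter) are illustrative assumptions, not the actual run-tests-jenkins code.

```shell
#!/usr/bin/env bash
# Sketch only: scripted git fetch with retries, as an alternative to the
# Jenkins git plugin's built-in checkout. Numbers below are made up.

# Retry a command up to N times, sleeping between attempts.
retry() {
  local attempts=$1; shift
  local delay="${RETRY_DELAY:-15}"
  local n
  for ((n = 1; n <= attempts; n++)); do
    "$@" && return 0
    echo "attempt $n/$attempts failed: $*" >&2
    (( n < attempts )) && sleep "$delay"
  done
  echo "giving up after $attempts attempts: $*" >&2
  return 1
}

# Fetch only this PR's refs (cheaper than the +refs/pull/* wildcard),
# with a per-attempt timeout, retrying on transient failures.
if [ -n "${ghprbPullId:-}" ]; then
  retry 3 timeout 300 git fetch --tags --progress \
    https://github.com/apache/spark.git \
    "+refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*"
fi
```

One upside of doing it this way is that the failure message is ours to write: on the final failed attempt we could post a friendly "transient GitHub timeout, retriggering" comment instead of the plugin's opaque stack trace.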
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
yeah, at this point it might be worth trying. :) the absolutely irritating thing is that i am not seeing this happen w/any jobs other than the spark prb, nor does it seem to correlate w/time of day, network or system load, or which slave it runs on. nor are we hitting our limit of connections on github. i really, truly hate non-deterministic failures. i'm also going to write an email to support@github and see if they have any insight into this as well.
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
On Thu, Oct 16, 2014 at 3:55 PM, shane knapp skn...@berkeley.edu wrote: i really, truly hate non-deterministic failures. Amen bruddah.
short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
i'm going to be downgrading our git plugin (from 2.2.7 to 2.2.2) to see if that helps w/the git fetch timeouts. this will require a short downtime (~20 mins for builds to finish, ~20 mins to downgrade), and will hopefully give us some insight into wtf is going on. thanks for your patience... shane
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
I support this effort. :thumbsup:
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
ok, we're up and building... :crossesfingersfortheumpteenthtime:
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
four builds triggered and no timeouts. :crossestoes: :)
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
ok, we've had about 10 spark pull request builds go through w/o any git timeouts. it seems that the git timeout issue might be licked. i will definitely be keeping an eye on this for the next few days. thanks for being patient! shane
Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts
A quick scan through the Spark PR board https://spark-prs.appspot.com/ shows no recent failures related to this git checkout problem. Looks promising! Nick