Re: SSH slave performance degradation
Cf.: https://issues.jenkins-ci.org/browse/JENKINS-20108 -- You received this message because you are subscribed to the Google Groups Jenkins Developers group. To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: SSH slave performance degradation
I thought that a common default on Linux was for /dev/random to block if the pool of random data was emptied. Refer to http://en.wikipedia.org/?title=/dev/random for a description.

I thought that /dev/urandom did not block if the pool of random data was emptied. That same article describes the differences between the two.

I've seen cases with some versions of Java and some Linux variants where Java performance suffered badly when I had emptied the pool of random data. I think that is why Stephen recommends using /dev/urandom, so that your program won't block while waiting for random data.

Mark Waite

On Tue, Jul 29, 2014 at 10:44 PM, Dean Yu dean...@gmail.com wrote:

Obviously, going from 1.509.4 to 1.554.3 is a pretty big jump that included lots and lots of changes. However, the fact that the singular act of downgrading that library got us back to our prior build times is a big smoking gun to me.

> I wonder if something changed upstream...

From the upstream release notes:

build217, 2013-06-03:
- Support for SSH agent based authentication.

build216, 2013-03-04:
- Support of unencrypted entries in the known_hosts file.
- Improved timeout handling.

> BTW you are using /dev/./urandom as an entropy source for the JVM?

Nope. Should we?

-- Dean

From: Stephen Connolly stephen.alan.conno...@gmail.com
Reply-To: jenkinsci-dev@googlegroups.com
Date: Tuesday, July 29, 2014 at 2:16 PM
To: jenkinsci-dev@googlegroups.com
Subject: Re: SSH slave performance degradation

* KK's changes to window sizes should have *increased* performance
* My connection bug fixes were surgical IIRC
* Nicolas's merge of upstream seems to include an EOL change, so hard to see what changed there with the Github diff tool: https://github.com/jenkinsci/trilead-ssh2/compare/trilead-ssh2-build214-jenkins-3...trilead-ssh2-build217-jenkins-5

I wonder if something changed upstream...

BTW you are using /dev/./urandom as an entropy source for the JVM?

On 29 July 2014 19:51, Dean Yu dean...@gmail.com wrote:

Hi folks,

We just upgraded our cluster from 1.509.4 to 1.554.3, and discovered a significant increase in our build times. Builds that typically took ~50 minutes to complete started taking ~90 minutes to finish, sometimes spiking to 2 hours.

While researching, we found this JIRA[1] which reported that downgrading the trilead-ssh2 jar solved the performance issues. While this ticket talks specifically about artifact downloads, we see that our builds as a whole were slower. The trilead-ssh2 dependency version was updated by [2], so it was introduced into 1.536, so it would only have made it to LTS with 1.554.1 in April.

Looking at the trilead-ssh2 repo[3], it looks like there were a small set of changes:
- changes by ndeloof to merge a newer upstream (build214 to build217)
- changes by stephenc to fix connection bugs
- changes by kohsuke to support package window sizes

Anyone have thoughts on the likely culprit? Given the severity of the performance hit we took, I'm surprised that more people haven't reported this.

-- Dean

[1] https://issues.jenkins-ci.org/browse/JENKINS-20550
[2] https://github.com/jenkinsci/jenkins/commit/bb265c5e95b0fe39128720b903914236962db41b
[3] https://github.com/jenkinsci/trilead-ssh2/commits/master

--
Thanks!
Mark Waite
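The blocking behaviour discussed in this message can be observed directly by timing a read from each device. The following is only a minimal sketch, not something posted in the thread: the class and method names are made up for illustration, it assumes a Linux-like system where the device paths exist, and whether /dev/random actually stalls depends on the kernel and on how much entropy is available.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: times how long a read of random bytes takes. */
class EntropyReadTimer {

    /** Reads exactly count bytes, waiting for as long as the source makes us wait. */
    static byte[] readFully(InputStream in, int count) {
        byte[] buffer = new byte[count];
        int offset = 0;
        try {
            while (offset < count) {
                int n = in.read(buffer, offset, count - offset);
                if (n < 0) {
                    throw new IllegalStateException("source ended after " + offset + " bytes");
                }
                offset += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer;
    }

    /** Returns the elapsed milliseconds needed to read count bytes from the given path. */
    static long timeReadMillis(String path, int count) {
        long start = System.nanoTime();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            readFully(in, count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Defaults to /dev/urandom; pass /dev/random as the first argument to compare.
        String path = args.length > 0 ? args[0] : "/dev/urandom";
        System.out.println(path + ": " + timeReadMillis(path, 64) + " ms for 64 bytes");
    }
}
```

Running it once per device on an otherwise idle machine gives a rough idea of whether reads from /dev/random are stalling there.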
Re: SSH slave performance degradation
In my scalability testing I have found you cannot scale out ssh slaves with /dev/random as the entropy source. You need to use /dev/./urandom (a JVM bug requires that name, btw).

The master on Windows is a different story though.
Re: SSH slave performance degradation
This is great info, but how big of a pool of ssh slaves does this become a problem at? We have 12. (And again, the problem goes away by downgrading the library.)

-- Dean
Re: SSH slave performance degradation
On an AWS m3.large I could not even get to 10 SSH slaves connected without switching to /dev/./urandom.
Re: SSH slave performance degradation
What's the most straightforward way to add this to my installation: add it to the container args for the master, and to the node configuration for the slaves? Is this needed just on the master, or just on the slaves?
Re: SSH slave performance degradation
On all Linux machines you can just add `-Djava.security.egd=file:/dev/./urandom` to the JVM startup command.

This is more critical on the Jenkins master than on the slaves, as the slaves typically only have one connection back to the master, whereas the master has multiple slaves. If you have multiple slaves sharing the same machine then you would probably need it for the slaves also.

Windows machines do not have this issue as far as I am aware.

Oh and yes, that crazy path is the only way to get it to work... it's a bug/feature of the JVM.
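To confirm that the `-Djava.security.egd=file:/dev/./urandom` setting discussed in this thread actually reached a given JVM, the property can be read back from inside the process. This is a minimal sketch under stated assumptions: the class and method names are invented for illustration, and whether the property influences `SecureRandom.generateSeed` depends on the JVM version and security provider, so the timing is only a rough indicator.

```java
import java.security.SecureRandom;

/** Hypothetical check: reports the JVM entropy setting and times one seed request. */
class EgdCheck {

    /** Turns the value of the java.security.egd system property into a short description. */
    static String describeEgd(String egd) {
        if (egd == null) {
            return "java.security.egd is not set, so the JVM default applies";
        }
        if (egd.equals("file:/dev/./urandom")) {
            return "java.security.egd uses the /dev/./urandom spelling from this thread";
        }
        return "java.security.egd is set to " + egd;
    }

    /** Requests seed bytes from a fresh SecureRandom and returns the elapsed milliseconds. */
    static long timeSeedMillis(int numBytes) {
        long start = System.nanoTime();
        new SecureRandom().generateSeed(numBytes); // may stall if the entropy source blocks
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println(describeEgd(System.getProperty("java.security.egd")));
        System.out.println("20 seed bytes took " + timeSeedMillis(20) + " ms");
    }
}
```

After compiling, running it as `java -Djava.security.egd=file:/dev/./urandom EgdCheck` and again without the flag shows whether the setting is picked up and whether seeding time differs on that machine.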
Re: SSH slave performance degradation
On 30 July 2014 14:48, Dean Yu dean...@gmail.com wrote:

> the problem goes away by downgrading the library.

It would be great if you could determine whether the new version of the library is selecting a different cipher suite from the old version. It may be a change in the cipher priority that could have impacted performance, or perhaps there was a replay attack that the older version was vulnerable to and the fix may require more entropy than the old version...
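For context on why a change in cipher priority could matter: under RFC 4253 the cipher used in each direction is the first entry in the client's preference list that the server also supports. The sketch below illustrates that rule only. The preference lists are made-up examples, not the actual lists of any trilead-ssh2 build, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrates the RFC 4253 rule: first client preference that the server also offers. */
class CipherNegotiation {

    /** Returns the negotiated algorithm, or null when the two lists share no entry. */
    static String negotiate(List<String> clientPrefs, List<String> serverSupported) {
        for (String candidate : clientPrefs) {
            if (serverSupported.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical lists, chosen only to show the effect of reordering.
        List<String> server = Arrays.asList("aes128-ctr", "aes128-cbc", "3des-cbc");
        List<String> oldClient = Arrays.asList("aes128-cbc", "aes128-ctr", "3des-cbc");
        List<String> newClient = Arrays.asList("aes128-ctr", "aes128-cbc", "3des-cbc");

        System.out.println("old client order picks: " + negotiate(oldClient, server));
        System.out.println("new client order picks: " + negotiate(newClient, server));
    }
}
```

With the same server, reordering the client list alone changes the chosen cipher, which is why comparing the negotiated algorithm between the two library versions would help narrow things down.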
Re: SSH slave performance degradation
Release Notes:
==============

build217, 2013-06-03:
- Support for SSH agent based authentication.

build216, 2013-03-04:
- Support of unencrypted entries in the known_hosts file.
- Improved timeout handling.

build214, 2011-04-25:
- Project build procedure uses Gradle; project artifacts from now on are available at TMate Software Maven repository at http://maven.tmatesoft.com/

build213, 2008-04-01:
- Added a workaround for servers that violate RFC4253 when sending the SSH_MSG_SERVICE_ACCEPT and the SSH_MSG_KEXDH_REPLY messages. Thanks to Gordon Brockway.
- Fixed encodings for alien platforms (e.g., EBCDIC based). Use ISO-8859-1 in most places where we used the default platform encoding so far.
- API change: atime and mtime attributes in SFTPv3FileAttributes are now of type Long (not Integer). Makes it easier to properly handle values 2^31.
- Fixed the blowfish-ctr cipher, it could not be instantiated (a typo that got in during the move to the trilead namespace). Thanks to Roelof Kemp.
- Still in the queue: SSH server support.
Re: SSH slave performance degradation
Jenkins 1.535 used build 213, so while those changes look interesting, they are part of the old version of the library.

-- Dean
Re: SSH slave performance degradation
yep

On 30 July 2014 16:07, Dean Yu dean...@gmail.com wrote:

Oh and yes that crazy path is the only way to get it to work... it's a bug/feature of the JVM

I take it from the lack of any mention of a JVM version that this is an outstanding issue?

From: Stephen Connolly stephen.alan.conno...@gmail.com
Date: Wednesday, July 30, 2014 at 7:51 AM
Subject: Re: SSH slave performance degradation

On all Linux machines you can just add `-Djava.security.egd=file:/dev/./urandom` to the JVM startup command. This is more critical on the Jenkins master than on the slaves, as each slave typically has only one connection back to the master, whereas the master has connections to multiple slaves. If you have multiple slaves sharing the same machine then you would probably need it for the slaves also. Windows machines do not have this issue as far as I am aware.

Oh and yes that crazy path is the only way to get it to work... it's a bug/feature of the JVM

On 30 July 2014 15:23, Mike Chmielewski c...@mikec.123mail.org wrote:

What's the most straightforward way to add this to my installation: add it to the container args for the master, and to the node configuration for the slaves? Is this needed just on the master or just on the slaves?

- Original message -
From: Stephen Connolly stephen.alan.conno...@gmail.com
Date: Wed, 30 Jul 2014 15:04:30 +0100
Subject: Re: SSH slave performance degradation

On an AWS m3.large I could not even get to 10 SSH slaves connected without switching to /dev/./urandom

On 30 July 2014 14:48, Dean Yu dean...@gmail.com wrote:

This is great info, but at how big a pool of SSH slaves does this become a problem? We have 12. (And again, the problem goes away by downgrading the library.)

-- Dean

From: Stephen Connolly stephen.alan.conno...@gmail.com
Date: Tuesday, July 29, 2014 at 11:42 PM
Subject: Re: SSH slave performance degradation

In my scalability testing I have found you cannot scale out SSH slaves with /dev/random as the entropy source. You need to use /dev/./urandom (a JVM bug requires that name, btw). The master on Windows is a different story though.
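For anyone who wants to check what a given master or slave JVM is actually configured with, the entropy-source advice in this thread can be probed with a small sketch along these lines. The class and method names (`EntropySourceCheck`, `configuredSource`, `timeSeedMillis`) are made up for illustration; the lookup order (the `java.security.egd` system property first, then the `securerandom.source` security property) is an assumption about how the JVM resolves the seed source, and whether the timing reflects the flag at all depends on the JVM version and the SecureRandom provider in use.

```java
import java.security.SecureRandom;
import java.security.Security;

// Hypothetical helper, not part of Jenkins or trilead-ssh2.
final class EntropySourceCheck {

    /** Returns the entropy source this JVM was asked to use, or "(default)" if nothing is configured. */
    static String configuredSource() {
        // Assumption: the command-line system property takes precedence...
        String egd = System.getProperty("java.security.egd");
        if (egd != null && !egd.isEmpty()) {
            return egd;
        }
        // ...over the value from the JRE's java.security configuration file.
        String fromConfig = Security.getProperty("securerandom.source");
        return (fromConfig != null && !fromConfig.isEmpty()) ? fromConfig : "(default)";
    }

    /** Measures how long the JVM takes to hand out the requested number of seed bytes. */
    static long timeSeedMillis(int seedBytes) {
        long start = System.nanoTime();
        new SecureRandom().generateSeed(seedBytes);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("entropy source: " + configuredSource());
        // Repeated requests: long or growing delays hint at a blocking source.
        for (int i = 0; i < 5; i++) {
            System.out.println("seed request " + i + ": " + timeSeedMillis(20) + " ms");
        }
    }
}
```

Running it once as-is and once with `-Djava.security.egd=file:/dev/./urandom` on the same machine should give a rough indication of whether seed generation is stalling there. It is only a rough probe, not a substitute for measuring real build times.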
Re: SSH slave performance degradation
IIUC http://bugs.java.com/view_bug.do?bug_id=4705093 assigned special meaning to /dev/urandom so to avoid that special meaning you need to add the /./
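As a rough illustration of why the extra /./ can matter: if the JVM recognises the special value by comparing the configured string literally (which is my reading of the explanation above, not something verified against the JVM sources), then the dotted spelling no longer matches, even though both spellings name the same device once the "." segment is resolved. The class and method names below are hypothetical.

```java
import java.net.URI;

// Hypothetical demo class, only meant to show the string-versus-target distinction.
final class UrandomPathDemo {
    static final String PLAIN = "file:/dev/urandom";
    static final String DOTTED = "file:/dev/./urandom";

    /** True when the two spellings are identical as raw strings. */
    static boolean sameLiteral(String a, String b) {
        return a.equals(b);
    }

    /** True when the two spellings name the same location once "." segments are removed. */
    static boolean sameAfterNormalizing(String a, String b) {
        return URI.create(a).normalize().equals(URI.create(b).normalize());
    }

    public static void main(String[] args) {
        System.out.println("same literal: " + sameLiteral(PLAIN, DOTTED));          // prints "same literal: false"
        System.out.println("same target:  " + sameAfterNormalizing(PLAIN, DOTTED)); // prints "same target:  true"
    }
}
```

So the dotted form is a different string that still points at /dev/urandom, which is consistent with it sidestepping any special handling keyed on the exact text.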
SSH slave performance degradation
Hi folks,

We just upgraded our cluster from 1.509.4 to 1.554.3, and discovered a significant increase in our build times. Builds that typically took ~50 minutes to complete started taking ~90 minutes to finish, sometimes spiking to 2 hours.

While researching, we found this JIRA[1], which reported that downgrading the trilead-ssh2 jar solved the performance issues. While this ticket talks specifically about artifact downloads, we see that our builds as a whole were slower.

The trilead-ssh2 dependency version was updated by [2], so it was introduced in 1.536 and would only have made it to LTS with 1.554.1 in April. Looking at the trilead-ssh2 repo[3], it looks like there was a small set of changes:

- changes by ndeloof to merge a newer upstream (build214 to build217)
- changes by stephenc to fix connection bugs
- changes by kohsuke to support package window sizes

Anyone have thoughts on the likely culprit? Given the severity of the performance hit we took, I'm surprised that more people haven't reported this.

-- Dean

[1] https://issues.jenkins-ci.org/browse/JENKINS-20550
[2] https://github.com/jenkinsci/jenkins/commit/bb265c5e95b0fe39128720b903914236962db41b
[3] https://github.com/jenkinsci/trilead-ssh2/commits/master
Re: SSH slave performance degradation
* KK's changes to window sizes should have *increased* performance
* My connection bug fixes were surgical IIRC
* Nicolas's merge of upstream seems to include an EOL change, so it is hard to see what changed there with the GitHub diff tool: https://github.com/jenkinsci/trilead-ssh2/compare/trilead-ssh2-build214-jenkins-3...trilead-ssh2-build217-jenkins-5

I wonder if something changed upstream...

BTW, are you using /dev/./urandom as an entropy source for the JVM?