[openssl-dev] [openssl.org #2650] major ssl read/ write performance improvement - updated
Sorry it took so long to look at this. The code has changed significantly since then, including making the structures opaque. Please open a new ticket (or GitHub pull request) against current sources if this is still an issue.

-- Ticket here: http://rt.openssl.org/Ticket/Display.html?id=2650 Please log in as guest with password guest if prompted
-- openssl-dev mailing list To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev
Re: [openssl.org #2650] major ssl read/ write performance improvement - updated
I'm getting more SSL timeouts when running apachebench with this patch enabled, http://www.pastie.org/3002992 __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #2650] major ssl read/ write performance improvement - updated
Hi Andrey,

I measured on a chip (Cavium Octeon) that supports crypto acceleration and runs no OS. My setup does not involve TCP I/O: the TCP data has already been received and is passed to SSL through a custom BIO (or a mem BIO). I measure SSL_read or SSL_write (about 1K in size) in ms, using aes256_cbc/sha1. The measurement is done in CPU ticks, and the numbers are roughly:

without any change and without crypto accel: 170ms (this is almost linear in the size of the record)
with crypto accel only: 54ms (or something like that; the acceleration is done on the same Cavium CPU through the engine interface)
with the patch: 25ms

Since there is no OS, the code runs to completion and I/O is done separately. The memory allocation is based on Cavium-provided code. For me the saving is a fixed amount, so the percentage depends on the other parts. I don't have a way of measuring with I/O involved.

Regards, Michael

- Original Message -
From: Andrey Kulikov amde...@gmail.com
To: openssl-dev@openssl.org
Sent: Thursday, December 8, 2011 4:11 PM
Subject: Re: [openssl.org #2650] major ssl read/ write performance improvement - updated

Hello Michael, I have tested your patch. [...]
Re: [openssl.org #2650] major ssl read/ write performance improvement - updated
I forgot to mention that what I tested was a slightly different implementation that contains a couple of other small optimizations: in the tls1_mac() function I combined the first two update calls into one call, which also saved a couple of ms. The numbers were TLS numbers.

As for the question of record size: the smaller the record, the larger the percentage saved, since the saving is fixed.

- Original Message -
From: Deng Michael mdeng...@yahoo.com
To: openssl-dev@openssl.org
Sent: Friday, December 9, 2011 5:15 PM
Subject: Re: [openssl.org #2650] major ssl read/ write performance improvement - updated

Hi Andrey, I measured on a chip (Cavium Octeon) that supports crypto acceleration and runs no OS. [...]
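The record-size argument can be checked with a little arithmetic. The sketch below uses the figures reported in this thread (54 ms with acceleration only, 25 ms with the patch, so a fixed saving of 29 ms); the 108 ms record is a made-up value used only to show the trend, and the function names are illustrative.

```c
#include <assert.h>

/* Overall speedup factor between two timings of the same operation. */
double speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

/* Fraction of the per-record time removed by a fixed saving.
 * The larger the per-record total, the smaller this fraction. */
double saved_fraction(double total_before_ms, double fixed_saving_ms)
{
    assert(total_before_ms > 0.0);
    return fixed_saving_ms / total_before_ms;
}

/* Figures from the thread: 54 ms before, 25 ms after, 29 ms saved.
 *   speedup(54.0, 25.0)         is 2.16
 *   saved_fraction(54.0, 29.0)  is about 0.54
 * Hypothetical larger record costing 108 ms before the patch:
 *   saved_fraction(108.0, 29.0) is about 0.27
 */
```

With a fixed saving, doubling the per-record cost halves the relative gain, which matches the statement that smaller records benefit more.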
Re: [openssl.org #2650] major ssl read/ write performance improvement - updated
I just tried a simple TLS echo server running on the chip with only one core enabled (to rule out malloc contention between CPU cores; this is a 16-core CPU). I did two runs, and the ONLY difference in the code is with/without checkpointing the ctx. Both have crypto accel. The speed is measured on the data part only, and each SSL write is of size 1000 on the client side. The OpenSSL code runs in thread-protected mode (registered lock callbacks). The server has no OS (with the TCP stack in software).

The overall performance difference is close to double (54 vs 96). This may also indicate that the memory allocation and deallocation routines in my setup are not very good. On a system with no OS, I think the timing is more indicative of software efficiency. For my setup the unknown is the memory arena malloc/free calls.

- Original Message -
From: Deng Michael mdeng...@yahoo.com
To: openssl-dev@openssl.org
Sent: Friday, December 9, 2011 5:34 PM
Subject: Re: [openssl.org #2650] major ssl read/ write performance improvement - updated

I forgot to mention that what I tested was a slightly different implementation that contains a couple of other small optimizations [...]
Re: [openssl.org #2650] major ssl read/ write performance improvement - updated
Hello Michael,

I have tested your patch. It works stably, at least with the ccgost engine (and without any engine too, of course). Thanks for the contribution!

Could you please describe your test environment and test methodology? How did you measure that doubling of read/write speed? What tool/profiler do you use? How does it depend on SSL record size? What is the overall speed improvement if we count OS I/O?

I'm asking because I'm trying to measure the performance improvement your changes can give with my crypto accelerator, and my results are not even close to double the read/write speed. But my test resources are limited for the moment, and it is possible that this is due to those limitations. In any case, I guess the community will be grateful if you share your experience.

WBR, Andrey

On 5 December 2011 14:33, Deng Michael via RT r...@openssl.org wrote: Hi, I have changed the mac code which gives substantial improvement for both read and write (not handshake) The saving is fairly major, on cpu with crypto acceleration, the change can more than double the overall ssl read /write speed for 1K record excluding OS IO time.
[openssl.org #2650] major ssl read/ write performance improvement - updated
Hi,

I have changed the MAC code in a way that gives a substantial improvement for both read and write (not handshake). The saving is fairly major: on a CPU with crypto acceleration, the change can more than double the overall SSL read/write speed for a 1K record, excluding OS I/O time. This implies the change removed the majority of the code overhead for read and write.

The basic idea is to remove all the EVP_MD_CTX duplications (which are very CPU intensive) during read and write. The original code performs numerous memory allocations and frees for each read or write, all due to the ctx's deep copy. The new way of keeping the ctx is to make it do a state checkpoint and restore instead of a deep copy. After this change there is NO memory operation for read and write. The changes are not too big either.

One catch (it should not really be a catch) is that at application level NO MORE than one thread can work on the SAME SSL/TLS connection for read, or for write (a read and a write can still be done at the same time). But I would think most apps would NEVER allow more than one thread to read or write on the same connection (I don't think it would work if you did that anyway, even without my change).

The patch file I attached is based on the 1.0.0e version.

Andrey found a problem in the original version of the patch when a PKEY_METHS engine is used, so this is an updated patch (complete, not incremental) to fix that. The checkpoint/restore is not enabled if a PKEY_METHS engine is used, UNLESS the engine code implements the control interface to do the checkpointing/restore.

As pointed out by others, there can be other ways to achieve a similar thing, and the saving also depends on your system's memory allocation routines. Part of the patch also looks a bit like a hack.

Thanks to Andrey!

Regards, Michael

Attachment: checkpoint.patch (binary data)
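The difference between a deep copy and a checkpoint/restore can be sketched with a toy context. This is only an illustration of the idea described above, not the attached patch: toy_ctx and every toy_* function are hypothetical names, and the "digest" is a trivial running hash rather than a real EVP_MD_CTX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a keyed MAC context that is reused for every record. */
typedef struct {
    uint32_t state;          /* running digest state */
    uint32_t saved;          /* checkpointed copy of state */
    int      has_checkpoint; /* non-zero once toy_checkpoint() was called */
} toy_ctx;

void toy_init(toy_ctx *c, uint32_t key)
{
    c->state = key * 2654435761u; /* "keyed" starting state */
    c->saved = 0;
    c->has_checkpoint = 0;
}

void toy_update(toy_ctx *c, const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c->state = (c->state ^ p[i]) * 16777619u;
}

/* Remember the keyed state once, right after key setup. */
void toy_checkpoint(toy_ctx *c)
{
    c->saved = c->state;
    c->has_checkpoint = 1;
}

void toy_restore(toy_ctx *c)
{
    assert(c->has_checkpoint);
    c->state = c->saved;
}

/* Deep-copy style: allocate a temporary context for every record.
 * Returns 0 if the allocation fails (good enough for a sketch). */
uint32_t toy_mac_copy(const toy_ctx *c, const unsigned char *rec, size_t n)
{
    toy_ctx *tmp = malloc(sizeof *tmp);
    uint32_t out;

    if (tmp == NULL)
        return 0;
    memcpy(tmp, c, sizeof *tmp);
    toy_update(tmp, rec, n);
    out = tmp->state;
    free(tmp);
    return out;
}

/* Checkpoint style: work in place, then rewind. No allocation per record. */
uint32_t toy_mac(toy_ctx *c, const unsigned char *rec, size_t n)
{
    uint32_t out;

    toy_update(c, rec, n);
    out = c->state;
    toy_restore(c);
    return out;
}
```

Both variants produce the same value for the same record, but toy_mac mutates the shared context while it runs. That is why only one thread at a time may use a given context, which is the threading restriction described in the message above.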
Re: [openssl.org #2650] major ssl read/ write performance improvement - updated
Got a patch for trunk also?

On Mon, Dec 5, 2011 at 11:33 PM, Deng Michael via RT r...@openssl.org wrote: Hi, I have changed the mac code which gives substantial improvement for both read and write (not handshake) [...]
Re: major ssl read/ write performance improvement
Hello,

Thanks for the interesting contribution! Unfortunately, when I apply the patch, s_server fails with a SEGFAULT when using the ccgost engine (and possibly others), here in EVP_DigestSignFinal:

    if (sctx)
        r = md_ctx_ptr->pctx->pmeth->signctx(md_ctx_ptr->pctx, sigret, siglen, md_ctx_ptr);
    else

because pmeth->signctx == 0x08 (or something like this). When I use an RSA certificate the segfault does not occur, as pmeth->signctx points to some valid place.

The stack trace is:

    EVP_DigestSignFinal (ctx=0x87802b0, sigret=0xbfd5f6dc \\\b\\002x\\b\\001\, siglen=0xbfd5f698)
    tls1_mac (ssl=0x877a088, md=0xbfd5f6dc \\\b\\002x\\b\\001\, send=0)
    ssl3_get_record (s=0x877a088)
    ssl3_read_bytes (s=0x877a088, type=22, buf=0x8788d50 \\\020\, len=4, peek=0)
    ssl3_get_message (s=0x877a088, st1=8608, stn=8609, mt=-1, max=514, ok=0xbfd5f8b8)
    ssl3_get_cert_verify (s=0x877a088)
    ssl3_accept (s=0x877a088)
    ssl3_read_bytes (s=0x877a088, type=23, buf=0x877e7e8 \\, len=4096, peek=0)
    ssl3_read_internal (s=0x877a088, buf=0x877e7e8, len=4096, peek=0)
    ssl3_read (s=0x877a088, buf=0x877e7e8, len=4096)
    SSL_read (s=0x877a088, buf=0x877e7e8, num=4096)
    ssl_read (b=0x8779370, out=0x877e7e8 \\, outl=4096)
    BIO_read (b=0x8779370, out=0x877e7e8, outl=4096)
    buffer_gets (b=0x8777e00, buf=0x877a7e0 \\, size=16382)
    BIO_gets (b=0x8777e00, in=0x877a7e0 \\, inl=16383)
    www_body (hostname=0x0, s=6, context=0x0)
    do_server (port=443, type=1, ret=0x8248ac8, cb=0x8072d24 www_body, context=0x0)
    s_server_main (argc=0, argv=0xbfd602b8)
    do_cmd (prog=0x8770868, argc=16, argv=0xbfd60278)
    main (Argc=16, Argv=0xbfd60278)

Could you please advise what is going wrong with your code?

To reproduce it you need to:

1. Adjust your openssl.cnf file by adding the following somewhere before [ new_oids ] (if we are talking about the sample config file from the OpenSSL distribution):

    openssl_conf = openssl_def

    [openssl_def]
    engines = engine_section

    [engine_section]
    gost = gost_section

    [gost_section]
    engine_id = gost
    default_algorithms = ALL

2. Generate a private key:

    ./apps/openssl genpkey -engine gost -algorithm gost2001 -pkeyopt paramset:A -out botkey.p8

3. Create a self-signed certificate:

    ./apps/openssl req -x509 -days 1095 -subj '/C=US/CN=ccgost_srv/O=sam...@mail.com' -engine gost -new -key botkey.p8 -out botcert.cer

4. Run s_server:

    ./apps/openssl s_server -engine gost -tls1 -www -accept 443 -state -cert botcert.cer -key botkey.p8 -cipher aGOST01

5. Run s_client:

    ./apps/openssl s_client -tls1 -connect 192.168.10.103:443 -msg

Well... Here s_client will crash with a segfault. But if you connect via a browser, s_server will crash.

Please let me know if you have any questions.

Andrey.

On 30 November 2011 05:56, Deng Michael mdeng...@yahoo.com wrote:

Thanks Steve for the comment. I guess there are other ways to do similar things. Since I was not sure about the intentions of the original code, I tried to make the change in such a way that when checkpoint is not called it behaves as before. Adding a new field is, for me, less likely to interfere with other code. It seems to me that the three evp_md_ctxs contained within the hmac_md_ctx have the data for restoring the state, but I was not sure. The new field also serves as a flag to tell whether it has checkpoint data (I could have used an existing flag). My patch also contains some hacking, I would think.

Anyway, the real saving comes from redoing the state preservation of the evp_md_ctx, which contains an evp_pkey_ctx, which in turn contains an hmac_ctx, which again contains three evp_md_ctx's. The dups of these are called in:

tls1_mac() (and the similar place for ssl3) and EVP_DigestSignFinal(): these two are the super expensive ones (real super)
the copy of ctx in HMAC_Final(): this one is not too bad and can be simplified

I would think the saving is large enough to be worth changing, maybe in future releases.

regards, Michael

- Original Message -
From: Dr. Stephen Henson st...@openssl.org
To: openssl-dev@openssl.org
Sent: Tuesday, November 29, 2011 1:21 PM
Subject: Re: major ssl read/ write performance improvement

On Mon, Nov 28, 2011, Deng Michael wrote: Hi, I have changed the mac code which gives substantial improvement for both read and write (not handshake) [...]
Re: major ssl read/ write performance improvement
Hi Andrey,

Thanks for trying it out. I did not try this version with many engines, and I am very interested in your setup. Could you try how it works without the patch (under gdb)? What is the value of ctx->pctx->pmeth->signctx when the function is entered, and what is tmp_ctx.pctx->pmeth->signctx after the ctx copy?

Also, I am not sure how you use engines. The patch should work if a digest engine is used (a digest engine such as sha1 or md5). I am not sure about the case where there is a signing engine. It would be great if you could send me the engine code and show how your code uses the engine; then we could figure out how to work around that.

I am not sure how the pointer is set up by OpenSSL (I'll do some digging there), but the value 0x08 is likely coming from a member of a NULL pointer structure (the member happens to be at offset 8). This is a guess.

Regards, Michael Deng mdeng...@yahoo.com

- Original Message -
From: Andrey Kulikov amde...@gmail.com
To: openssl-dev@openssl.org
Sent: Saturday, December 3, 2011 5:15 PM
Subject: Re: major ssl read/ write performance improvement

Hello, Thanks for the interesting contribution! Unfortunately, when I apply the patch, s_server fails with a SEGFAULT when using the ccgost engine (and possibly others) [...]
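The guess about 0x08 can be illustrated in isolation. The sketch below shows how the address of a member computed from a NULL base pointer comes out as a small number equal to the member's offset. The struct toy_meth layout is invented for the example and is not OpenSSL's EVP_PKEY_METHOD; on common ABIs where int is 4 bytes, its third member sits at offset 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical method table, used only to demonstrate member offsets. */
struct toy_meth {
    int pkey_id;
    int flags;
    int (*signctx)(void);
};

/* Address a member would have if the structure started at `base`. */
uintptr_t member_address(uintptr_t base, size_t offset)
{
    return base + offset;
}

/* With a NULL (zero) base, the "address" of signctx is just its offset,
 * so a stray value such as 0x08 hints at a NULL structure pointer
 * somewhere upstream rather than at a corrupted function pointer. */
uintptr_t signctx_address_if_base_is_null(void)
{
    return member_address((uintptr_t)0, offsetof(struct toy_meth, signctx));
}
```

Seeing a pointer value that equals a plausible member offset is therefore a useful debugging clue, though it does not by itself prove where the NULL came from.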
Re: major ssl read/ write performance improvement
Hi Andrey again,

Maybe there is a bug in the patch:

    if (EVP_MD_CTX_has_checkpoint(ctx)) {
        md_ctx_ptr = ctx;
    }

should be changed to

    if (EVP_MD_CTX_has_checkpoint(ctx)) {
        md_ctx_ptr = ctx;
    } else {
        EVP_MD_CTX_init(&tmp_ctx);
    }

- Original Message -
From: Andrey Kulikov amde...@gmail.com
To: openssl-dev@openssl.org
Sent: Saturday, December 3, 2011 5:15 PM
Subject: Re: major ssl read/ write performance improvement

Hello, Thanks for the interesting contribution! Unfortunately, when I apply the patch, s_server fails with a SEGFAULT when using the ccgost engine (and possibly others) [...]
Re: major ssl read/ write performance improvement
On Mon, Nov 28, 2011, Deng Michael wrote:

> Hi, I have changed the mac code which gives substantial improvement for both read and write (not handshake) [...]

Thanks for the patch. It should really go to the request tracker (RT), though.

There are a few problems with the patch as it stands. Firstly, new features will never be added to 1.0.0x, only security and bug fixes. Your patch adds a field in the middle of an EVP_MD_CTX, which will result in binary compatibility issues with existing applications, so including it in 1.0.1 is problematic as well. Adding the field at the end would result in fewer problems, but it would still increase the size of EVP_MD_CTX.

However, I wonder if the same savings could be achieved in a different way. If the destination EVP_MD_CTX uses the same digest as the existing one, no new memory is allocated and the copy should simply memcpy the result across, which should be a far less expensive operation. So perhaps, instead of having a temporary EVP_MD_CTX which is created and destroyed regularly, we could have a more persistent one tied to the SSL structure: the initial copy would allocate memory, but subsequent ones would only be a memcpy.

Adding fields at the end of an SSL structure is likely to cause far fewer problems because SSL structures are allocated using SSL_new().

Steve.
--
Dr Stephen N. Henson. OpenSSL project core developer.
Commercial tech support now available see: http://www.openssl.org
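The "persistent destination" idea can be sketched with a toy model. This is not OpenSSL code: `toy_md_ctx`, `toy_ctx_copy` and `toy_copies_into_persistent` are hypothetical names, and the struct only mimics a context that owns heap-allocated digest state. Under that assumption, it shows the behaviour described above: copying repeatedly into one long-lived destination of the same digest allocates once, and every later copy is just a memcpy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a digest context; NOT the real EVP_MD_CTX. */
typedef struct {
    int digest_id;          /* which digest this context is set up for */
    size_t state_size;      /* size of the digest-specific state */
    unsigned char *state;   /* heap-allocated digest state */
} toy_md_ctx;

/* Counts state allocations, purely for illustration. */
static int toy_alloc_count = 0;

/*
 * Copy src into dst. Memory is allocated only when dst has no state yet
 * or was set up for a different digest; otherwise the existing buffer is
 * reused and the copy is a plain memcpy.
 * Returns 1 on success, 0 on allocation failure.
 */
int toy_ctx_copy(toy_md_ctx *dst, const toy_md_ctx *src)
{
    assert(dst != NULL && src != NULL);

    if (dst->state == NULL
        || dst->digest_id != src->digest_id
        || dst->state_size != src->state_size) {
        unsigned char *p = malloc(src->state_size);
        if (p == NULL)
            return 0;
        free(dst->state);
        dst->state = p;
        dst->digest_id = src->digest_id;
        dst->state_size = src->state_size;
        toy_alloc_count++;
    }
    memcpy(dst->state, src->state, src->state_size);
    return 1;
}

/*
 * Perform n copies into one persistent destination and return how many
 * allocations that took.
 */
int toy_copies_into_persistent(int n)
{
    unsigned char key_state[64];
    toy_md_ctx src = { 1, sizeof key_state, key_state };
    toy_md_ctx persistent = { 0, 0, NULL };
    int before = toy_alloc_count;
    int i;

    memset(key_state, 0xAB, sizeof key_state);
    for (i = 0; i < n; i++) {
        if (!toy_ctx_copy(&persistent, &src))
            break;
    }
    free(persistent.state);
    return toy_alloc_count - before;
}
```

Calling `toy_copies_into_persistent(1000)` performs a thousand copies with a single allocation. A temporary destination that is created and destroyed for every record would instead allocate and free on each copy.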
Re: major ssl read/ write performance improvement
Thanks Steve for the comment.

I guess there are other ways to do similar things. Since I was not sure about the intentions of the original code, I tried to make the change in such a way that, when checkpoint is not called, it behaves as before. Adding a new field is, for me, less likely to interfere with other code. It seems to me that the three evp_md_ctxs contained within the hmac_md_ctx have the data for restoring the state, but I was not sure. The new field also serves as a flag to tell whether there is checkpoint data (I could have used an existing flag). My patch also contains some hacking, I would think.

Anyway, the real saving comes from redoing how state is preserved for the evp_md_ctx, which contains an evp_pkey_ctx, which in turn contains an hmac_ctx, which again contains three evp_md_ctx's. The duplication of these is called in:

- tls1_mac() (and the similar place for ssl3) and EVP_DigestSignFinal(): these two are the super expensive ones (really super).
- HMAC_Final(), which copies the ctx: this one is not too bad and can be simplified.

I would think the saving is large enough to be worth changing, maybe in future releases.

regards,
Michael

----- Original Message -----
From: Dr. Stephen Henson st...@openssl.org
To: openssl-dev@openssl.org
Cc:
Sent: Tuesday, November 29, 2011 1:21 PM
Subject: Re: major ssl read/ write performance improvement

> [...]
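The question raised in this thread of where a new field is placed in a context structure can be made concrete with a small layout comparison. The structs below are hypothetical and deliberately simplified; their field names are illustrative and are not the real EVP_MD_CTX definition. The sketch only shows why inserting a field in the middle shifts the offsets that already-compiled applications rely on, while appending keeps the existing offsets but still grows the structure.

```c
#include <stddef.h>

/* Hypothetical "released" layout that existing binaries were built against. */
struct ctx_v1 {
    const void *digest;
    unsigned long flags;
    void *md_data;
};

/* Same fields with a new pointer inserted in the middle. */
struct ctx_mid {
    const void *digest;
    void *checkpoint;       /* new field (illustrative name) */
    unsigned long flags;
    void *md_data;
};

/* Same fields with the new pointer appended at the end. */
struct ctx_end {
    const void *digest;
    unsigned long flags;
    void *md_data;
    void *checkpoint;       /* new field (illustrative name) */
};

/* 1 if inserting in the middle moved an existing field. */
int mid_insert_moves_flags(void)
{
    return offsetof(struct ctx_mid, flags) != offsetof(struct ctx_v1, flags);
}

/* 1 if appending left every pre-existing field at its old offset. */
int append_keeps_offsets(void)
{
    return offsetof(struct ctx_end, digest) == offsetof(struct ctx_v1, digest)
        && offsetof(struct ctx_end, flags) == offsetof(struct ctx_v1, flags)
        && offsetof(struct ctx_end, md_data) == offsetof(struct ctx_v1, md_data);
}

/* 1 if appending nevertheless made the structure larger. */
int append_grows_struct(void)
{
    return sizeof(struct ctx_end) > sizeof(struct ctx_v1);
}
```

The size growth matters when applications allocate the structure themselves (for example on the stack), which is why appending is described as causing fewer problems rather than none. A structure that is only ever created through a library allocator, as SSL is through SSL_new(), avoids that particular issue.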
major ssl read/ write performance improvement
Hi,

I have changed the MAC code, which gives a substantial improvement for both read and write (not the handshake). The saving is fairly major: on a CPU with crypto acceleration, the change can more than double the overall SSL read/write speed for a 1K record, excluding OS I/O time. This implies that the change removed the majority of the code overhead for read and write.

The basic idea is to remove all the EVP_MD_CTX duplications (which are very CPU intensive) during read and write. The original code involves numerous memory allocations and frees for each read or write, all due to the ctx's deep copy. The new way of keeping the ctx is to make it do a state checkpoint and restore instead of a deep copy; after this change there is NO memory allocation or free for read and write. The changes are also not too big.

One catch (it should not really be a catch) is that at the application level NO MORE than one thread can work on the SAME SSL/TLS connection for read or for write (a read and a write can still be done at the same time). But I would think most apps would NEVER allow more than one thread to read or write on the same connection (I don't think it would work if you did that anyway, even without my change).

The patch file I attached is made from the 1.0.0e version.

Happy coding!
Michael

checkpoint.patch Description: Binary data
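The checkpoint-and-restore idea described above can be illustrated with a toy keyed hash. This is only a sketch and not the attached patch: `toy_mac_ctx` and the `toy_*` functions are hypothetical names, and the FNV-1a-style mixing used here is not a secure MAC and not what OpenSSL computes. It assumes the keyed state can be saved once inside the context and then restored by plain assignment before each record, so that per-record work involves no allocation and no deep copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy context: live state plus a checkpoint slot stored in the same struct. */
typedef struct {
    uint64_t state;         /* running state, modified by each record */
    uint64_t saved;         /* checkpoint taken right after keying */
    int has_checkpoint;     /* flag: is 'saved' valid? */
} toy_mac_ctx;

/* FNV-1a-style mixing; a stand-in for a real digest update. */
static void toy_update(uint64_t *state, const unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        *state ^= p[i];
        *state *= 0x100000001b3ULL;
    }
}

/* Key the context once and take the checkpoint. */
void toy_init(toy_mac_ctx *c, const unsigned char *key, size_t keylen)
{
    c->state = 0xcbf29ce484222325ULL;
    toy_update(&c->state, key, keylen);
    c->saved = c->state;
    c->has_checkpoint = 1;
}

/* Per record: restore the checkpoint, absorb the record, return the tag. */
uint64_t toy_mac_record(toy_mac_ctx *c, const unsigned char *rec, size_t n)
{
    c->state = c->saved;    /* restore instead of duplicating the context */
    toy_update(&c->state, rec, n);
    return c->state;
}

/* Reference: recompute everything from scratch for one record. */
uint64_t toy_oneshot(const unsigned char *key, size_t keylen,
                     const unsigned char *rec, size_t n)
{
    uint64_t s = 0xcbf29ce484222325ULL;
    toy_update(&s, key, keylen);
    toy_update(&s, rec, n);
    return s;
}

/* 1 if checkpoint/restore yields the same tags as recomputing from scratch. */
int toy_checkpoint_matches_oneshot(void)
{
    static const unsigned char key[] = "secret";
    static const unsigned char r1[] = "record one";
    static const unsigned char r2[] = "second record";
    toy_mac_ctx c;
    uint64_t a, b, again;

    toy_init(&c, key, sizeof key - 1);
    a = toy_mac_record(&c, r1, sizeof r1 - 1);
    b = toy_mac_record(&c, r2, sizeof r2 - 1);
    again = toy_mac_record(&c, r1, sizeof r1 - 1);

    return a == toy_oneshot(key, sizeof key - 1, r1, sizeof r1 - 1)
        && b == toy_oneshot(key, sizeof key - 1, r2, sizeof r2 - 1)
        && again == a;
}
```

Because the live state and the checkpoint share one context, two threads calling `toy_mac_record` on the same context at once would overwrite each other's state. In this toy model that corresponds to the one-thread-per-direction restriction mentioned in the message; using separate contexts for reading and writing lets the two directions proceed independently.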