Re: [Rd] Big speedup in install.packages() by re-using connections
On Thu, 25 Apr 2024 14:45:04 +0200 Jeroen Ooms wrote: > Thoughts? How verboten would it be to create an empty external pointer object, add it to the preserved list, and set an on-exit finalizer to clean up the curl multi-handle? As far as I can tell, the internet module is not supposed to be unloaded, so this would not introduce an opportunity to jump to an unmapped address. This makes it possible to avoid adding a CurlCleanup() function to the internet module: Index: src/modules/internet/libcurl.c === --- src/modules/internet/libcurl.c (revision 86484) +++ src/modules/internet/libcurl.c (working copy) @@ -55,6 +55,47 @@ static int current_timeout = 0; +// The multi-handle is shared between downloads for reusing connections +static CURLM *shared_mhnd = NULL; +static SEXP mhnd_sentinel = NULL; + +static void cleanup_mhnd(SEXP ignored) +{ +if(shared_mhnd){ +curl_multi_cleanup(shared_mhnd); +shared_mhnd = NULL; +} +curl_global_cleanup(); +} +static void rollback_mhnd_sentinel(void* sentinel) { +// Failed to allocate memory while registering a finalizer, +// therefore must release the object +R_ReleaseObject((SEXP)sentinel); +} +static CURLM *get_mhnd(void) +{ +if (!mhnd_sentinel) { + SEXP sentinel = PROTECT(R_MakeExternalPtr(NULL, R_NilValue, R_NilValue)); + R_PreserveObject(sentinel); + UNPROTECT(1); + // Avoid leaking the sentinel before setting the finalizer + RCNTXT cntxt; + begincontext(, CTXT_CCODE, R_NilValue, R_BaseEnv, R_BaseEnv, + R_NilValue, R_NilValue); + cntxt.cend = _mhnd_sentinel; + cntxt.cenddata = sentinel; + R_RegisterCFinalizerEx(sentinel, cleanup_mhnd, TRUE); + // Succeeded, no need to clean up if endcontext() fails allocation + mhnd_sentinel = sentinel; + cntxt.cend = NULL; + endcontext(); +} +if(!shared_mhnd) { + shared_mhnd = curl_multi_init(); +} +return shared_mhnd; +} + # if LIBCURL_VERSION_MAJOR < 7 || (LIBCURL_VERSION_MAJOR == 7 && LIBCURL_VERSION_MINOR < 28) // curl/curl.h includes and headers it requires. @@ -565,8 +606,6 @@ if (c->hnd && c->hnd[i]) curl_easy_cleanup(c->hnd[i]); } -if (c->mhnd) - curl_multi_cleanup(c->mhnd); if (c->headers) curl_slist_free_all(c->headers); @@ -668,7 +707,7 @@ c.headers = headers = tmp; } -CURLM *mhnd = curl_multi_init(); +CURLM *mhnd = get_mhnd(); if (!mhnd) error(_("could not create curl handle")); c.mhnd = mhnd; -- Best regards, Ivan __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Big speedup in install.packages() by re-using connections
I'd like to raise this again now that 4.4 is out. Below is a more complete patch which includes a function to properly cleanup libcurl when R quits. Implementing this is a little tricky because libcurl is a separate "module" in R, perhaps there is a better way, but this works: view: https://github.com/r-devel/r-svn/pull/166/files patch: https://github.com/r-devel/r-svn/pull/166.diff The old patch is still there as well, which is meant a minimal proof of concept to test the performance gains for reusing the connection: view: https://github.com/r-devel/r-svn/pull/155/files patch: https://github.com/r-devel/r-svn/pull/155.diff Performance gains are greatest on high-bandwidth servers when downloading many files from the same server (e.g. packages from a cran mirror). In such cases, currently over 90% of the total time is spent on initiating and tearing town a separate SSL connection for each file download. Thoughts? On Sat, Mar 2, 2024 at 3:07 PM Jeroen Ooms wrote: > > Currently download.file() creates and terminates a new TLS connection > for each download. This creates a lot of overhead which is expensive > for both client and server (in particular the TLS handshake). Modern > internet clients (including browsers) re-use connections for many http > requests. > > We can do this in R by creating a persistent libcurl "multi-handle". > The R libcurl implementation already uses a multi-handle, however it > destroys it after each download, which defeats the purpose. The > purpose of the multi-handle is to keep it alive and let libcurl > maintain a persistent connection pool. This is particularly relevant > for install.packages() which needs to download many files from one and > the same server. > > Here is a bare minimal proof of concept patch that re-uses one and the > same multi-handle for all requests in R: > https://github.com/r-devel/r-svn/pull/155/files > > Some quick benchmarking shows that this can lead to big speedups for > download.packages() on high-bandwidth servers (such as CI). This quick > test to download 100 packages from CRAN showed more than 10x speedup > for me: https://github.com/r-devel/r-svn/pull/155 > > Moreover, I think this may make install.packages() more robust. In CI > build logs that download many packages, I often see one or two > downloads randomly failing with a TLS-connect error. I am hopeful this > problem will disappear when using a single connection to the CRAN > server to download all the packages. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Big speedup in install.packages() by re-using connections
Currently download.file() creates and terminates a new TLS connection for each download. This creates a lot of overhead which is expensive for both client and server (in particular the TLS handshake). Modern internet clients (including browsers) re-use connections for many http requests. We can do this in R by creating a persistent libcurl "multi-handle". The R libcurl implementation already uses a multi-handle, however it destroys it after each download, which defeats the purpose. The purpose of the multi-handle is to keep it alive and let libcurl maintain a persistent connection pool. This is particularly relevant for install.packages() which needs to download many files from one and the same server. Here is a bare minimal proof of concept patch that re-uses one and the same multi-handle for all requests in R: https://github.com/r-devel/r-svn/pull/155/files Some quick benchmarking shows that this can lead to big speedups for download.packages() on high-bandwidth servers (such as CI). This quick test to download 100 packages from CRAN showed more than 10x speedup for me: https://github.com/r-devel/r-svn/pull/155 Moreover, I think this may make install.packages() more robust. In CI build logs that download many packages, I often see one or two downloads randomly failing with a TLS-connect error. I am hopeful this problem will disappear when using a single connection to the CRAN server to download all the packages. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel