Re: [Rd] Big speedup in install.packages() by re-using connections

2024-04-25 Thread Ivan Krylov via R-devel
On Thu, 25 Apr 2024 14:45:04 +0200
Jeroen Ooms  wrote:

> Thoughts?

How verboten would it be to create an empty external pointer object,
add it to the preserved list, and set an on-exit finalizer to clean up
the curl multi-handle? As far as I can tell, the internet module is not
supposed to be unloaded, so this would not introduce an opportunity to
jump to an unmapped address. This makes it possible to avoid adding a
CurlCleanup() function to the internet module:

Index: src/modules/internet/libcurl.c
===================================================================
--- src/modules/internet/libcurl.c  (revision 86484)
+++ src/modules/internet/libcurl.c  (working copy)
@@ -55,6 +55,47 @@
 
 static int current_timeout = 0;
 
+// The multi-handle is shared between downloads for reusing connections
+static CURLM *shared_mhnd = NULL;
+static SEXP mhnd_sentinel = NULL;
+
+static void cleanup_mhnd(SEXP ignored)
+{
+    if (shared_mhnd) {
+        curl_multi_cleanup(shared_mhnd);
+        shared_mhnd = NULL;
+    }
+    curl_global_cleanup();
+}
+static void rollback_mhnd_sentinel(void* sentinel) {
+    // Failed to allocate memory while registering a finalizer,
+    // therefore must release the object
+    R_ReleaseObject((SEXP) sentinel);
+}
+static CURLM *get_mhnd(void)
+{
+    if (!mhnd_sentinel) {
+        SEXP sentinel = PROTECT(R_MakeExternalPtr(NULL, R_NilValue, R_NilValue));
+        R_PreserveObject(sentinel);
+        UNPROTECT(1);
+        // Avoid leaking the sentinel before setting the finalizer
+        RCNTXT cntxt;
+        begincontext(&cntxt, CTXT_CCODE, R_NilValue, R_BaseEnv, R_BaseEnv,
+                     R_NilValue, R_NilValue);
+        cntxt.cend = &rollback_mhnd_sentinel;
+        cntxt.cenddata = sentinel;
+        R_RegisterCFinalizerEx(sentinel, cleanup_mhnd, TRUE);
+        // Succeeded, no need to clean up if endcontext() fails allocation
+        mhnd_sentinel = sentinel;
+        cntxt.cend = NULL;
+        endcontext(&cntxt);
+    }
+    if (!shared_mhnd) {
+        shared_mhnd = curl_multi_init();
+    }
+    return shared_mhnd;
+}
+
# if LIBCURL_VERSION_MAJOR < 7 || (LIBCURL_VERSION_MAJOR == 7 && LIBCURL_VERSION_MINOR < 28)
 
 // curl/curl.h includes  and headers it requires.
@@ -565,8 +606,6 @@
        if (c->hnd && c->hnd[i])
            curl_easy_cleanup(c->hnd[i]);
    }
-    if (c->mhnd)
-        curl_multi_cleanup(c->mhnd);
    if (c->headers)
        curl_slist_free_all(c->headers);
 
@@ -668,7 +707,7 @@
        c.headers = headers = tmp;
    }

-    CURLM *mhnd = curl_multi_init();
+    CURLM *mhnd = get_mhnd();
    if (!mhnd)
        error(_("could not create curl handle"));
    c.mhnd = mhnd;

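To make the mechanism easier to see, here is the sentinel-plus-finalizer
idea from the patch reduced to a self-contained sketch (the names are
illustrative, and it omits the RCNTXT rollback that the patch uses in
case registering the finalizer fails to allocate):

#include <Rinternals.h>

static void *shared_resource = NULL;  /* stand-in for the multi-handle */

/* Runs when R shuts down (onexit = TRUE); the sentinel itself is never
   garbage-collected because it stays on the preserved list. */
static void finalize_resource(SEXP ignored)
{
    if (shared_resource) {
        /* release the process-wide resource here */
        shared_resource = NULL;
    }
}

static void register_shutdown_cleanup(void)
{
    SEXP sentinel = PROTECT(R_MakeExternalPtr(NULL, R_NilValue, R_NilValue));
    R_PreserveObject(sentinel);  /* keep it alive for the whole session */
    R_RegisterCFinalizerEx(sentinel, finalize_resource, TRUE /* onexit */);
    UNPROTECT(1);
}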

-- 
Best regards,
Ivan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Big speedup in install.packages() by re-using connections

2024-04-25 Thread Jeroen Ooms
I'd like to raise this again now that R 4.4 is out.

Below is a more complete patch, which includes a function to properly
clean up libcurl when R quits. Implementing this is a little tricky
because libcurl is a separate "module" in R; perhaps there is a better
way, but this works:

  view: https://github.com/r-devel/r-svn/pull/166/files
  patch: https://github.com/r-devel/r-svn/pull/166.diff

The old patch is still there as well; it is meant as a minimal proof
of concept to test the performance gains from reusing the connection:

  view: https://github.com/r-devel/r-svn/pull/155/files
  patch: https://github.com/r-devel/r-svn/pull/155.diff

Performance gains are greatest on high-bandwidth machines downloading
many files from the same server (e.g. packages from a CRAN mirror).
In such cases, currently over 90% of the total time is spent
initiating and tearing down a separate SSL connection for each file
download.
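
This is easy to observe outside R with a standalone libcurl snippet
(purely illustrative, not part of either patch; it assumes libcurl >=
7.66 for curl_multi_poll and builds with something like "cc demo.c
-lcurl"). It performs two transfers through a single multi-handle and
prints CURLINFO_NUM_CONNECTS, the number of new connections each
transfer had to open; the second transfer typically prints 0 because
the pooled connection is reused:

#include <stdio.h>
#include <curl/curl.h>

/* Discard the response body; we only care about the connection. */
static size_t sink(char *data, size_t size, size_t nmemb, void *userp)
{
    return size * nmemb;
}

static void fetch(CURLM *multi, const char *url)
{
    CURL *easy = curl_easy_init();
    curl_easy_setopt(easy, CURLOPT_URL, url);
    curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, sink);
    curl_multi_add_handle(multi, easy);
    int running = 1;
    while (running) {
        curl_multi_perform(multi, &running);
        if (running)
            curl_multi_poll(multi, NULL, 0, 1000, NULL);
    }
    long new_conns = -1;
    curl_easy_getinfo(easy, CURLINFO_NUM_CONNECTS, &new_conns);
    printf("%s: new connections opened = %ld\n", url, new_conns);
    curl_multi_remove_handle(multi, easy);
    curl_easy_cleanup(easy);
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();  /* kept alive across transfers */
    fetch(multi, "https://cloud.r-project.org/");
    fetch(multi, "https://cloud.r-project.org/");  /* reuses the pool */
    curl_multi_cleanup(multi);
    curl_global_cleanup();
    return 0;
}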

Thoughts?



On Sat, Mar 2, 2024 at 3:07 PM Jeroen Ooms  wrote:
>
> Currently download.file() creates and terminates a new TLS connection
> for each download. This creates a lot of overhead that is expensive
> for both client and server (in particular the TLS handshake). Modern
> internet clients (including browsers) re-use connections for many
> HTTP requests.
>
> We can do this in R by creating a persistent libcurl "multi-handle".
> The R libcurl implementation already uses a multi-handle; however, it
> destroys it after each download, which defeats the purpose: the point
> is to keep the multi-handle alive and let libcurl maintain a
> persistent connection pool. This is particularly relevant for
> install.packages(), which needs to download many files from one and
> the same server.
>
> Here is a minimal proof-of-concept patch that re-uses one and the
> same multi-handle for all requests in R:
> https://github.com/r-devel/r-svn/pull/155/files
>
> Some quick benchmarking shows that this can lead to big speedups for
> download.packages() on high-bandwidth machines (such as CI runners).
> A quick test downloading 100 packages from CRAN showed a more than
> 10x speedup for me: https://github.com/r-devel/r-svn/pull/155
>
> Moreover, I think this may make install.packages() more robust. In CI
> build logs that download many packages, I often see one or two
> downloads randomly failing with a TLS-connect error. I am hopeful this
> problem will disappear when using a single connection to the CRAN
> server to download all the packages.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Big speedup in install.packages() by re-using connections

2024-03-02 Thread Jeroen Ooms
Currently download.file() creates and terminates a new TLS connection
for each download. This creates a lot of overhead that is expensive
for both client and server (in particular the TLS handshake). Modern
internet clients (including browsers) re-use connections for many
HTTP requests.

We can do this in R by creating a persistent libcurl "multi-handle".
The R libcurl implementation already uses a multi-handle; however, it
destroys it after each download, which defeats the purpose: the point
is to keep the multi-handle alive and let libcurl maintain a
persistent connection pool. This is particularly relevant for
install.packages(), which needs to download many files from one and
the same server.

Here is a minimal proof-of-concept patch that re-uses one and the
same multi-handle for all requests in R:
https://github.com/r-devel/r-svn/pull/155/files

Some quick benchmarking shows that this can lead to big speedups for
download.packages() on high-bandwidth machines (such as CI runners).
A quick test downloading 100 packages from CRAN showed a more than
10x speedup for me: https://github.com/r-devel/r-svn/pull/155

Moreover, I think this may make install.packages() more robust. In CI
build logs that download many packages, I often see one or two
downloads randomly failing with a TLS-connect error. I am hopeful this
problem will disappear when using a single connection to the CRAN
server to download all the packages.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel