Re: [R-pkg-devel] How to decrease time to import files in xlsx format?
Hey Igor,

I have been dealing with CSV/XLSX files from time to time, and depending on
the size of the files you mention, 180 seconds isn't really that much. From
my experience, vroom is the fastest I've encountered, but it handles CSV
files (I can vouch for it on files up to 8 GB). If you have the option to
import CSVs instead, you should give it a try. Other than that, there are
several other factors to consider, such as memory and disk read/write
capabilities. And again, it is just 180 seconds, so just suggest the user go
get a cup of coffee :)

Best,
Diego

On Wed, 5 Oct 2022 at 10:02, Igor L wrote:

> According to my internet research, it looks like readxl is the fastest
> package.
>
> The profvis package indicated that the bottleneck is indeed in importing
> the files.
>
> My processor has six cores, but when I use four of them the computer
> crashes completely. When I use three cores, it's still usable. So I did
> one more benchmark comparing a for loop, map_dfr and future_map_dfr
> (with multisession and three workers).
>
> After the benchmark was run 10 times, the result was:
>
>                  expr      min       lq     mean   median       uq      max neval
>          import_for() 140.9940 147.9722 160.7229 155.6459 172.4661 199.1059    10
>      import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726    10
>        import_furrr() 112.1374 116.4301 127.5976 129.0067 137.9179 140.8632    10
>
> To me this proves that the furrr package is the best solution in this
> case, but what would explain such a large difference with map_dfr?
>
> On Tue, Oct 4, 2022 at 16:58, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>
> > It looks like you are reading directly from URLs? How do you know the
> > delay is not network I/O delay?
> >
> > Parallel computation is not a panacea. It allows tasks _that are
> > CPU-bound_ to get through the CPU-intensive work faster. You need to
> > be certain that your tasks actually can benefit from parallelism
> > before using it... there is a significant overhead and added
> > complexity to using parallel processing that will lead to SLOWER
> > processing if mis-used.
> >
> > On October 4, 2022 11:29:54 AM PDT, Igor L wrote:
> > > Hello all,
> > >
> > > I'm developing an R package that basically downloads, imports,
> > > cleans and merges nine files in xlsx format, updated monthly from a
> > > public institution.
> > >
> > > The problem is that importing files in xlsx format is time
> > > consuming.
> > >
> > > My initial idea was to parallelize the execution of the read_xlsx
> > > function according to the number of cores in the user's processor,
> > > but apparently it didn't make much difference, since when
> > > parallelizing it the execution time went from 185.89 to 184.12
> > > seconds:
> > >
> > > # not parallelized code
> > > y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                     readxl::read_excel, sheet = 1, skip = 4,
> > >                     col_types = c(rep('text', 30)))
> > >
> > > # parallelized code
> > > plan(strategy = future::multicore(workers = 4))
> > > y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                            readxl::read_excel, sheet = 1, skip = 4,
> > >                            col_types = c(rep('text', 30)))
> > >
> > > Any suggestions to reduce the import processing time?
> > >
> > > Thanks in advance!
> >
> > --
> > Sent from my phone. Please excuse my brevity.

______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
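[Editor's sketch of Diego's CSV suggestion above: convert each xlsx file to CSV once, cache the result, and let later runs read the CSVs with vroom. This is a minimal, untested sketch; the cache path scheme and the 30-column all-text spec are placeholders taken from the thread, not part of any poster's actual code.]

```r
# Sketch: cache each xlsx sheet as a CSV on first read, then use vroom.
# Assumes the sheet/skip/col_types settings from the original post.
read_cached <- function(xlsx_path) {
  csv_path <- sub("\\.xlsx$", ".csv", xlsx_path)
  if (!file.exists(csv_path)) {
    x <- readxl::read_excel(xlsx_path, sheet = 1, skip = 4,
                            col_types = rep("text", 30))
    vroom::vroom_write(x, csv_path, delim = ",")
  }
  vroom::vroom(csv_path, delim = ",",
               col_types = vroom::cols(.default = "c"))
}

# vroom can also read many CSVs into one data frame in a single call:
# y <- vroom::vroom(list.files(dir.temp, pattern = "\\.csv$",
#                              full.names = TRUE))
```

The one-time conversion cost is paid on the first run only; on monthly updates, only the new files go through the slow xlsx parser.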
Re: [R-pkg-devel] CRAN package isoband and its reverse dependencies
Yes, we will make sure that this is fixed ASAP. There is no need to worry.

Hadley

On Wed, Oct 5, 2022 at 7:32 AM John Harrold wrote:
>
> Howdy Folks,
>
> I got a message from CRAN today telling me that I have a strong reverse
> dependency on the isoband package. But I'm not alone! It looks like more
> than 4700 other packages also have a strong dependency on this. Is there
> some organized effort to deal with this?
>
> Thanks
> John

--
http://hadley.nz
Re: [R-pkg-devel] How to decrease time to import files in xlsx format?
According to my internet research, it looks like readxl is the fastest
package.

The profvis package indicated that the bottleneck is indeed in importing
the files.

My processor has six cores, but when I use four of them the computer
crashes completely. When I use three cores, it's still usable. So I did one
more benchmark comparing a for loop, map_dfr and future_map_dfr (with
multisession and three workers).

After the benchmark was run 10 times, the result was:

                 expr      min       lq     mean   median       uq      max neval
         import_for() 140.9940 147.9722 160.7229 155.6459 172.4661 199.1059    10
     import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726    10
       import_furrr() 112.1374 116.4301 127.5976 129.0067 137.9179 140.8632    10

To me this proves that the furrr package is the best solution in this case,
but what would explain such a large difference with map_dfr?

On Tue, Oct 4, 2022 at 16:58, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

> It looks like you are reading directly from URLs? How do you know the
> delay is not network I/O delay?
>
> Parallel computation is not a panacea. It allows tasks _that are
> CPU-bound_ to get through the CPU-intensive work faster. You need to be
> certain that your tasks actually can benefit from parallelism before
> using it... there is a significant overhead and added complexity to
> using parallel processing that will lead to SLOWER processing if
> mis-used.
>
> On October 4, 2022 11:29:54 AM PDT, Igor L wrote:
> > Hello all,
> >
> > I'm developing an R package that basically downloads, imports, cleans
> > and merges nine files in xlsx format, updated monthly from a public
> > institution.
> >
> > The problem is that importing files in xlsx format is time consuming.
> >
> > My initial idea was to parallelize the execution of the read_xlsx
> > function according to the number of cores in the user's processor,
> > but apparently it didn't make much difference, since when
> > parallelizing it the execution time went from 185.89 to 184.12
> > seconds:
> >
> > # not parallelized code
> > y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                     readxl::read_excel, sheet = 1, skip = 4,
> >                     col_types = c(rep('text', 30)))
> >
> > # parallelized code
> > plan(strategy = future::multicore(workers = 4))
> > y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                            readxl::read_excel, sheet = 1, skip = 4,
> >                            col_types = c(rep('text', 30)))
> >
> > Any suggestions to reduce the import processing time?
> >
> > Thanks in advance!
>
> --
> Sent from my phone. Please excuse my brevity.
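[Editor's sketch of the parallel import discussed in this thread, restated as a self-contained snippet. It assumes the `dir.temp` and `lista.arquivos.locais` objects from the original post. Note it uses the multisession backend rather than multicore, since multicore is not available on Windows or inside RStudio; three workers matches the count Igor found stable.]

```r
# Sketch: parallel xlsx import with furrr over a multisession plan.
# dir.temp and lista.arquivos.locais are placeholders from the thread.
library(future)
library(furrr)

future::plan(future::multisession, workers = 3)

files <- file.path(dir.temp, lista.arquivos.locais)
y <- furrr::future_map_dfr(files,
                           readxl::read_excel, sheet = 1, skip = 4,
                           col_types = rep("text", 30))

future::plan(future::sequential)  # shut the workers down when done
```

One caveat consistent with Jeff's point above: each worker is a separate R session, so package loading and data transfer add fixed overhead, and the speedup only materialises if the xlsx parsing itself is CPU-bound.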
Re: [R-pkg-devel] CRAN package isoband and its reverse dependencies
On 05.10.22 at 14:32, John Harrold wrote:
> Howdy Folks,
>
> I got a message from CRAN today telling me that I have a strong reverse
> dependency on the isoband package. But I'm not alone! It looks like more
> than 4700 other packages also have a strong dependency on this. Is there
> some organized effort to deal with this?

The R package ggplot2 imports isoband, which results in the very large
number of strong dependencies. For my own R packages I could move ggplot2
from Imports to Suggests; however, I hope that I can simply wait this out.

Best,
Guido
Re: [R-pkg-devel] CRAN package isoband and its reverse dependencies
Hi John,

I think the short answer is yes; see the discussion on their GitHub:
https://github.com/wilkelab/isoband/issues/31

See also: https://github.com/tidyverse/ggplot2/issues/5006

Best,
Henrik

On Wed, Oct 5, 2022 at 13:32, John Harrold wrote:
>
> Howdy Folks,
>
> I got a message from CRAN today telling me that I have a strong reverse
> dependency on the isoband package. But I'm not alone! It looks like more
> than 4700 other packages also have a strong dependency on this. Is there
> some organized effort to deal with this?
>
> Thanks
> John

--
Dr. Henrik Singmann
Lecturer, Experimental Psychology
University College London (UCL), UK
http://singmann.org
[R-pkg-devel] CRAN package isoband and its reverse dependencies
Howdy Folks,

I got a message from CRAN today telling me that I have a strong reverse
dependency on the isoband package. But I'm not alone! It looks like more
than 4700 other packages also have a strong dependency on this. Is there
some organized effort to deal with this?

Thanks
John