Re: [R-pkg-devel] How to decrease time to import files in xlsx format?

2022-10-05 Thread Diego de Freitas Coêlho
Hey Igor,

I have been dealing with *CSV*/*XLSX* files from time to time, and depending
on the size of the files you are mentioning, 180 seconds isn't really that
much.
From my experience, *vroom* is the fastest I've encountered, but it handles
*CSV* files (I can vouch for its use with files of up to 8 GB).
If you have the option to import CSVs instead, you should give it a try.
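For illustration, a minimal sketch of the vroom approach (the file path and
all-character column spec here are hypothetical, not from the thread):

```r
# Hypothetical sketch: reading a large delimited file with vroom.
# vroom indexes the file and reads lazily, which is where its speed comes from.
library(vroom)

df <- vroom(
  "monthly-report.csv",                          # placeholder path
  col_types = cols(.default = col_character())   # read everything as text
)
```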

Other than that there are several other factors to be considered, such as
memory and disk read/write capabilities.
And again, it is just 180 seconds, so just suggest the user go get a cup of
coffee :)

Best,
Diego

On Wed, 5 Oct 2022 at 10:02, Igor L  wrote:

> According to my internet research, it looks like readxl is the fastest
> package.
>
> The profvis package indicated that the bottleneck is indeed in importing
> the files.
>
> My processor has six cores, but when I use four of them the computer
> crashes completely. When I use three, it is still usable. So I ran one
> more benchmark comparing a for loop, map_dfr, and future_map_dfr (with
> multisession and three workers).
>
> After the benchmark was run 10 times, the result was:
>
>  expr               min      lq       mean     median   uq       max   neval
>  import_for()     140.9940 147.9722 160.7229 155.6459 172.4661 199.1059  10
>  import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726  10
>  import_furrr()   112.1374 116.4301 127.5976 129.0067 137.9179 140.8632  10
>
> For me this confirms that the furrr package is the best solution in this
> case, but what would explain such a large difference from map_dfr?
>
> On Tue, Oct 4, 2022 at 16:58, Jeff Newmiller <
> jdnew...@dcn.davis.ca.us> wrote:
>
> > It looks like you are reading directly from URLs? How do you know the
> > delay is not network I/O delay?
> >
> > Parallel computation is not a panacea. It allows tasks _that are
> > CPU-bound_ to get through the CPU-intensive work faster. You need to be
> > certain that your tasks actually can benefit from parallelism before
> > using it... there is a significant overhead and added complexity to
> > using parallel processing that will lead to SLOWER processing if misused.
> >
> > On October 4, 2022 11:29:54 AM PDT, Igor L  wrote:
> > >Hello all,
> > >
> > >I'm developing an R package that basically downloads, imports, cleans
> > >and merges nine files in xlsx format updated monthly from a public
> > >institution.
> > >
> > >The problem is that importing files in xlsx format is time-consuming.
> > >
> > >My initial idea was to parallelize the execution of the read_xlsx
> > >function according to the number of cores in the user's processor,
> > >but apparently it didn't make much difference, since when trying to
> > >parallelize it the execution time went from 185.89 to 184.12 seconds:
> > >
> > ># not parallelized code
> > >y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                    readxl::read_excel, sheet = 1, skip = 4,
> > >                    col_types = c(rep('text', 30)))
> > >
> > ># parallelized code
> > >plan(strategy = future::multicore(workers = 4))
> > >y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                           readxl::read_excel, sheet = 1, skip = 4,
> > >                           col_types = c(rep('text', 30)))
> > >
> > > Any suggestions to reduce the import processing time?
> > >
> > >Thanks in advance!
> > >
> >
> > --
> > Sent from my phone. Please excuse my brevity.
> >
>

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] CRAN package isoband and its reverse dependencies

2022-10-05 Thread Hadley Wickham
Yes, we will make sure that this is fixed ASAP. There is no need to worry.

Hadley

On Wed, Oct 5, 2022 at 7:32 AM John Harrold  wrote:
>
> Howdy Folks,
>
> I got a message from CRAN today telling me that I have a strong reverse
> dependency on the isoband package. But I'm not alone! It looks like more
> than 4700 other packages also have a strong dependency on this. Is there
> some organized effort to deal with this?
>
> Thanks
> John
>



-- 
http://hadley.nz



Re: [R-pkg-devel] How to decrease time to import files in xlsx format?

2022-10-05 Thread Igor L
According to my internet research, it looks like readxl is the fastest
package.

The profvis package indicated that the bottleneck is indeed in importing
the files.

My processor has six cores, but when I use four of them the computer
crashes completely. When I use three, it is still usable. So I ran one
more benchmark comparing a for loop, map_dfr, and future_map_dfr (with
multisession and three workers).

After the benchmark was run 10 times, the result was:

 expr               min      lq       mean     median   uq       max   neval
 import_for()     140.9940 147.9722 160.7229 155.6459 172.4661 199.1059  10
 import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726  10
 import_furrr()   112.1374 116.4301 127.5976 129.0067 137.9179 140.8632  10

For me this confirms that the furrr package is the best solution in this
case, but what would explain such a large difference from map_dfr?
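For reference, a sketch of the winning setup described above; the data
directory, file layout, and 30-column sheet structure are assumptions carried
over from the earlier messages, not verified details:

```r
# Sketch of the three-worker multisession import described above.
library(future)
library(furrr)

# Placeholder directory of local .xlsx files
files <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE)

plan(multisession, workers = 3)   # three background R sessions

import_furrr <- function() {
  furrr::future_map_dfr(
    files,
    readxl::read_excel,
    sheet = 1, skip = 4,
    col_types = rep("text", 30)
  )
}

# The 10-run timings above could be produced with, e.g.:
# microbenchmark::microbenchmark(
#   import_for(), import_map_dfr(), import_furrr(), times = 10)

plan(sequential)                  # release the workers when done
```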

On Tue, Oct 4, 2022 at 16:58, Jeff Newmiller <
jdnew...@dcn.davis.ca.us> wrote:

> It looks like you are reading directly from URLs? How do you know the
> delay is not network I/O delay?
>
> Parallel computation is not a panacea. It allows tasks _that are
> CPU-bound_ to get through the CPU-intensive work faster. You need to be
> certain that your tasks actually can benefit from parallelism before using
> it... there is a significant overhead and added complexity to using
> parallel processing that will lead to SLOWER processing if misused.
>
> On October 4, 2022 11:29:54 AM PDT, Igor L  wrote:
> >Hello all,
> >
> >I'm developing an R package that basically downloads, imports, cleans and
> >merges nine files in xlsx format updated monthly from a public institution.
> >
> >The problem is that importing files in xlsx format is time-consuming.
> >
> >My initial idea was to parallelize the execution of the read_xlsx function
> >according to the number of cores in the user's processor, but apparently it
> >didn't make much difference, since when trying to parallelize it the
> >execution time went from 185.89 to 184.12 seconds:
> >
> ># not parallelized code
> >y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                    readxl::read_excel, sheet = 1, skip = 4,
> >                    col_types = c(rep('text', 30)))
> >
> ># parallelized code
> >plan(strategy = future::multicore(workers = 4))
> >y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                           readxl::read_excel, sheet = 1, skip = 4,
> >                           col_types = c(rep('text', 30)))
> >
> > Any suggestions to reduce the import processing time?
> >
> >Thanks in advance!
> >
>
> --
> Sent from my phone. Please excuse my brevity.
>
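One way to act on Jeff's question is to time the download and the parse
separately; a rough sketch, with a placeholder URL that is not from the
thread:

```r
# Separate network time from parse time for a single file.
url  <- "https://example.org/monthly-report.xlsx"   # placeholder URL
dest <- tempfile(fileext = ".xlsx")

t_net   <- system.time(download.file(url, dest, mode = "wb"))
t_parse <- system.time(readxl::read_excel(dest, sheet = 1, skip = 4))

t_net["elapsed"]    # I/O-bound portion: parallel workers won't help much here
t_parse["elapsed"]  # CPU-bound portion: only this part can benefit from parallelism
```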



Re: [R-pkg-devel] CRAN package isoband and its reverse dependencies

2022-10-05 Thread Guido Schwarzer

On 05.10.22 at 14:32, John Harrold wrote:


Howdy Folks,

I got a message from CRAN today telling me that I have a strong reverse
dependency on the isoband package. But I'm not alone! It looks like more
than 4700 other packages also have a strong dependency on this. Is there
some organized effort to deal with this?


The R package ggplot2 imports isoband, which results in the very large
number of strong dependencies.


For my own R packages I could move ggplot2 from Imports to Suggests;
however, I hope that I can simply wait this out.
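For anyone who does want to make that move, listing ggplot2 under Suggests:
in DESCRIPTION means guarding its use at run time; a sketch, where
plot_results() is a hypothetical function rather than one from the thread:

```r
# Sketch: calling ggplot2 conditionally once it is only in Suggests.
# plot_results() is a hypothetical package function.
plot_results <- function(data) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("Package 'ggplot2' is required for plotting; please install it.")
  }
  ggplot2::ggplot(data, ggplot2::aes(x, y)) +
    ggplot2::geom_point()
}
```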


Best,

Guido



Re: [R-pkg-devel] CRAN package isoband and its reverse dependencies

2022-10-05 Thread Henrik Singmann
Hi John,

I think the short answer is yes; see the discussion on their GitHub:
https://github.com/wilkelab/isoband/issues/31
See also: https://github.com/tidyverse/ggplot2/issues/5006

Best,
Henrik

On Wed, 5 Oct 2022 at 13:32, John Harrold wrote:
>
> Howdy Folks,
>
> I got a message from CRAN today telling me that I have a strong reverse
> dependency on the isoband package. But I'm not alone! It looks like more
> than 4700 other packages also have a strong dependency on this. Is there
> some organized effort to deal with this?
>
> Thanks
> John
>



-- 
Dr. Henrik Singmann
Lecturer, Experimental Psychology
University College London (UCL), UK
http://singmann.org



[R-pkg-devel] CRAN package isoband and its reverse dependencies

2022-10-05 Thread John Harrold
Howdy Folks,

I got a message from CRAN today telling me that I have a strong reverse
dependency on the isoband package. But I'm not alone! It looks like more
than 4700 other packages also have a strong dependency on this. Is there
some organized effort to deal with this?

Thanks
John
