Re: [R] Possible Improvement to sapply

2018-03-14 Thread Doran, Harold
Well thanks, Martin, and glad to see there is some potential here. This
wasn't reported as a bug but, as you note, really as a question originally,
with an invitation to critique my code.

On 3/14/18, 5:11 AM, "Martin Maechler" <maech...@stat.math.ethz.ch> wrote:

>>>>>> Henrik Bengtsson <henrik.bengts...@gmail.com>
>>>>>> on Tue, 13 Mar 2018 10:12:55 -0700 writes:
>
>> FYI, in R devel (to become 3.5.0), there's isFALSE() which will cut
>> some corners compared to identical():
>
>> > microbenchmark::microbenchmark(identical(FALSE, FALSE),
>>isFALSE(FALSE))
>> Unit: nanoseconds
>>                     expr min   lq    mean median     uq   max neval
>>  identical(FALSE, FALSE) 984 1138 1694.13 1218.0 1337.5 13584   100
>>           isFALSE(FALSE) 713  761 1133.53  809.5  871.5 18619   100
>
>> > microbenchmark::microbenchmark(identical(TRUE, FALSE), isFALSE(TRUE))
>> Unit: nanoseconds
>>                    expr  min     lq    mean median   uq   max neval
>>  identical(TRUE, FALSE) 1009 1103.5 2228.20 1170.5 1357 14346   100
>>           isFALSE(TRUE)  718  760.0 1298.98  798.0  898 17782   100
>
>> > microbenchmark::microbenchmark(identical("array", FALSE),
>>isFALSE("array"))
>> Unit: nanoseconds
>>                       expr min     lq    mean median     uq  max neval
>>  identical("array", FALSE) 975 1058.5 1257.95 1119.5 1250.0 9299   100
>>           isFALSE("array") 409  433.5  658.76  446.0  476.5 9383   100
>
>Thank you Henrik!
>
>The speed of the new isTRUE() and isFALSE() is indeed amazing
>compared to identical() which was written to be fast itself.
>
>Note that the new code goes back to a proposal by  Hervé Pagès
>(of Bioconductor fame) in a thread with R core in April 2017.
>The goal of the new code actually *was*  to allow calls like
>
>  isTRUE(c(a = TRUE))
>
>to become TRUE rather than improving speed.
>The new source code is at the end of  R/src/library/base/R/identical.R
>
>## NB:  is.logical(.) will never dispatch:
>## -- base::is.logical(x)  <==>  typeof(x) == "logical"
>isTRUE  <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && x
>isFALSE <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && !x
>
>and one *reason* this is so fast is that all  6  functions which
>are called are primitives :
>
>> sapply(codetools::findGlobals(isTRUE), function(fn)
>>is.primitive(get(fn)))
>    !   &&   == is.logical is.na length 
> TRUE TRUE TRUE       TRUE  TRUE   TRUE 
>
>and a 2nd reason is probably the many recent improvements to the
>byte compiler.
>
>
>> That could probably be used also in sapply().  The difference is that
>> isFALSE() is a bit more liberal than identical(x, FALSE), e.g.
>
>> > isFALSE(c(a = FALSE))
>> [1] TRUE
>> > identical(c(a = FALSE), FALSE)
>> [1] FALSE
>
>> Assuming the latter is not an issue, there are 69 places in base R
>> where isFALSE() could be used:
>
>> $ grep -E "identical[(][^,]+,[ ]*FALSE[)]" -r --include="*.R" | grep -F
>>"/R/" | wc
>>      69     326    5472
>
>> and another 59 where isTRUE() can be used:
>
>> $ grep -E "identical[(][^,]+,[ ]*TRUE[)]" -r --include="*.R" | grep -F
>>"/R/" | wc
>>      59     307    5021
>
>Beautiful use of  'grep' -- thank you for those above, as well.
>It does need a quick manual check, but if I use the above grep
>from Emacs (via  'M-x grep')  or even better via a TAGS table
>and M-x tags-query-replace  I should be able to do the changes
>pretty quickly... and will start looking into that later today.
>
>Interestingly and to my great pleasure, the first part of the
>'Subject' of this mailing list thread, "Possible Improvement",
>*has* become true after all --
>
>-- thanks to Henrik !
>
>Martin Maechler
>ETH Zurich
>
>
>
>> On Tue, Mar 13, 2018 at 9:21 AM, Doran, Harold <hdo...@air.org> wrote:
>> > Quite possibly, and I'll look into that. Aside from the work I was
>>doing, however, I wonder if there is a way such that sapply could avoid
>>the overhead of having to call the identical function to determine the
>>conditional path.
>> >
>> >
>> >
>> > From: William Dunlap [mailto:wdun...@tibco.com]
>> > Sent: Tuesday, March 13, 2018 12:14 PM
>> > To: Doran, Harold <hdo...@air.org>
>> > Cc: 

Re: [R] Possible Improvement to sapply

2018-03-14 Thread Martin Maechler
>>>>> Henrik Bengtsson <henrik.bengts...@gmail.com>
>>>>> on Tue, 13 Mar 2018 10:12:55 -0700 writes:

> FYI, in R devel (to become 3.5.0), there's isFALSE() which will cut
> some corners compared to identical():

> > microbenchmark::microbenchmark(identical(FALSE, FALSE), isFALSE(FALSE))
> Unit: nanoseconds
>                     expr min   lq    mean median     uq   max neval
>  identical(FALSE, FALSE) 984 1138 1694.13 1218.0 1337.5 13584   100
>           isFALSE(FALSE) 713  761 1133.53  809.5  871.5 18619   100

> > microbenchmark::microbenchmark(identical(TRUE, FALSE), isFALSE(TRUE))
> Unit: nanoseconds
>                    expr  min     lq    mean median   uq   max neval
>  identical(TRUE, FALSE) 1009 1103.5 2228.20 1170.5 1357 14346   100
>           isFALSE(TRUE)  718  760.0 1298.98  798.0  898 17782   100

> > microbenchmark::microbenchmark(identical("array", FALSE), isFALSE("array"))
> Unit: nanoseconds
>                       expr min     lq    mean median     uq  max neval
>  identical("array", FALSE) 975 1058.5 1257.95 1119.5 1250.0 9299   100
>           isFALSE("array") 409  433.5  658.76  446.0  476.5 9383   100

Thank you Henrik!

The speed of the new isTRUE() and isFALSE() is indeed amazing
compared to identical() which was written to be fast itself.

Note that the new code goes back to a proposal by  Hervé Pagès
(of Bioconductor fame) in a thread with R core in April 2017.
The goal of the new code actually *was*  to allow calls like

  isTRUE(c(a = TRUE))   

to become TRUE rather than improving speed.
The new source code is at the end of  R/src/library/base/R/identical.R

## NB:  is.logical(.) will never dispatch:
## -- base::is.logical(x)  <==>  typeof(x) == "logical"
isTRUE  <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && x
isFALSE <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && !x
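
To make the semantics of these definitions concrete, here is a small sketch (requires R >= 3.5.0, where isTRUE()/isFALSE() have this form; the names and values are invented for illustration):

```r
## isTRUE()/isFALSE() test only type, length, and value,
## so attributes such as names are ignored:
isTRUE(c(a = TRUE))            # TRUE
identical(c(a = TRUE), TRUE)   # FALSE -- identical() compares attributes too

## NA is rejected by the !is.na(x) clause:
isTRUE(NA)                     # FALSE

## and the length must be exactly 1:
isFALSE(c(FALSE, FALSE))       # FALSE
```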

and one *reason* this is so fast is that all  6  functions which
are called are primitives :

> sapply(codetools::findGlobals(isTRUE), function(fn) is.primitive(get(fn)))
   !   &&   == is.logical is.na length 
TRUE TRUE TRUE       TRUE  TRUE   TRUE 

and a 2nd reason is probably the many recent improvements to the
byte compiler.


> That could probably be used also in sapply().  The difference is that
> isFALSE() is a bit more liberal than identical(x, FALSE), e.g.

> > isFALSE(c(a = FALSE))
> [1] TRUE
> > identical(c(a = FALSE), FALSE)
> [1] FALSE

> Assuming the latter is not an issue, there are 69 places in base R
> where isFALSE() could be used:

> $ grep -E "identical[(][^,]+,[ ]*FALSE[)]" -r --include="*.R" | grep -F "/R/" 
> | wc
>      69     326    5472

> and another 59 where isTRUE() can be used:

> $ grep -E "identical[(][^,]+,[ ]*TRUE[)]" -r --include="*.R" | grep -F "/R/" 
> | wc
>      59     307    5021

Beautiful use of  'grep' -- thank you for those above, as well.
It does need a quick manual check, but if I use the above grep
from Emacs (via  'M-x grep')  or even better via a TAGS table
and M-x tags-query-replace  I should be able to do the changes
pretty quickly... and will start looking into that later today.

Interestingly and to my great pleasure, the first part of the
'Subject' of this mailing list thread, "Possible Improvement",
*has* become true after all --

-- thanks to Henrik !

Martin Maechler
ETH Zurich



> On Tue, Mar 13, 2018 at 9:21 AM, Doran, Harold <hdo...@air.org> wrote:
> > Quite possibly, and I’ll look into that. Aside from the work I was doing, 
> > however, I wonder if there is a way such that sapply could avoid the 
> > overhead of having to call the identical function to determine the 
> > conditional path.
> >
> >
> >
> > From: William Dunlap [mailto:wdun...@tibco.com]
> > Sent: Tuesday, March 13, 2018 12:14 PM
> > To: Doran, Harold <hdo...@air.org>
> > Cc: Martin Morgan <martin.mor...@roswellpark.org>; r-help@r-project.org
> > Subject: Re: [R] Possible Improvement to sapply
> >
> > Could your code use vapply instead of sapply?  vapply forces you to declare 
> > the type and dimensions
> > of FUN's output and stops if any call to FUN does not match the 
> > declaration.  It can use much less
> > memory and time than sapply because it fills in the output array as it goes 
> > instead of calling lapply()
> > and seeing how it could be simplified.
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com<http://tibco.com>
> >
> > On Tue, Mar

Re: [R] Possible Improvement to sapply

2018-03-13 Thread Henrik Bengtsson
FYI, in R devel (to become 3.5.0), there's isFALSE() which will cut
some corners compared to identical():

> microbenchmark::microbenchmark(identical(FALSE, FALSE), isFALSE(FALSE))
Unit: nanoseconds
                    expr min   lq    mean median     uq   max neval
 identical(FALSE, FALSE) 984 1138 1694.13 1218.0 1337.5 13584   100
          isFALSE(FALSE) 713  761 1133.53  809.5  871.5 18619   100

> microbenchmark::microbenchmark(identical(TRUE, FALSE), isFALSE(TRUE))
Unit: nanoseconds
                   expr  min     lq    mean median   uq   max neval
 identical(TRUE, FALSE) 1009 1103.5 2228.20 1170.5 1357 14346   100
          isFALSE(TRUE)  718  760.0 1298.98  798.0  898 17782   100

> microbenchmark::microbenchmark(identical("array", FALSE), isFALSE("array"))
Unit: nanoseconds
                      expr min     lq    mean median     uq  max neval
 identical("array", FALSE) 975 1058.5 1257.95 1119.5 1250.0 9299   100
          isFALSE("array") 409  433.5  658.76  446.0  476.5 9383   100

That could probably be used also in sapply().  The difference is that
isFALSE() is a bit more liberal than identical(x, FALSE), e.g.

> isFALSE(c(a = FALSE))
[1] TRUE
> identical(c(a = FALSE), FALSE)
[1] FALSE

Assuming the latter is not an issue, there are 69 places in base R
where isFALSE() could be used:

$ grep -E "identical[(][^,]+,[ ]*FALSE[)]" -r --include="*.R" | grep
-F "/R/" | wc
     69     326    5472

and another 59 where isTRUE() can be used:

$ grep -E "identical[(][^,]+,[ ]*TRUE[)]" -r --include="*.R" | grep -F
"/R/" | wc
     59     307    5021

/Henrik

On Tue, Mar 13, 2018 at 9:21 AM, Doran, Harold <hdo...@air.org> wrote:
> Quite possibly, and I’ll look into that. Aside from the work I was doing, 
> however, I wonder if there is a way such that sapply could avoid the overhead 
> of having to call the identical function to determine the conditional path.
>
>
>
> From: William Dunlap [mailto:wdun...@tibco.com]
> Sent: Tuesday, March 13, 2018 12:14 PM
> To: Doran, Harold <hdo...@air.org>
> Cc: Martin Morgan <martin.mor...@roswellpark.org>; r-help@r-project.org
> Subject: Re: [R] Possible Improvement to sapply
>
> Could your code use vapply instead of sapply?  vapply forces you to declare 
> the type and dimensions
> of FUN's output and stops if any call to FUN does not match the declaration.  
> It can use much less
> memory and time than sapply because it fills in the output array as it goes 
> instead of calling lapply()
> and seeing how it could be simplified.
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com<http://tibco.com>
>
> On Tue, Mar 13, 2018 at 7:06 AM, Doran, Harold 
> <hdo...@air.org<mailto:hdo...@air.org>> wrote:
> Martin
>
> In terms of the context of the actual problem, sapply is called millions of 
> times because the work involves scoring individual students who took a test. 
> A score is generated for student A, then for student B, and so on, and there 
> are millions of students. The psychometric process of scoring students is 
> complex and our code makes use of sapply many times for each student.
>
> The toy example used length just to illustrate; our actual code doesn't do 
> that. But your point is well taken, there may be a very good counterexample 
> why my proposal doesn't achieve the goal in a generalizable way.
>
>
>
> -Original Message-
> From: Martin Morgan 
> [mailto:martin.mor...@roswellpark.org<mailto:martin.mor...@roswellpark.org>]
> Sent: Tuesday, March 13, 2018 9:43 AM
> To: Doran, Harold <hdo...@air.org<mailto:hdo...@air.org>>; 
> 'r-help@r-project.org<mailto:r-help@r-project.org>' 
> <r-help@r-project.org<mailto:r-help@r-project.org>>
> Subject: Re: [R] Possible Improvement to sapply
>
>
>
> On 03/13/2018 09:23 AM, Doran, Harold wrote:
>> While working with sapply, the documentation states that the simplify
>> argument will yield a vector, matrix etc "when possible". I was
>> curious how the code actually defined "as possible" and see this
>> within the function
>>
>> if (!identical(simplify, FALSE) && length(answer))
>>
>> This seems superfluous to me, in particular this part:
>>
>> !identical(simplify, FALSE)
>>
>> The preceding code could be reduced to
>>
>> if (simplify && length(answer))
>>
>> and it would not need to execute the call to identical in order to trigger 
>> the conditional execution, which is known from the user's simplify = TRUE or 
>> FALSE inputs. I *think* the extra call to identical is just unnecessary 
>> overhead in this instance.
>>
>

Re: [R] Possible Improvement to sapply

2018-03-13 Thread William Dunlap via R-help
Could your code use vapply instead of sapply?  vapply forces you to declare
the type and dimensions
of FUN's output and stops if any call to FUN does not match the
declaration.  It can use much less
memory and time than sapply because it fills in the output array as it goes
instead of calling lapply()
and seeing how it could be simplified.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
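
A minimal sketch of the vapply() contract described above (the list `x` and the templates are invented for illustration):

```r
x <- list(a = rnorm(100), b = rnorm(100))

## FUN.VALUE declares the type and length of each result:
## here, exactly one integer per element of x.
n <- vapply(x, length, FUN.VALUE = integer(1))
n                            # c(a = 100L, b = 100L)

## A result that does not match the template is an error,
## not a silent fallback to a list as sapply() would give:
res <- try(vapply(x, range, FUN.VALUE = numeric(1)), silent = TRUE)
inherits(res, "try-error")   # TRUE -- range() returns two values
```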

On Tue, Mar 13, 2018 at 7:06 AM, Doran, Harold <hdo...@air.org> wrote:

> Martin
>
> In terms of the context of the actual problem, sapply is called millions of
> times because the work involves scoring individual students who took a
> test. A score is generated for student A, then for student B, and so on,
> and there are millions of students. The psychometric process of scoring
> students is complex and our code makes use of sapply many times for each
> student.
>
> The toy example used length just to illustrate; our actual code doesn't do
> that. But your point is well taken, there may be a very good counterexample
> why my proposal doesn't achieve the goal in a generalizable way.
>
>
>
> -Original Message-
> From: Martin Morgan [mailto:martin.mor...@roswellpark.org]
> Sent: Tuesday, March 13, 2018 9:43 AM
> To: Doran, Harold <hdo...@air.org>; 'r-help@r-project.org' <
> r-help@r-project.org>
> Subject: Re: [R] Possible Improvement to sapply
>
>
>
> On 03/13/2018 09:23 AM, Doran, Harold wrote:
> > While working with sapply, the documentation states that the simplify
> > argument will yield a vector, matrix etc "when possible". I was
> > curious how the code actually defined "as possible" and see this
> > within the function
> >
> > if (!identical(simplify, FALSE) && length(answer))
> >
> > This seems superfluous to me, in particular this part:
> >
> > !identical(simplify, FALSE)
> >
> > The preceding code could be reduced to
> >
> > if (simplify && length(answer))
> >
> > and it would not need to execute the call to identical in order to
> trigger the conditional execution, which is known from the user's simplify
> = TRUE or FALSE inputs. I *think* the extra call to identical is just
> unnecessary overhead in this instance.
> >
> > Take for example, the following toy example code and benchmark results
> and a small modification to sapply:
> >
> > myList <- list(a = rnorm(100), b = rnorm(100))
> >
> > answer <- lapply(X = myList, FUN = length)
> > simplify = TRUE
> >
> > library(microbenchmark)
> >
> > mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
> >     FUN <- match.fun(FUN)
> >     answer <- lapply(X = X, FUN = FUN, ...)
> >     if (USE.NAMES && is.character(X) && is.null(names(answer)))
> >         names(answer) <- X
> >     if (simplify && length(answer))
> >         simplify2array(answer, higher = (simplify == "array"))
> >     else answer
> > }
> >
> >
> >> microbenchmark(sapply(myList, length), times = 1L)
> > Unit: microseconds
> >                    expr    min     lq     mean median     uq    max neval
> >  sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46     1
> >> microbenchmark(mySapply(myList, length), times = 1L)
> > Unit: microseconds
> >                      expr    min     lq     mean median     uq      max neval
> >  mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804     1
> >
> > My benchmark timings show a timing improvement with only that small
> change made and it is seemingly nominal. In my actual work, the sapply
> function is called millions of times and this additional overhead
> propagates to some overall additional computing time.
> >
> > I have done some limited testing on various real data to verify that the
> objects produced under both variants of the sapply (base R and my modified)
> yield identical objects when simplify is both TRUE and FALSE.
> >
> > Perhaps someone else sees a counterexample where my proposed fix does
> not cause for sapply to behave as expected.
> >
>
> Check out ?sapply for possible values of `simplify=` to see why your
> proposal is not adequate.
>
> For your example, lengths() is an order of magnitude faster than sapply(.,
> length). This is an example of the advantages of vectorization (single call
> to an R function implemented in C) versus iteration (`for` loops but also
> the *apply family calling an R function many times).
> vapply() might also be relevant.
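
The lengths() suggestion above is easy to check; a sketch (the toy list is invented, and actual timings will vary by machine):

```r
x <- replicate(1000, rnorm(10), simplify = FALSE)

## lengths() traverses the whole list in a single C-level call ...
l1 <- lengths(x)

## ... whereas sapply() calls the R function `length` once per
## element and then simplifies the resulting list.
l2 <- sapply(x, length)

identical(l1, l2)   # TRUE -- same integer vector either way
```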
>
> Often performance improvements come from look

Re: [R] Possible Improvement to sapply

2018-03-13 Thread Martin Maechler
>>>>> Doran, Harold <hdo...@air.org>
>>>>> on Tue, 13 Mar 2018 16:14:19 + writes:

> You’re right, it sure does. My suggestion causes it to fail when simplify 
= ‘array’

> From: William Dunlap [mailto:wdun...@tibco.com]
> Sent: Tuesday, March 13, 2018 12:11 PM
> To: Doran, Harold <hdo...@air.org>
> Cc: r-help@r-project.org
> Subject: Re: [R] Possible Improvement to sapply

> Wouldn't that change how simplify='array' is handled?

>> str(sapply(1:3, function(x)diag(x,5,2), simplify="array"))
> int [1:5, 1:2, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>> str(sapply(1:3, function(x)diag(x,5,2), simplify=TRUE))
> int [1:10, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>> str(sapply(1:3, function(x)diag(x,5,2), simplify=FALSE))
> List of 3
> $ : int [1:5, 1:2] 1 0 0 0 0 0 1 0 0 0
> $ : int [1:5, 1:2] 2 0 0 0 0 0 2 0 0 0
> $ : int [1:5, 1:2] 3 0 0 0 0 0 3 0 0 0


> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com<http://tibco.com>

Yes, indeed, thank you Bill!

I sometimes marvel at how much the mental capacities of R core
are underestimated.  Of course, nobody is perfect, but the bugs
we produce are really more subtle than that ...  ;-)

Martin Maechler
R core  


> On Tue, Mar 13, 2018 at 6:23 AM, Doran, Harold 
<hdo...@air.org<mailto:hdo...@air.org>> wrote:
> While working with sapply, the documentation states that the simplify 
argument will yield a vector, matrix etc "when possible". I was curious how the 
code actually defined "as possible" and see this within the function

> if (!identical(simplify, FALSE) && length(answer))

> This seems superfluous to me, in particular this part:

> !identical(simplify, FALSE)

> The preceding code could be reduced to

> if (simplify && length(answer))

> and it would not need to execute the call to identical in order to 
trigger the conditional execution, which is known from the user's simplify = 
TRUE or FALSE inputs. I *think* the extra call to identical is just unnecessary 
overhead in this instance.

> Take for example, the following toy example code and benchmark results 
and a small modification to sapply:

> myList <- list(a = rnorm(100), b = rnorm(100))

> answer <- lapply(X = myList, FUN = length)
> simplify = TRUE

> library(microbenchmark)

> mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
>     FUN <- match.fun(FUN)
>     answer <- lapply(X = X, FUN = FUN, ...)
>     if (USE.NAMES && is.character(X) && is.null(names(answer)))
>         names(answer) <- X
>     if (simplify && length(answer))
>         simplify2array(answer, higher = (simplify == "array"))
>     else answer
> }


>> microbenchmark(sapply(myList, length), times = 1L)
> Unit: microseconds
>                    expr    min     lq     mean median     uq    max neval
>  sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46     1
>> microbenchmark(mySapply(myList, length), times = 1L)
> Unit: microseconds
>                      expr    min     lq     mean median     uq      max neval
>  mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804     1

> My benchmark timings show a timing improvement with only that small 
change made and it is seemingly nominal. In my actual work, the sapply function 
is called millions of times and this additional overhead propagates to some 
overall additional computing time.

> I have done some limited testing on various real data to verify that the 
objects produced under both variants of the sapply (base R and my modified) 
yield identical objects when simplify is both TRUE and FALSE.

> Perhaps someone else sees a counterexample where my proposed fix does not 
cause for sapply to behave as expected.

> Harold

> __
> R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To 
UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


> [[alternative HTML version deleted]]

> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Possible Improvement to sapply

2018-03-13 Thread Doran, Harold
Quite possibly, and I’ll look into that. Aside from the work I was doing, 
however, I wonder if there is a way such that sapply could avoid the overhead 
of having to call the identical function to determine the conditional path.



From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Tuesday, March 13, 2018 12:14 PM
To: Doran, Harold <hdo...@air.org>
Cc: Martin Morgan <martin.mor...@roswellpark.org>; r-help@r-project.org
Subject: Re: [R] Possible Improvement to sapply

Could your code use vapply instead of sapply?  vapply forces you to declare the 
type and dimensions
of FUN's output and stops if any call to FUN does not match the declaration.  
It can use much less
memory and time than sapply because it fills in the output array as it goes 
instead of calling lapply()
and seeing how it could be simplified.

Bill Dunlap
TIBCO Software
wdunlap tibco.com<http://tibco.com>

On Tue, Mar 13, 2018 at 7:06 AM, Doran, Harold 
<hdo...@air.org<mailto:hdo...@air.org>> wrote:
Martin

In terms of the context of the actual problem, sapply is called millions of 
times because the work involves scoring individual students who took a test. 
A score is generated for student A, then for student B, and so on, and there 
are millions of students. The psychometric process of scoring students is 
complex and our code makes use of sapply many times for each student.

The toy example used length just to illustrate; our actual code doesn't do 
that. But your point is well taken, there may be a very good counterexample 
why my proposal doesn't achieve the goal in a generalizable way.



-Original Message-
From: Martin Morgan 
[mailto:martin.mor...@roswellpark.org<mailto:martin.mor...@roswellpark.org>]
Sent: Tuesday, March 13, 2018 9:43 AM
To: Doran, Harold <hdo...@air.org<mailto:hdo...@air.org>>; 
'r-help@r-project.org<mailto:r-help@r-project.org>' 
<r-help@r-project.org<mailto:r-help@r-project.org>>
Subject: Re: [R] Possible Improvement to sapply



On 03/13/2018 09:23 AM, Doran, Harold wrote:
> While working with sapply, the documentation states that the simplify
> argument will yield a vector, matrix etc "when possible". I was
> curious how the code actually defined "as possible" and see this
> within the function
>
> if (!identical(simplify, FALSE) && length(answer))
>
> This seems superfluous to me, in particular this part:
>
> !identical(simplify, FALSE)
>
> The preceding code could be reduced to
>
> if (simplify && length(answer))
>
> and it would not need to execute the call to identical in order to trigger 
> the conditional execution, which is known from the user's simplify = TRUE or 
> FALSE inputs. I *think* the extra call to identical is just unnecessary 
> overhead in this instance.
>
> Take for example, the following toy example code and benchmark results and a 
> small modification to sapply:
>
> myList <- list(a = rnorm(100), b = rnorm(100))
>
> answer <- lapply(X = myList, FUN = length)
> simplify = TRUE
>
> library(microbenchmark)
>
> mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
>     FUN <- match.fun(FUN)
>     answer <- lapply(X = X, FUN = FUN, ...)
>     if (USE.NAMES && is.character(X) && is.null(names(answer)))
>         names(answer) <- X
>     if (simplify && length(answer))
>         simplify2array(answer, higher = (simplify == "array"))
>     else answer
> }
>
>
>> microbenchmark(sapply(myList, length), times = 1L)
> Unit: microseconds
>                    expr    min     lq     mean median     uq    max neval
>  sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46     1
>> microbenchmark(mySapply(myList, length), times = 1L)
> Unit: microseconds
>                      expr    min     lq     mean median     uq      max neval
>  mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804     1
>
> My benchmark timings show a timing improvement with only that small change 
> made and it is seemingly nominal. In my actual work, the sapply function is 
> called millions of times and this additional overhead propagates to some 
> overall additional computing time.
>
> I have done some limited testing on various real data to verify that the 
> objects produced under both variants of the sapply (base R and my modified) 
> yield identical objects when simplify is both TRUE and FALSE.
>
> Perhaps someone else sees a counterexample where my proposed fix does not 
> cause for sapply to behave as expected.
>

Check out ?sapply for possible values of `simplify=` to see why your proposal 
is not adequate.

For your example, lengths() is an order of magnitude faster than sapply(., 
length). This is a exa

Re: [R] Possible Improvement to sapply

2018-03-13 Thread Doran, Harold
You’re right, it sure does. My suggestion causes it to fail when simplify = 
‘array’
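
The failure is easy to reproduce in isolation: the proposed `if (simplify && length(answer))` hands a character value to `&&`, which accepts only logical (or numeric) scalar operands:

```r
## Why `if (simplify && length(answer))` breaks simplify = "array":
## `&&` errors on a character operand instead of treating it as truthy.
res <- try("array" && TRUE, silent = TRUE)
inherits(res, "try-error")   # TRUE -- "invalid 'x' type in 'x && y'"
```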


From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Tuesday, March 13, 2018 12:11 PM
To: Doran, Harold <hdo...@air.org>
Cc: r-help@r-project.org
Subject: Re: [R] Possible Improvement to sapply

Wouldn't that change how simplify='array' is handled?

>  str(sapply(1:3, function(x)diag(x,5,2), simplify="array"))
 int [1:5, 1:2, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>  str(sapply(1:3, function(x)diag(x,5,2), simplify=TRUE))
 int [1:10, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>  str(sapply(1:3, function(x)diag(x,5,2), simplify=FALSE))
List of 3
 $ : int [1:5, 1:2] 1 0 0 0 0 0 1 0 0 0
 $ : int [1:5, 1:2] 2 0 0 0 0 0 2 0 0 0
 $ : int [1:5, 1:2] 3 0 0 0 0 0 3 0 0 0


Bill Dunlap
TIBCO Software
wdunlap tibco.com<http://tibco.com>

On Tue, Mar 13, 2018 at 6:23 AM, Doran, Harold 
<hdo...@air.org<mailto:hdo...@air.org>> wrote:
While working with sapply, the documentation states that the simplify argument 
will yield a vector, matrix etc "when possible". I was curious how the code 
actually defined "as possible" and see this within the function

if (!identical(simplify, FALSE) && length(answer))

This seems superfluous to me, in particular this part:

!identical(simplify, FALSE)

The preceding code could be reduced to

if (simplify && length(answer))

and it would not need to execute the call to identical in order to trigger the 
conditional execution, which is known from the user's simplify = TRUE or FALSE 
inputs. I *think* the extra call to identical is just unnecessary overhead in 
this instance.

Take for example, the following toy example code and benchmark results and a 
small modification to sapply:

myList <- list(a = rnorm(100), b = rnorm(100))

answer <- lapply(X = myList, FUN = length)
simplify = TRUE

library(microbenchmark)

mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
    FUN <- match.fun(FUN)
    answer <- lapply(X = X, FUN = FUN, ...)
    if (USE.NAMES && is.character(X) && is.null(names(answer)))
        names(answer) <- X
    if (simplify && length(answer))
        simplify2array(answer, higher = (simplify == "array"))
    else answer
}


> microbenchmark(sapply(myList, length), times = 1L)
Unit: microseconds
                   expr    min     lq     mean median     uq    max neval
 sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46     1
> microbenchmark(mySapply(myList, length), times = 1L)
Unit: microseconds
                     expr    min     lq     mean median     uq      max neval
 mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804     1

My benchmark timings show a timing improvement with only that small change made 
and it is seemingly nominal. In my actual work, the sapply function is called 
millions of times and this additional overhead propagates to some overall 
additional computing time.

I have done some limited testing on various real data to verify that the 
objects produced under both variants of the sapply (base R and my modified) 
yield identical objects when simplify is both TRUE and FALSE.

Perhaps someone else sees a counterexample where my proposed fix does not cause 
for sapply to behave as expected.

Harold

__
R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To 
UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




Re: [R] Possible Improvement to sapply

2018-03-13 Thread William Dunlap via R-help
Wouldn't that change how simplify='array' is handled?

>  str(sapply(1:3, function(x)diag(x,5,2), simplify="array"))
 int [1:5, 1:2, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>  str(sapply(1:3, function(x)diag(x,5,2), simplify=TRUE))
 int [1:10, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>  str(sapply(1:3, function(x)diag(x,5,2), simplify=FALSE))
List of 3
 $ : int [1:5, 1:2] 1 0 0 0 0 0 1 0 0 0
 $ : int [1:5, 1:2] 2 0 0 0 0 0 2 0 0 0
 $ : int [1:5, 1:2] 3 0 0 0 0 0 3 0 0 0


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Mar 13, 2018 at 6:23 AM, Doran, Harold  wrote:

> While working with sapply, the documentation states that the simplify
> argument will yield a vector, matrix etc "when possible". I was curious how
> the code actually defined "as possible" and see this within the function
>
> if (!identical(simplify, FALSE) && length(answer))
>
> This seems superfluous to me, in particular this part:
>
> !identical(simplify, FALSE)
>
> The preceding code could be reduced to
>
> if (simplify && length(answer))
>
> and it would not need to execute the call to identical in order to trigger
> the conditional execution, which is known from the user's simplify = TRUE
> or FALSE inputs. I *think* the extra call to identical is just unnecessary
> overhead in this instance.
>
> Take for example, the following toy example code and benchmark results and
> a small modification to sapply:
>
> myList <- list(a = rnorm(100), b = rnorm(100))
>
> answer <- lapply(X = myList, FUN = length)
> simplify = TRUE
>
> library(microbenchmark)
>
> mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
>     FUN <- match.fun(FUN)
>     answer <- lapply(X = X, FUN = FUN, ...)
>     if (USE.NAMES && is.character(X) && is.null(names(answer)))
>         names(answer) <- X
>     if (simplify && length(answer))
>         simplify2array(answer, higher = (simplify == "array"))
>     else answer
> }
>
>
> > microbenchmark(sapply(myList, length), times = 1L)
> Unit: microseconds
>                    expr    min     lq     mean median     uq    max neval
>  sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46     1
> > microbenchmark(mySapply(myList, length), times = 1L)
> Unit: microseconds
>                      expr    min     lq     mean median     uq      max neval
>  mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804     1
>
> My benchmark timings show a timing improvement with only that small change
> made and it is seemingly nominal. In my actual work, the sapply function is
> called millions of times and this additional overhead propagates to some
> overall additional computing time.
>
> I have done some limited testing on various real data to verify that the
> objects produced under both variants of sapply (base R and my modified
> version) yield identical objects when simplify is both TRUE and FALSE.
>
> Perhaps someone else sees a counterexample where my proposed fix does not
> cause sapply to behave as expected.
>
> Harold
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>




Re: [R] Possible Improvement to sapply

2018-03-13 Thread Doran, Harold
Martin

In terms of context for the actual problem, sapply is called millions of times
because the work involves scoring individual students who took a test. A score
is generated for student A, then for student B, and so on, and there are
millions of students. The psychometric process of scoring students is complex,
and our code makes use of sapply many times for each student.

The toy example used length just to illustrate; our actual code doesn't do
that. But your point is well taken: there may be a very good counterexample
showing why my proposal doesn't achieve the goal in a generalizable way.



-Original Message-
From: Martin Morgan [mailto:martin.mor...@roswellpark.org] 
Sent: Tuesday, March 13, 2018 9:43 AM
To: Doran, Harold <hdo...@air.org>; 'r-help@r-project.org' 
<r-help@r-project.org>
Subject: Re: [R] Possible Improvement to sapply



On 03/13/2018 09:23 AM, Doran, Harold wrote:
> While working with sapply, the documentation states that the simplify 
> argument will yield a vector, matrix etc "when possible". I was 
> curious how the code actually defined "as possible" and see this 
> within the function
> 
> if (!identical(simplify, FALSE) && length(answer))
> 
> This seems superfluous to me, in particular this part:
> 
> !identical(simplify, FALSE)
> 
> The preceding code could be reduced to
> 
> if (simplify && length(answer))
> 
> and it would not need to execute the call to identical in order to trigger 
> the conditional execution, which is known from the user's simplify = TRUE or 
> FALSE inputs. I *think* the extra call to identical is just unnecessary 
> overhead in this instance.
> 
> Take for example, the following toy example code and benchmark results and a 
> small modification to sapply:
> 
> myList <- list(a = rnorm(100), b = rnorm(100))
> 
> answer <- lapply(X = myList, FUN = length)
> simplify = TRUE
> 
> library(microbenchmark)
> 
> mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
>     FUN <- match.fun(FUN)
>     answer <- lapply(X = X, FUN = FUN, ...)
>     if (USE.NAMES && is.character(X) && is.null(names(answer)))
>         names(answer) <- X
>     if (simplify && length(answer))
>         simplify2array(answer, higher = (simplify == "array"))
>     else answer
> }
> 
> 
>> microbenchmark(sapply(myList, length), times = 1L)
> Unit: microseconds
>                    expr    min     lq     mean median     uq    max neval
>  sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46     1
>> microbenchmark(mySapply(myList, length), times = 1L)
> Unit: microseconds
>                      expr    min     lq     mean median     uq      max neval
>  mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804     1
> 
> My benchmark timings show a timing improvement with only that small change 
> made and it is seemingly nominal. In my actual work, the sapply function is 
> called millions of times and this additional overhead propagates to some 
> overall additional computing time.
> 
> I have done some limited testing on various real data to verify that the
> objects produced under both variants of sapply (base R and my modified
> version) yield identical objects when simplify is both TRUE and FALSE.
> 
> Perhaps someone else sees a counterexample where my proposed fix does not
> cause sapply to behave as expected.
> 

Check out ?sapply for possible values of `simplify=` to see why your proposal 
is not adequate.

For your example, lengths() is an order of magnitude faster than sapply(.,
length). This is an example of the advantage of vectorization (a single call to
an R function implemented in C) versus iteration (`for` loops, but also the
*apply family calling an R function many times). vapply() might also be
relevant.
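
A small sketch (an editor's addition, not code from the thread) of the alternatives mentioned here; the object names are illustrative:

```r
## Editor's sketch: lengths() and vapply() agree with sapply(., length)
## while avoiding an R-level function call per element (lengths) or
## adding a type guarantee on the result (vapply).
myList <- list(a = rnorm(100), b = rnorm(50))

r_sapply  <- sapply(myList, length)               # one R call per element
r_lengths <- lengths(myList)                      # single vectorised call
r_vapply  <- vapply(myList, length, integer(1L))  # iterated, type-checked

identical(r_sapply, r_lengths)  # TRUE
identical(r_sapply, r_vapply)   # TRUE
```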

Often performance improvements come from looking one layer up from where the 
problem occurs and re-thinking the algorithm. Why would one need to call 
sapply() millions of times, in a situation where this becomes rate-limiting? 
Can the algorithm be re-implemented to avoid this step?
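
As a hedged illustration of this advice (the data and names are entirely hypothetical, not from the thread): when a per-student loop makes one summary call on every iteration, restructuring the data often lets a single vectorised call replace the loop.

```r
## Editor's sketch of "looking one layer up": replace a per-column loop
## with one vectorised call over the whole matrix.
set.seed(1)
scores <- matrix(rnorm(1e4), nrow = 100)   # 100 items x 100 students (made up)

## per-student iteration: one call per column
by_loop <- numeric(ncol(scores))
for (j in seq_len(ncol(scores))) by_loop[j] <- mean(scores[, j])

## one vectorised call scores every student at once
by_colmeans <- colMeans(scores)

all.equal(by_loop, by_colmeans)  # TRUE
```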

Martin Morgan

> Harold
> 
> 




