Re: [Rd] na.omit inconsistent with is.na on list

2021-08-16 Thread Gabriel Becker
Hi Toby,

Right, my point is that is.na being equivalent to "is an incomplete case"
is really only true for atomic vectors. I don't see it being the case for
lists, given what is.na does for lists. This is all just  my opinion, but
that's my take: vec[!is.na(vec)] happens to be the same as na.omit(vec) for
atomics, but in general the operations are not equivalent and I wouldn't
expect them to be.

Best,
~G

On Mon, Aug 16, 2021 at 10:54 AM Toby Hocking  wrote:

> To clarify, ?is.na docs say that 'na.omit' returns the object with
> incomplete cases removed.
> If we take is.na to be the definition of "incomplete cases" then a list
> element with scalar NA is incomplete.
> About the data.frame method, in my opinion it is highly
> confusing/inconsistent for na.omit to keep rows with incomplete cases in
> list columns, but not in columns which are atomic vectors,
>
> > (f.num <- data.frame(num=c(1,NA,2)))
>   num
> 1   1
> 2  NA
> 3   2
> > is.na(f.num)
>num
> [1,] FALSE
> [2,]  TRUE
> [3,] FALSE
> > na.omit(f.num)
>   num
> 1   1
> 3   2
>
> > (f.list <- data.frame(list=I(list(1,NA,2
>   list
> 11
> 2   NA
> 32
> > is.na(f.list)
>   list
> [1,] FALSE
> [2,]  TRUE
> [3,] FALSE
> > na.omit(f.list)
>   list
> 11
> 2   NA
> 32
>
> On Sat, Aug 14, 2021 at 5:15 PM Gabriel Becker 
> wrote:
>
> > I understand what is.na does, the issue I have is that its task is not
> > equivalent to the conceptual task na.omit is doing, in my opinion, as
> > illustrated by what the data.frame method does.
> >
> > Thus what i was getting at above about it not being clear that lst[is.na
> (lst)]
> > being the correct thing for na.omit to do
> >
> > ~G
> >
> > ~G
> >
> > On Sat, Aug 14, 2021, 1:49 PM Toby Hocking  wrote:
> >
> >> Some relevant information from ?is.na: the behavior for lists is
> >> documented,
> >>
> >>  For is.na, elementwise the result is false unless that element
> >>  is a length-one atomic vector and the single element of that
> >>  vector is regarded as NA or NaN (note that any is.na method
> >>  for the class of the element is ignored).
> >>
> >> Also there are other functions anyNA and is.na<- which are consistent
> >> with
> >> is.na. That is, anyNA only returns TRUE if the list has an element
> which
> >> is
> >> a scalar NA. And is.na<- sets list elements to logical NA to indicate
> >> missingness.
> >>
> >> On Fri, Aug 13, 2021 at 1:10 AM Hugh Parsonage <
> hugh.parson...@gmail.com>
> >> wrote:
> >>
> >> > The data.frame method deliberately skips non-atomic columns before
> >> > invoking is.na(x) so I think it is fair to assume this behaviour is
> >> > intentional and assumed.
> >> >
> >> > Not so clear to me that there is a sensible answer for list columns.
> >> > (List columns seem to collide with the expectation that in each
> >> > variable every observation will be of the same type)
> >> >
> >> > Consider your list L as
> >> >
> >> > L <- list(NULL, NA, c(NA, NA))
> >> >
> >> > Seems like every observation could have a claim to be 'missing' here.
> >> > Concretely, if a data.frame had a list column representing the lat-lon
> >> > of an observation, we might only be able to represent missing values
> >> > like c(NA, NA).
> >> >
> >> > On Fri, 13 Aug 2021 at 17:27, Iñaki Ucar 
> >> wrote:
> >> > >
> >> > > On Thu, 12 Aug 2021 at 22:20, Gabriel Becker  >
> >> > wrote:
> >> > > >
> >> > > > Hi Toby,
> >> > > >
> >> > > > This definitely appears intentional, the first  expression of
> >> > > > stats:::na.omit.default is
> >> > > >
> >> > > >if (!is.atomic(object))
> >> > > >
> >> > > > return(object)
> >> > >
> >> > > I don't follow your point. This only means that the *default* method
> >> > > is not intended for non-atomic cases, but it doesn't mean it
> shouldn't
> >> > > exist a method for lists.
> >> > >
> >> > > > So it is explicitly just returning the object in non-atomic cases,
> >> > which
> >> > > > includes lists. I was not involved in this decision (obviously)
> but
> >> my
> >> > > > guess is that it is due to the fact that what constitutes an
> >> > observation
> >> > > > "being complete" in unclear in the list case. What should
> >> > > >
> >> > > > na.omit(list(5, NA, c(NA, 5)))
> >> > > >
> >> > > > return? Just the first element, or the first and the last? It
> >> seems, at
> >> > > > least to me, unclear. A small change to the documentation to to
> add
> >> > "atomic
> >> > >
> >> > > > is.na(list(5, NA, c(NA, 5)))
> >> > > [1] FALSE  TRUE FALSE
> >> > >
> >> > > Following Toby's argument, it's clear to me: the first and the last.
> >> > >
> >> > > Iñaki
> >> > >
> >> > > > (in the sense of is.atomic returning \code{TRUE})" in front of
> >> > "vectors"
> >> > > > or similar  where what types of objects are supported seems
> >> justified,
> >> > > > though, imho, as the current documentation is either ambiguous or
> >> > > > technically incorrect, depending on what we take "vector" to mean.
> >> > > >
> >> > > > Best,
> >> > > > ~G

Re: [Rd] na.omit inconsistent with is.na on list

2021-08-16 Thread Toby Hocking
To clarify, ?is.na docs say that 'na.omit' returns the object with
incomplete cases removed.
If we take is.na to be the definition of "incomplete cases" then a list
element with scalar NA is incomplete.
About the data.frame method, in my opinion it is highly
confusing/inconsistent for na.omit to keep rows with incomplete cases in
list columns, but not in columns which are atomic vectors,

> (f.num <- data.frame(num=c(1,NA,2)))
  num
1   1
2  NA
3   2
> is.na(f.num)
   num
[1,] FALSE
[2,]  TRUE
[3,] FALSE
> na.omit(f.num)
  num
1   1
3   2

> (f.list <- data.frame(list=I(list(1,NA,2
  list
11
2   NA
32
> is.na(f.list)
  list
[1,] FALSE
[2,]  TRUE
[3,] FALSE
> na.omit(f.list)
  list
11
2   NA
32

On Sat, Aug 14, 2021 at 5:15 PM Gabriel Becker 
wrote:

> I understand what is.na does, the issue I have is that its task is not
> equivalent to the conceptual task na.omit is doing, in my opinion, as
> illustrated by what the data.frame method does.
>
> Thus what i was getting at above about it not being clear that lst[is.na(lst)]
> being the correct thing for na.omit to do
>
> ~G
>
> ~G
>
> On Sat, Aug 14, 2021, 1:49 PM Toby Hocking  wrote:
>
>> Some relevant information from ?is.na: the behavior for lists is
>> documented,
>>
>>  For is.na, elementwise the result is false unless that element
>>  is a length-one atomic vector and the single element of that
>>  vector is regarded as NA or NaN (note that any is.na method
>>  for the class of the element is ignored).
>>
>> Also there are other functions anyNA and is.na<- which are consistent
>> with
>> is.na. That is, anyNA only returns TRUE if the list has an element which
>> is
>> a scalar NA. And is.na<- sets list elements to logical NA to indicate
>> missingness.
>>
>> On Fri, Aug 13, 2021 at 1:10 AM Hugh Parsonage 
>> wrote:
>>
>> > The data.frame method deliberately skips non-atomic columns before
>> > invoking is.na(x) so I think it is fair to assume this behaviour is
>> > intentional and assumed.
>> >
>> > Not so clear to me that there is a sensible answer for list columns.
>> > (List columns seem to collide with the expectation that in each
>> > variable every observation will be of the same type)
>> >
>> > Consider your list L as
>> >
>> > L <- list(NULL, NA, c(NA, NA))
>> >
>> > Seems like every observation could have a claim to be 'missing' here.
>> > Concretely, if a data.frame had a list column representing the lat-lon
>> > of an observation, we might only be able to represent missing values
>> > like c(NA, NA).
>> >
>> > On Fri, 13 Aug 2021 at 17:27, Iñaki Ucar 
>> wrote:
>> > >
>> > > On Thu, 12 Aug 2021 at 22:20, Gabriel Becker 
>> > wrote:
>> > > >
>> > > > Hi Toby,
>> > > >
>> > > > This definitely appears intentional, the first  expression of
>> > > > stats:::na.omit.default is
>> > > >
>> > > >if (!is.atomic(object))
>> > > >
>> > > > return(object)
>> > >
>> > > I don't follow your point. This only means that the *default* method
>> > > is not intended for non-atomic cases, but it doesn't mean it shouldn't
>> > > exist a method for lists.
>> > >
>> > > > So it is explicitly just returning the object in non-atomic cases,
>> > which
>> > > > includes lists. I was not involved in this decision (obviously) but
>> my
>> > > > guess is that it is due to the fact that what constitutes an
>> > observation
>> > > > "being complete" in unclear in the list case. What should
>> > > >
>> > > > na.omit(list(5, NA, c(NA, 5)))
>> > > >
>> > > > return? Just the first element, or the first and the last? It
>> seems, at
>> > > > least to me, unclear. A small change to the documentation to to add
>> > "atomic
>> > >
>> > > > is.na(list(5, NA, c(NA, 5)))
>> > > [1] FALSE  TRUE FALSE
>> > >
>> > > Following Toby's argument, it's clear to me: the first and the last.
>> > >
>> > > Iñaki
>> > >
>> > > > (in the sense of is.atomic returning \code{TRUE})" in front of
>> > "vectors"
>> > > > or similar  where what types of objects are supported seems
>> justified,
>> > > > though, imho, as the current documentation is either ambiguous or
>> > > > technically incorrect, depending on what we take "vector" to mean.
>> > > >
>> > > > Best,
>> > > > ~G
>> > > >
>> > > > On Wed, Aug 11, 2021 at 10:16 PM Toby Hocking 
>> > wrote:
>> > > >
>> > > > > Also, the na.omit method for data.frame with list column seems to
>> be
>> > > > > inconsistent with is.na,
>> > > > >
>> > > > > > L <- list(NULL, NA, 0)
>> > > > > > str(f <- data.frame(I(L)))
>> > > > > 'data.frame': 3 obs. of  1 variable:
>> > > > >  $ L:List of 3
>> > > > >   ..$ : NULL
>> > > > >   ..$ : logi NA
>> > > > >   ..$ : num 0
>> > > > >   ..- attr(*, "class")= chr "AsIs"
>> > > > > > is.na(f)
>> > > > >  L
>> > > > > [1,] FALSE
>> > > > > [2,]  TRUE
>> > > > > [3,] FALSE
>> > > > > > na.omit(f)
>> > > > >L
>> > > > > 1
>> > > > > 2 NA
>> > > > > 3  0
>> > > > >
>> > > > > On Wed, Aug 11, 2021 at 9:58 PM Toby Hocking 
>> > wrote:
>> > > > >
>> > > > >

Re: [Rd] svd for Large Matrix

2021-08-16 Thread Avraham Adler
If you’re crossproding X by itself, I think passing symmetric = TRUE to
eigen will eke out more speed.

Avi

On Mon, Aug 16, 2021 at 6:30 PM Radford Neal  wrote:

> > Dario Strbenac  writes:
> >
> > I have a real scenario involving 45 million biological cells
> > (samples) and 60 proteins (variables) which leads to a segmentation
> > fault for svd. I thought this might be a good example of why it
> > might benefit from a long vector upgrade.
>
> Rather than the full SVD of a 4500x60 X, my guess is that you
> may really only be interested in the eigenvalues and eigenvectors of
> X^T X, in which case eigen(t(X)%*%X) would probably be much faster.
> (And eigen(crossprod(X)) would be even faster.)
>
> Note that if you instead want the eigenvalues and eigenvectors of
> X X^T (which is an enormous matrix), the eigenvalues of this are the
> same as those of X^T X, and the eigenvectors are Xv, where v is an
> eigenvector of X^T X.
>
> For example, with R 4.0.2, and the reference BLAS/LAPACK, I get
>
>   > X<-matrix(rnorm(10),1,10)
>   > system.time(for(i in 1:1000) rs<-svd(X))
>  user  system elapsed
> 2.393   0.008   2.403
>   > system.time(for(i in 1:1000) re<-eigen(crossprod(X)))
>  user  system elapsed
> 0.609   0.000   0.609
>   > rs$d^2
>[1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566
> 9931.538
>[8]  9813.841  9703.818  9598.532
>   > re$values
>[1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566
> 9931.538
>[8]  9813.841  9703.818  9598.532
>
> Possibly some other LAPACK might implement svd better, though I
> suspect that R will allocate more big matrices than really necessary
> for the svd even aside from whatever LAPACK is doing.
>
> Regards,
>
>Radford Neal
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
-- 
Sent from Gmail Mobile

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] svd for Large Matrix

2021-08-16 Thread Radford Neal
> Dario Strbenac  writes:
>
> I have a real scenario involving 45 million biological cells
> (samples) and 60 proteins (variables) which leads to a segmentation
> fault for svd. I thought this might be a good example of why it
> might benefit from a long vector upgrade.

Rather than the full SVD of a 4500x60 X, my guess is that you
may really only be interested in the eigenvalues and eigenvectors of
X^T X, in which case eigen(t(X)%*%X) would probably be much faster.
(And eigen(crossprod(X)) would be even faster.)

Note that if you instead want the eigenvalues and eigenvectors of
X X^T (which is an enormous matrix), the eigenvalues of this are the
same as those of X^T X, and the eigenvectors are Xv, where v is an
eigenvector of X^T X.

For example, with R 4.0.2, and the reference BLAS/LAPACK, I get

  > X<-matrix(rnorm(10),1,10)
  > system.time(for(i in 1:1000) rs<-svd(X))
 user  system elapsed
2.393   0.008   2.403
  > system.time(for(i in 1:1000) re<-eigen(crossprod(X)))
 user  system elapsed
0.609   0.000   0.609
  > rs$d^2
   [1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566  9931.538
   [8]  9813.841  9703.818  9598.532
  > re$values
   [1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566  9931.538
   [8]  9813.841  9703.818  9598.532

Possibly some other LAPACK might implement svd better, though I
suspect that R will allocate more big matrices than really necessary
for the svd even aside from whatever LAPACK is doing.

Regards,

   Radford Neal

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Force quitting a FORK cluster node on macOS and Solaris wreaks havoc

2021-08-16 Thread Henrik Bengtsson
Thank you Simon, this is helpful.  I take this is specific to quit(),
so it's a poor choice for emulating crashed parallel workers, and
Sys.kill() is much better for that.

I was focusing on that odd extra execution/output, but as you say,
there are lots of other things that is done by quit() here, e.g.
regardless of platform quit() damages the main R process too:

> f <- parallel::mcparallel(quit("no"))
> v <- parallel::mccollect(f)
Warning message:
In parallel::mccollect(f) : 1 parallel job did not deliver a result
> file.exists(tempdir())
[1] FALSE


Would it be sufficient to make quit() fork safe by, conceptually,
doing something like:

quit <- function(save = "default", status = 0, runLast = TRUE) {
  if (parallel:::isChild())
  stop("quit() must not be called in a forked process")
  .Internal(quit(save, status, runLast))
}

This would protect against calling quit() in forked code by mistake,
e.g. when someone parallelize over code/scripts they don't have full
control over and the ones who write those scripts might not be aware
that they may be used in forks.

Thanks,

Henrik

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel