Re: [Rd] na.omit inconsistent with is.na on list
Hi Toby, Right, my point is that is.na being equivalent to "is an incomplete case" is really only true for atomic vectors. I don't see it being the case for lists, given what is.na does for lists. This is all just my opinion, but that's my take: vec[!is.na(vec)] happens to be the same as na.omit(vec) for atomics, but in general the operations are not equivalent and I wouldn't expect them to be. Best, ~G On Mon, Aug 16, 2021 at 10:54 AM Toby Hocking wrote: > To clarify, ?is.na docs say that 'na.omit' returns the object with > incomplete cases removed. > If we take is.na to be the definition of "incomplete cases" then a list > element with scalar NA is incomplete. > About the data.frame method, in my opinion it is highly > confusing/inconsistent for na.omit to keep rows with incomplete cases in > list columns, but not in columns which are atomic vectors, > > > (f.num <- data.frame(num=c(1,NA,2))) > num > 1 1 > 2 NA > 3 2 > > is.na(f.num) >num > [1,] FALSE > [2,] TRUE > [3,] FALSE > > na.omit(f.num) > num > 1 1 > 3 2 > > > (f.list <- data.frame(list=I(list(1,NA,2 > list > 11 > 2 NA > 32 > > is.na(f.list) > list > [1,] FALSE > [2,] TRUE > [3,] FALSE > > na.omit(f.list) > list > 11 > 2 NA > 32 > > On Sat, Aug 14, 2021 at 5:15 PM Gabriel Becker > wrote: > > > I understand what is.na does, the issue I have is that its task is not > > equivalent to the conceptual task na.omit is doing, in my opinion, as > > illustrated by what the data.frame method does. > > > > Thus what i was getting at above about it not being clear that lst[is.na > (lst)] > > being the correct thing for na.omit to do > > > > ~G > > > > ~G > > > > On Sat, Aug 14, 2021, 1:49 PM Toby Hocking wrote: > > > >> Some relevant information from ?is.na: the behavior for lists is > >> documented, > >> > >> For is.na, elementwise the result is false unless that element > >> is a length-one atomic vector and the single element of that > >> vector is regarded as NA or NaN (note that any is.na method > >> for the class of the element is ignored). > >> > >> Also there are other functions anyNA and is.na<- which are consistent > >> with > >> is.na. That is, anyNA only returns TRUE if the list has an element > which > >> is > >> a scalar NA. And is.na<- sets list elements to logical NA to indicate > >> missingness. > >> > >> On Fri, Aug 13, 2021 at 1:10 AM Hugh Parsonage < > hugh.parson...@gmail.com> > >> wrote: > >> > >> > The data.frame method deliberately skips non-atomic columns before > >> > invoking is.na(x) so I think it is fair to assume this behaviour is > >> > intentional and assumed. > >> > > >> > Not so clear to me that there is a sensible answer for list columns. > >> > (List columns seem to collide with the expectation that in each > >> > variable every observation will be of the same type) > >> > > >> > Consider your list L as > >> > > >> > L <- list(NULL, NA, c(NA, NA)) > >> > > >> > Seems like every observation could have a claim to be 'missing' here. > >> > Concretely, if a data.frame had a list column representing the lat-lon > >> > of an observation, we might only be able to represent missing values > >> > like c(NA, NA). > >> > > >> > On Fri, 13 Aug 2021 at 17:27, Iñaki Ucar > >> wrote: > >> > > > >> > > On Thu, 12 Aug 2021 at 22:20, Gabriel Becker > > >> > wrote: > >> > > > > >> > > > Hi Toby, > >> > > > > >> > > > This definitely appears intentional, the first expression of > >> > > > stats:::na.omit.default is > >> > > > > >> > > >if (!is.atomic(object)) > >> > > > > >> > > > return(object) > >> > > > >> > > I don't follow your point. This only means that the *default* method > >> > > is not intended for non-atomic cases, but it doesn't mean it > shouldn't > >> > > exist a method for lists. > >> > > > >> > > > So it is explicitly just returning the object in non-atomic cases, > >> > which > >> > > > includes lists. I was not involved in this decision (obviously) > but > >> my > >> > > > guess is that it is due to the fact that what constitutes an > >> > observation > >> > > > "being complete" in unclear in the list case. What should > >> > > > > >> > > > na.omit(list(5, NA, c(NA, 5))) > >> > > > > >> > > > return? Just the first element, or the first and the last? It > >> seems, at > >> > > > least to me, unclear. A small change to the documentation to to > add > >> > "atomic > >> > > > >> > > > is.na(list(5, NA, c(NA, 5))) > >> > > [1] FALSE TRUE FALSE > >> > > > >> > > Following Toby's argument, it's clear to me: the first and the last. > >> > > > >> > > Iñaki > >> > > > >> > > > (in the sense of is.atomic returning \code{TRUE})" in front of > >> > "vectors" > >> > > > or similar where what types of objects are supported seems > >> justified, > >> > > > though, imho, as the current documentation is either ambiguous or > >> > > > technically incorrect, depending on what we take "vector" to mean. > >> > > > > >> > > > Best, > >> > > > ~G
Re: [Rd] na.omit inconsistent with is.na on list
To clarify, ?is.na docs say that 'na.omit' returns the object with incomplete cases removed. If we take is.na to be the definition of "incomplete cases" then a list element with scalar NA is incomplete. About the data.frame method, in my opinion it is highly confusing/inconsistent for na.omit to keep rows with incomplete cases in list columns, but not in columns which are atomic vectors, > (f.num <- data.frame(num=c(1,NA,2))) num 1 1 2 NA 3 2 > is.na(f.num) num [1,] FALSE [2,] TRUE [3,] FALSE > na.omit(f.num) num 1 1 3 2 > (f.list <- data.frame(list=I(list(1,NA,2 list 11 2 NA 32 > is.na(f.list) list [1,] FALSE [2,] TRUE [3,] FALSE > na.omit(f.list) list 11 2 NA 32 On Sat, Aug 14, 2021 at 5:15 PM Gabriel Becker wrote: > I understand what is.na does, the issue I have is that its task is not > equivalent to the conceptual task na.omit is doing, in my opinion, as > illustrated by what the data.frame method does. > > Thus what i was getting at above about it not being clear that lst[is.na(lst)] > being the correct thing for na.omit to do > > ~G > > ~G > > On Sat, Aug 14, 2021, 1:49 PM Toby Hocking wrote: > >> Some relevant information from ?is.na: the behavior for lists is >> documented, >> >> For is.na, elementwise the result is false unless that element >> is a length-one atomic vector and the single element of that >> vector is regarded as NA or NaN (note that any is.na method >> for the class of the element is ignored). >> >> Also there are other functions anyNA and is.na<- which are consistent >> with >> is.na. That is, anyNA only returns TRUE if the list has an element which >> is >> a scalar NA. And is.na<- sets list elements to logical NA to indicate >> missingness. >> >> On Fri, Aug 13, 2021 at 1:10 AM Hugh Parsonage >> wrote: >> >> > The data.frame method deliberately skips non-atomic columns before >> > invoking is.na(x) so I think it is fair to assume this behaviour is >> > intentional and assumed. >> > >> > Not so clear to me that there is a sensible answer for list columns. >> > (List columns seem to collide with the expectation that in each >> > variable every observation will be of the same type) >> > >> > Consider your list L as >> > >> > L <- list(NULL, NA, c(NA, NA)) >> > >> > Seems like every observation could have a claim to be 'missing' here. >> > Concretely, if a data.frame had a list column representing the lat-lon >> > of an observation, we might only be able to represent missing values >> > like c(NA, NA). >> > >> > On Fri, 13 Aug 2021 at 17:27, Iñaki Ucar >> wrote: >> > > >> > > On Thu, 12 Aug 2021 at 22:20, Gabriel Becker >> > wrote: >> > > > >> > > > Hi Toby, >> > > > >> > > > This definitely appears intentional, the first expression of >> > > > stats:::na.omit.default is >> > > > >> > > >if (!is.atomic(object)) >> > > > >> > > > return(object) >> > > >> > > I don't follow your point. This only means that the *default* method >> > > is not intended for non-atomic cases, but it doesn't mean it shouldn't >> > > exist a method for lists. >> > > >> > > > So it is explicitly just returning the object in non-atomic cases, >> > which >> > > > includes lists. I was not involved in this decision (obviously) but >> my >> > > > guess is that it is due to the fact that what constitutes an >> > observation >> > > > "being complete" in unclear in the list case. What should >> > > > >> > > > na.omit(list(5, NA, c(NA, 5))) >> > > > >> > > > return? Just the first element, or the first and the last? It >> seems, at >> > > > least to me, unclear. A small change to the documentation to to add >> > "atomic >> > > >> > > > is.na(list(5, NA, c(NA, 5))) >> > > [1] FALSE TRUE FALSE >> > > >> > > Following Toby's argument, it's clear to me: the first and the last. >> > > >> > > Iñaki >> > > >> > > > (in the sense of is.atomic returning \code{TRUE})" in front of >> > "vectors" >> > > > or similar where what types of objects are supported seems >> justified, >> > > > though, imho, as the current documentation is either ambiguous or >> > > > technically incorrect, depending on what we take "vector" to mean. >> > > > >> > > > Best, >> > > > ~G >> > > > >> > > > On Wed, Aug 11, 2021 at 10:16 PM Toby Hocking >> > wrote: >> > > > >> > > > > Also, the na.omit method for data.frame with list column seems to >> be >> > > > > inconsistent with is.na, >> > > > > >> > > > > > L <- list(NULL, NA, 0) >> > > > > > str(f <- data.frame(I(L))) >> > > > > 'data.frame': 3 obs. of 1 variable: >> > > > > $ L:List of 3 >> > > > > ..$ : NULL >> > > > > ..$ : logi NA >> > > > > ..$ : num 0 >> > > > > ..- attr(*, "class")= chr "AsIs" >> > > > > > is.na(f) >> > > > > L >> > > > > [1,] FALSE >> > > > > [2,] TRUE >> > > > > [3,] FALSE >> > > > > > na.omit(f) >> > > > >L >> > > > > 1 >> > > > > 2 NA >> > > > > 3 0 >> > > > > >> > > > > On Wed, Aug 11, 2021 at 9:58 PM Toby Hocking >> > wrote: >> > > > > >> > > > >
Re: [Rd] svd for Large Matrix
If you’re crossproding X by itself, I think passing symmetric = TRUE to eigen will eke out more speed. Avi On Mon, Aug 16, 2021 at 6:30 PM Radford Neal wrote: > > Dario Strbenac writes: > > > > I have a real scenario involving 45 million biological cells > > (samples) and 60 proteins (variables) which leads to a segmentation > > fault for svd. I thought this might be a good example of why it > > might benefit from a long vector upgrade. > > Rather than the full SVD of a 4500x60 X, my guess is that you > may really only be interested in the eigenvalues and eigenvectors of > X^T X, in which case eigen(t(X)%*%X) would probably be much faster. > (And eigen(crossprod(X)) would be even faster.) > > Note that if you instead want the eigenvalues and eigenvectors of > X X^T (which is an enormous matrix), the eigenvalues of this are the > same as those of X^T X, and the eigenvectors are Xv, where v is an > eigenvector of X^T X. > > For example, with R 4.0.2, and the reference BLAS/LAPACK, I get > > > X<-matrix(rnorm(10),1,10) > > system.time(for(i in 1:1000) rs<-svd(X)) > user system elapsed > 2.393 0.008 2.403 > > system.time(for(i in 1:1000) re<-eigen(crossprod(X))) > user system elapsed > 0.609 0.000 0.609 > > rs$d^2 >[1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566 > 9931.538 >[8] 9813.841 9703.818 9598.532 > > re$values >[1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566 > 9931.538 >[8] 9813.841 9703.818 9598.532 > > Possibly some other LAPACK might implement svd better, though I > suspect that R will allocate more big matrices than really necessary > for the svd even aside from whatever LAPACK is doing. > > Regards, > >Radford Neal > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > -- Sent from Gmail Mobile [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] svd for Large Matrix
> Dario Strbenac writes: > > I have a real scenario involving 45 million biological cells > (samples) and 60 proteins (variables) which leads to a segmentation > fault for svd. I thought this might be a good example of why it > might benefit from a long vector upgrade. Rather than the full SVD of a 4500x60 X, my guess is that you may really only be interested in the eigenvalues and eigenvectors of X^T X, in which case eigen(t(X)%*%X) would probably be much faster. (And eigen(crossprod(X)) would be even faster.) Note that if you instead want the eigenvalues and eigenvectors of X X^T (which is an enormous matrix), the eigenvalues of this are the same as those of X^T X, and the eigenvectors are Xv, where v is an eigenvector of X^T X. For example, with R 4.0.2, and the reference BLAS/LAPACK, I get > X<-matrix(rnorm(10),1,10) > system.time(for(i in 1:1000) rs<-svd(X)) user system elapsed 2.393 0.008 2.403 > system.time(for(i in 1:1000) re<-eigen(crossprod(X))) user system elapsed 0.609 0.000 0.609 > rs$d^2 [1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566 9931.538 [8] 9813.841 9703.818 9598.532 > re$values [1] 10568.003 10431.864 10318.959 10219.961 10138.025 10068.566 9931.538 [8] 9813.841 9703.818 9598.532 Possibly some other LAPACK might implement svd better, though I suspect that R will allocate more big matrices than really necessary for the svd even aside from whatever LAPACK is doing. Regards, Radford Neal __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Force quitting a FORK cluster node on macOS and Solaris wreaks havoc
Thank you Simon, this is helpful. I take this is specific to quit(), so it's a poor choice for emulating crashed parallel workers, and Sys.kill() is much better for that. I was focusing on that odd extra execution/output, but as you say, there are lots of other things that is done by quit() here, e.g. regardless of platform quit() damages the main R process too: > f <- parallel::mcparallel(quit("no")) > v <- parallel::mccollect(f) Warning message: In parallel::mccollect(f) : 1 parallel job did not deliver a result > file.exists(tempdir()) [1] FALSE Would it be sufficient to make quit() fork safe by, conceptually, doing something like: quit <- function(save = "default", status = 0, runLast = TRUE) { if (parallel:::isChild()) stop("quit() must not be called in a forked process") .Internal(quit(save, status, runLast)) } This would protect against calling quit() in forked code by mistake, e.g. when someone parallelize over code/scripts they don't have full control over and the ones who write those scripts might not be aware that they may be used in forks. Thanks, Henrik __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel