Re: [R] replacing all NA's in a dataframe with zeros...
On Wed, 2007-03-14 at 20:16 -0700, Steven McKinney wrote:
> Since you can index a matrix or dataframe with a matrix of logicals,
> you can use is.na() to index all the NA locations and replace them
> all with 0 in one command.

A quicker solution that, IIRC, was posted to the list by Peter Dalgaard several years ago is:

    sapply(mydata.df, function(x) {x[is.na(x)] <- 0; x})

Some timings on a larger problem with 100 columns:

    mydata.df <- as.data.frame(matrix(sample(c(as.numeric(NA), 1),
                               size = 1000*100, replace = TRUE), nrow = 1000))
    system.time(retval <- sapply(mydata.df, function(x) {x[is.na(x)] <- 0; x}))
    [1] 0.108 0.008 0.120 0.000 0.000
    system.time(mydata.df[is.na(mydata.df)] <- 0)
    [1] 2.460 0.028 2.498 0.000 0.000

And a larger problem still, with 1000 columns:

    mydata.df <- as.data.frame(matrix(sample(c(as.numeric(NA), 1),
                               size = 1000*1000, replace = TRUE), nrow = 1000))
    system.time(retval <- sapply(mydata.df, function(x) {x[is.na(x)] <- 0; x}))
    [1] 0.908 0.068 2.657 0.000 0.000
    system.time(mydata.df[is.na(mydata.df)] <- 0)
    [1] 43.127 0.332 46.440 0.000 0.000

Profiling mydata.df[is.na(mydata.df)] <- 0 shows that it spends most of this time subsetting the individual cells of the data frame in turn and setting the NA ones to 0.
HTH

G
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [t] +44 (0)20 7679 0522
ECRC                              [f] +44 (0)20 7679 0565
UCL Department of Geography
Pearson Building                  [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street
London, UK                        [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT                          [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
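The timing comparison in this message can be rerun with a short script. What follows is a minimal sketch rather than code from the thread: the fixed seed and the object names by_column, by_matrix, t_columns and t_matrix are assumptions made for illustration, and lapply() is used in place of sapply() so that the columns are not collapsed into a matrix. Absolute timings will depend on the machine and the R version, so none are shown.

```r
# Build a 1000 x 100 data frame whose cells are NA or 1, as in the message.
set.seed(1)  # assumption: fixed seed so the run is repeatable
mydata.df <- as.data.frame(matrix(sample(c(NA_real_, 1), size = 1000 * 100,
                                         replace = TRUE), nrow = 1000))

# Route 1: replace NAs column by column (result is a plain list of columns).
t_columns <- system.time(
  by_column <- lapply(mydata.df, function(x) { x[is.na(x)] <- 0; x })
)

# Route 2: index a copy of the whole data frame with a logical matrix.
by_matrix <- mydata.df
t_matrix <- system.time(by_matrix[is.na(by_matrix)] <- 0)

# Print both timings for comparison.
print(t_columns)
print(t_matrix)
```

Both routes should leave the same values behind; only the time taken to get there is expected to differ.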
Re: [R] replacing all NA's in a dataframe with zeros...
Gavin Simpson wrote:
> A quicker solution that, IIRC, was posted to the list by Peter
> Dalgaard several years ago is:
>
>     sapply(mydata.df, function(x) {x[is.na(x)] <- 0; x})

I hope your memory fails you, because it doesn't actually work.

    sapply(test.df, function(x) {x[is.na(x)] <- 0; x})
         x1 x2 x3
    [1,]  0  1  1
    [2,]  2  2  0
    [3,]  3  3  0
    [4,]  0  4  4

is a matrix, not a data frame. Instead:

    test.df[] <- lapply(test.df, function(x) {x[is.na(x)] <- 0; x})
    test.df
      x1 x2 x3
    1  0  1  1
    2  2  2  0
    3  3  3  0
    4  0  4  4

Speedwise, sapply() is doing lapply() internally, and the assignment overhead should be small, so I'd expect similar timings.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                     FAX: (+45) 35327907
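The difference between the two calls can be checked directly by looking at the class of each result. This is a small sketch using the test.df example that appears elsewhere in the thread; the helper name zero_na and the object names as_matrix and fixed.df are made up here for illustration.

```r
# Example data from the thread: three numeric columns with some NAs.
test.df <- data.frame(x1 = c(NA, 2, 3, NA), x2 = c(1, 2, 3, 4), x3 = c(1, NA, NA, 4))

# Hypothetical helper name: replace the NAs in one vector with 0.
zero_na <- function(x) { x[is.na(x)] <- 0; x }

# sapply() simplifies equal-length results, so a matrix comes back.
as_matrix <- sapply(test.df, zero_na)

# Assigning into fixed.df[] keeps the data frame and its attributes.
fixed.df <- test.df
fixed.df[] <- lapply(fixed.df, zero_na)

print(class(as_matrix))
print(class(fixed.df))
```

The empty brackets on the left-hand side are what keep the result a data frame: the list returned by lapply() is written into the existing columns instead of replacing the whole object.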
Re: [R] replacing all NA's in a dataframe with zeros...
On Thu, Mar 15, 2007 at 10:21:22AM +0100, Peter Dalgaard wrote:
> is a matrix, not a data frame. Instead:
>
>     test.df[] <- lapply(test.df, function(x) {x[is.na(x)] <- 0; x})
>
> Speedwise, sapply() is doing lapply() internally, and the assignment
> overhead should be small, so I'd expect similar timings.

just an idea: given the order of magnitude difference (factor 17 or so) in runtime between the obvious solution and the fast one: wouldn't it be possible/sensible to modify the corresponding subsetting method ([.data.frame) such that it recognizes the case when it is called with an arbitrary index matrix (the problem is not restricted to indexing with a logical matrix, I presume?) and switches internally to the fast solution given above?

in my (admittedly limited) experience, one of the not so nice properties of R is that one encounters exactly this situation quite often: unexpected massive differences in run time between different solutions (I'm not talking about the explicit loop penalty). what concerns me most are the very basic scenarios (not complex algorithms): data frames vs. matrices, naming vector components or not, subsetting, read.table vs. scan, etc. if there were a concise HOW TO list for the cases when speed matters, that would be helpful, too.

I understand that part of the uneven performance is unavoidable and that one must expect the user to go to the trouble of understanding the reasons, e.g. for differences between handling purely numerical data in either matrices or data frames. but a factor of 17 between the obvious approach and the wise one seems a trap into which 99% of people will step (probably never thinking that there might be a faster approach).

joerg
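The idea of routing a logical index matrix to column-wise assignment can be tried at user level without touching R's own methods. The sketch below is only that: assign_where is a hypothetical function name, it is not part of base R, and it covers only the case of a logical matrix with the same shape as the data frame. Whether it is faster than df[idx] <- value on a given R version would need to be timed.

```r
# Hypothetical helper, not part of base R: a column-wise stand-in for
# df[idx] <- value, where idx is a logical matrix shaped like df.
assign_where <- function(df, idx, value) {
  stopifnot(is.data.frame(df), is.logical(idx), identical(dim(idx), dim(df)))
  for (j in seq_len(ncol(df))) {
    hit <- idx[, j]            # rows of column j selected by the index matrix
    df[[j]][hit] <- value      # one vectorised assignment per column
  }
  df
}

# Example data from the thread.
test.df <- data.frame(x1 = c(NA, 2, 3, NA), x2 = c(1, 2, 3, 4), x3 = c(1, NA, NA, 4))
cleaned <- assign_where(test.df, is.na(test.df), 0)
print(cleaned)
```

The loop runs once per column rather than once per cell, which is the same trade the lapply() solution makes.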
Re: [R] replacing all NA's in a dataframe with zeros...
On Thu, 2007-03-15 at 10:21 +0100, Peter Dalgaard wrote:
> Gavin Simpson wrote:
> > A quicker solution that, IIRC, was posted to the list by Peter
> > Dalgaard several years ago is:
> >
> >     sapply(mydata.df, function(x) {x[is.na(x)] <- 0; x})
>
> I hope your memory fails you, because it doesn't actually work.

Ah, yes, apologies Peter. I have the sapply version embedded in a package function that I happened to be working on (where I wanted the result to be a matrix) and pasted directly from there and not my crib sheet of useful R-help snippets where I do have it as lapply(...). I'd forgotten I'd changed Peter's suggestion slightly in my function.

That'll teach me to reply before my morning cup of Earl Grey.

All the best,

G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                 [t] +44 (0)20 7679 0522
ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Re: [R] replacing all NA's in a dataframe with zeros...
Thanks, one and all... I knew it had to be simple.

On 3/14/07, Jason Barnhart [EMAIL PROTECTED] wrote:
> This should work.
>
>     test.df[is.na(test.df)] <- 1000

--
David L. Van Brunt, Ph.D.
mailto:[EMAIL PROTECTED]

"If Tyranny and Oppression come to this land, it will be in the guise of fighting a foreign enemy."
--James Madison
Re: [R] replacing all NA's in a dataframe with zeros...
This should work.

    test.df <- data.frame(x1=c(NA,2,3,NA), x2=c(1,2,3,4), x3=c(1,NA,NA,4))
    test.df
      x1 x2 x3
    1 NA  1  1
    2  2  2 NA
    3  3  3 NA
    4 NA  4  4
    test.df[is.na(test.df)] <- 1000
    test.df
        x1 x2   x3
    1 1000  1    1
    2    2  2 1000
    3    3  3 1000
    4 1000  4    4

The search string "cran r replace data.frame NA" in Google (as a US user) yielded some good results (5th and 7th entries), but there was another example that explicitly showed this technique. I can't seem to recall my exact search string.
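The examples in the thread all use purely numeric columns. If a data frame also holds character or factor columns, one cautious option is to touch only the numeric ones. The sketch below assumes that is what is wanted; mixed.df and is_num are made-up names, and the data are invented for illustration.

```r
# Made-up mixed-type data: one numeric, one character, one factor column.
mixed.df <- data.frame(n = c(1, NA, 3),
                       s = c("a", NA, "c"),
                       f = factor(c("u", NA, "v")),
                       stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric.
is_num <- vapply(mixed.df, is.numeric, logical(1))

# Replace NAs with 0 in the numeric columns only; the others are left alone.
mixed.df[is_num] <- lapply(mixed.df[is_num], function(x) { x[is.na(x)] <- 0; x })

print(mixed.df)
```

The non-numeric columns keep their NAs, which can then be handled separately with a replacement value suited to their type.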
Re: [R] replacing all NA's in a dataframe with zeros...
Since you can index a matrix or dataframe with a matrix of logicals, you can use is.na() to index all the NA locations and replace them all with 0 in one command.

    mydata.df <- as.data.frame(matrix(sample(c(as.numeric(NA), 1),
                               size = 30, replace = TRUE), nrow = 6))
    mydata.df
      V1 V2 V3 V4 V5
    1  1 NA  1  1  1
    2  1 NA NA NA  1
    3 NA NA  1 NA NA
    4 NA NA NA NA  1
    5 NA  1 NA NA  1
    6  1 NA NA  1  1
    is.na(mydata.df)
         V1    V2    V3    V4    V5
    1 FALSE  TRUE FALSE FALSE FALSE
    2 FALSE  TRUE  TRUE  TRUE FALSE
    3  TRUE  TRUE FALSE  TRUE  TRUE
    4  TRUE  TRUE  TRUE  TRUE FALSE
    5  TRUE FALSE  TRUE  TRUE FALSE
    6 FALSE  TRUE  TRUE FALSE FALSE
    mydata.df[is.na(mydata.df)] <- 0
    mydata.df
      V1 V2 V3 V4 V5
    1  1  0  1  1  1
    2  1  0  0  0  1
    3  0  0  1  0  0
    4  0  0  0  0  1
    5  0  1  0  0  1
    6  1  0  0  1  1

Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
email: [EMAIL PROTECTED]
tel: 604-675-8000 x7561

BCCRC Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C. V5Z 1L3
Canada

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of David L. Van Brunt, Ph.D.
Sent: Wed 3/14/2007 5:22 PM
To: R-Help List
Subject: [R] replacing all NA's in a dataframe with zeros...

I've seen how to replace the NA's in a single column within a data frame:

    mydata$ncigs[is.na(mydata$ncigs)] <- 0

But this is just one column... I have thousands of columns (!) that I need to do this for, and I can't figure out a way, outside of the dreaded loop, to replace all NA's in an entire data frame (all vars) without naming each var separately. Yikes.

I'm racking my brain on this. It seems like I must be staring at the obvious, but it eludes me. Searches have come up CLOSE, but not quite what I need.

Any pointers?

--
David L. Van Brunt, Ph.D.
mailto:[EMAIL PROTECTED]

"If Tyranny and Oppression come to this land, it will be in the guise of fighting a foreign enemy."
--James Madison