[R] persuade tabulate function to count NAs in a data frame
Hi,

I'd like to ask you a question again. It is basically about data frames, NAs and the tabulate function.

I have this data frame, which I already used in one of my previous questions. It is intentionally simple; my real 'df' data frame is much bigger, and again I don't want to annoy anyone with huge data sets. So, my data:

id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,4)
df <- data.frame(id,a,b,c,d,e)
df

I have managed to calculate the distributions of the numbers occurring in columns 'a' to 'e', with the counts grouped by the id numbers in column 'id'. It works fine, check it:

matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,2]))))[[1]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,3]))))[[2]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4]))))[[3]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,5]))))[[4]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,6]))))[[5]])),ncol=4,nrow=3,byrow=TRUE)

Now my problem is: what if my data frame contains NA values here and there, and I want the built-in tabulate function to count these NAs as well, i.e. to report how many NAs occur in each group?
Here's my modified data frame with the NAs:

id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <- c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,NA,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,NA,1,4)
df <- data.frame(id,a,b,c,d,e)
df

At first I tried something like this (the only change is that I added exclude=NULL to the factor call):

unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,2],exclude=NULL))))[[1]])

At least my code now recognizes that column 'a' has 4 different levels (1,2,3,NA) and not only three (1,2,3). Check it here:

nlevels(factor(df[,2],exclude=NULL))

But the result shows that the NAs are still not counted. It says

3 0 6 0(!) 4 3 3 0 4 1 5 0

instead of the correct

3 0 6 1(!) 4 3 3 0 4 1 5 0

Or in the case of

unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4],exclude=NULL))))[[3]])

it says

2 4 4 0 2 3 4 0(!) 1 5 4 0

instead of the correct

2 4 4 0 2 3 4 1(!) 1 5 4 0

etc.

Does anyone have an idea how to persuade the tabulate function to count NAs? Is it possible at all?

Thanks very much and have a pleasant weekend,
Laszlo

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
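A side note on why exclude=NULL alone does not help, as a minimal sketch that assumes only base R. tabulate counts the integer codes 1..nbins and ignores NA in its input, so a larger nbins only adds an empty bin. If the values are first turned into a factor in which NA is a level of its own (here with addNA), the NA gets an integer code and is counted. The helper name count_with_na is made up for this illustration and is not part of R.

```r
# Column 'a' for id 1 from the modified data frame
a1 <- c(NA, 1, 3, 3, 1, 3, 3, 3, 3, 1)

# tabulate() ignores NA, so the fourth bin stays empty
plain <- tabulate(a1, nbins = 4)

# addNA() makes NA a real factor level (integer code 4), which tabulate() counts
f <- addNA(factor(a1, levels = 1:3))
with_na <- tabulate(f, nbins = nlevels(f))

# Hypothetical helper: counts per level, with NA as the last bin
count_with_na <- function(x, levels) {
  f <- addNA(factor(x, levels = levels))
  tabulate(f, nbins = nlevels(f))
}

id <- rep(1:3, each = 10)
a  <- c(NA, 1, 3, 3, 1, 3, 3, 3, 3, 1,
        3, 2, 1, 2, 1, 3, 3, 2, 1, 1,
        1, 3, 1, 3, 3, 3, 2, 1, 1, 3)

# One row per id; the columns are the levels 1, 2, 3 and NA
by_id <- t(sapply(split(a, id), count_with_na, levels = 1:3))
print(by_id)
```

Passing the levels explicitly keeps the bins aligned across groups even when a value does not occur in one of them.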
Re: [R] persuade tabulate function to count NAs in a data frame
Hi,

I'll top-post as the original Q is very lengthy:

tabs <- lapply(df[,2:6], function(x, id){ t(table(addNA(x), id, useNA = "ifany")) }, df$id)

is one way of doing what you want. More details are here:

http://stackoverflow.com/questions/5362702/persuading-tabulate-function-to-count-nas-in-a-data-frame-in-r

where you also posted your Q.

HTH

G

On Sat, 2011-03-19 at 15:58 +0100, Bodnar Laszlo EB_HU wrote:
> Hi, I'd like to ask you a question again. It is basically about data frames, NAs and tabulate function.
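To see what the table() call suggested in this reply returns, here is a short sketch on part of the modified data from the original post. It assumes only base R and keeps just columns 'a' and 'c' (the two shown in the question) to stay brief, so it selects df[, -1] rather than df[, 2:6].

```r
# Columns 'a' and 'c' of the modified data frame, with their NAs
df <- data.frame(
  id = rep(1:3, each = 10),
  a  = c(NA, 1, 3, 3, 1, 3, 3, 3, 3, 1,
         3, 2, 1, 2, 1, 3, 3, 2, 1, 1,
         1, 3, 1, 3, 3, 3, 2, 1, 1, 3),
  c  = c(1, 3, 2, 3, 2, 1, 2, 3, 3, 2,
         2, 3, NA, 2, 3, 3, 3, 1, 1, 2,
         3, 3, 1, 2, 2, 3, 2, 2, 3, 2)
)

# One cross-tabulation per column: rows are ids, columns are values plus NA
tabs <- lapply(df[, -1], function(x, id) {
  t(table(addNA(x), id, useNA = "ifany"))
}, df$id)

print(tabs$a)
print(tabs$c)
```

Each element of tabs is a table with one row per id, so the NA counts the original post asked for appear in the last column.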
Re: [R] persuade tabulate function to count NAs in a data frame
On 03/20/2011 01:58 AM, Bodnar Laszlo EB_HU wrote:
> Hi, I'd like to ask you a question again. It is basically about data frames, NAs and tabulate function.

Hi Bodnar,
The freq function in the prettyR package might do what you want.

Jim