[R] Checking for duplicate rows in data frame efficiently
I wrote something to check for duplicate rows in a data frame, but it is too inefficient. Is there a way to do this without the nested loops? This code correctly indicates rows 1-7, 1-8, 2-9 and 7-8 are duplicates.

  m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4,
                5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7),
              ncol=5, byrow=TRUE)
  df <- data.frame(m)
  df
     X1 X2 X3 X4 X5
  1   1  1  1  1  1
  2   2  2  2  2  2
  3   6  6  6  6  6
  4   3  3  3  3  3
  5   4  4  4  4  4
  6   5  5  5  5  5
  7   1  1  1  1  1
  8   1  1  1  1  1
  9   2  2  2  2  2
  10  7  7  7  7  7

  compareTwoRows <- function(row1, row2){
    numCol <- 5
    logicalRow <- row1==row2
    duplicate <- sum(logicalRow)==numCol
    return(as.numeric(duplicate))}

  same <- matrix(0, byrow=TRUE, ncol=10, nrow=10)
  for (j in 1:9)
    for (k in (j+1):10)
      same[j,k] <- compareTwoRows(df[j,],df[k,])
  same
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
   [1,]    0    0    0    0    0    0    1    1    0     0
   [2,]    0    0    0    0    0    0    0    0    1     0
   [3,]    0    0    0    0    0    0    0    0    0     0
   [4,]    0    0    0    0    0    0    0    0    0     0
   [5,]    0    0    0    0    0    0    0    0    0     0
   [6,]    0    0    0    0    0    0    0    0    0     0
   [7,]    0    0    0    0    0    0    0    1    0     0
   [8,]    0    0    0    0    0    0    0    0    0     0
   [9,]    0    0    0    0    0    0    0    0    0     0
  [10,]    0    0    0    0    0    0    0    0    0     0

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
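One possible loop-free approach, offered as a sketch rather than a definitive answer: base R's duplicated() flags every row that repeats an earlier row, and pasting each row into a single key lets split() group the row numbers of identical rows. The names key, groups and pairs are made up for this illustration; the "\r" separator is an assumption that the data never contains that character.

```r
m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4,
              5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7),
            ncol = 5, byrow = TRUE)
df <- data.frame(m)

# Rows that repeat an earlier row (the first occurrence is not flagged)
dupRows <- which(duplicated(df))
dupRows
# [1] 7 8 9

# One text key per row; identical rows get identical keys
key <- do.call(paste, c(df, sep = "\r"))

# Row numbers grouped by key, keeping only keys that occur more than once
groups <- split(seq_len(nrow(df)), key)
groups <- groups[lengths(groups) > 1]

# Expand each group into its row pairs, matching the 1-7, 1-8, 7-8, 2-9 result
pairs <- do.call(rbind, lapply(unname(groups), function(i) t(combn(i, 2))))
pairs
```

If there are no duplicates at all, groups is empty and pairs comes back as NULL, so a caller would need to check for that. The only looping left is over the groups of duplicates, not over all row pairs.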
[R] shifted window of string
Basically I need to create a sliding window over a character vector. A way to explain this is:

  v <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y")
  window <- 5
  shift <- 2

I want a matrix of characters with window columns, filled with v by filling a row, then shifting over by shift and continuing to the next row until v is exhausted. You can assume v will evenly fit m, so the result needs to look like this matrix, where each row is shifted by 2 (in this case):

  m
        [,1] [,2] [,3] [,4] [,5]
   [1,] a    b    c    d    e
   [2,] c    d    e    f    g
   [3,] e    f    g    h    i
   [4,] g    h    i    j    k
   [5,] i    j    k    l    m
   [6,] k    l    m    n    o
   [7,] m    n    o    p    q
   [8,] o    p    q    r    s
   [9,] q    r    s    t    u
  [10,] s    t    u    v    w
  [11,] u    v    w    x    y

This needs to be very efficient as my data is large; loops would be too slow. Any ideas? It could also be done in a string and then put into the matrix, but I don't think this would be easier. I will want to put this in a function:

  shiftedMatrix <- function(v, window=5, shift=2){... return(m)}

thanks

dhs
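A minimal sketch of one vectorised way to do this, assuming (as the post says) that v fits the windows evenly and that length(v) >= window: build a matrix of indices with outer() and subset v once. The names starts and idx are hypothetical.

```r
shiftedMatrix <- function(v, window = 5, shift = 2) {
  # Starting position of each row: 1, 1 + shift, 1 + 2 * shift, ...
  starts <- seq(1, length(v) - window + 1, by = shift)
  # idx[i, j] = starts[i] + (j - 1): position in v of row i, column j
  idx <- outer(starts, seq_len(window) - 1, "+")
  # v[idx] returns the values in column-major order, so refill by column
  matrix(v[idx], nrow = length(starts))
}

m <- shiftedMatrix(letters[1:25], window = 5, shift = 2)
dim(m)
# [1] 11  5
m[11, ]
# [1] "u" "v" "w" "x" "y"
```

No explicit loop is involved; the cost is one index matrix of the same size as the result, which may matter for very large inputs.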
[R] color2D.matplot not giving colors
I am using color2D.matplot to plot a matrix about 400 by 200. The values in the matrix are 0:5 and NA. The resulting plot is not color, but shaded b/w. I tried to figure out how to add colors; I would like something like c("blue", "green", "red", "cyan", "green")

  #example
  motifx <- matrix(NA, nrow=100, ncol=20)
  motifx[,1:5] <- 1
  motifx[,6:10] <- 2
  motifx[,11:15] <- 3
  motifx[,15:19] <- 4
  motifx
  color2D.matplot(motifx, na.color="white", show.legend=TRUE)

or is there a better function to use than color2D.matplot?
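As one possible alternative, here is a sketch using base graphics image(), which maps values to an explicit colour vector through breaks. It assumes the values are the small integers of the example (1 to 4); the colour choice and the names cols, breaks and z are illustrative only. If I recall correctly, color2D.matplot also accepts a cellcolors matrix of colour names, which may be the more direct fix, but that is not shown here.

```r
motifx <- matrix(NA, nrow = 100, ncol = 20)
motifx[, 1:5] <- 1
motifx[, 6:10] <- 2
motifx[, 11:15] <- 3
motifx[, 15:19] <- 4

# One colour per value; breaks sit halfway between the integer values,
# so image() needs length(breaks) == length(cols) + 1
cols <- c("blue", "green", "red", "cyan")
breaks <- seq(0.5, 4.5, by = 1)

# image() draws z[i, j] at (x[i], y[j]); transpose and flip so that
# matrix row 1 ends up at the top of the plot, as in a printed matrix
z <- t(motifx[nrow(motifx):1, ])
image(x = seq_len(ncol(motifx)), y = seq_len(nrow(motifx)), z = z,
      col = cols, breaks = breaks,
      xlab = "column", ylab = "row (row 1 at top)", yaxt = "n")
```

NA cells are simply left undrawn, so they show the background colour. For data with values 0:5, cols would need six entries and breaks would run from -0.5 to 5.5.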
[R] aggregate by factor
I have a data frame with two columns, a factor and a numeric. I want to create a data frame with the factor, its frequency and the median of the numeric column.

  head(motifList)
    events   score
  1  aeijm -0.2500
  2  begjm -0.2500
  3  afgjm -0.2500
  4  afhjm -0.2500
  5  aeijm -0.2500
  6  aehjm  0.0833

To get the frequency table of events:

  motifTable <- as.data.frame(table(motifList$events))
  head(motifTable)
     Var1 Freq
  1 aeijm  110
  2 begjm   46
  3 afgjm  337
  4 afhjm  102
  5 aehjm  190
  6 adijm   18

Now get the score column back in.

  motifTable2 <- merge(motifList, motifTable, by="events")
  head(motifTable2)
    events percent freq
  1  adgjm      0.  111
  2  adgjm      NA  111
  3  adgjm  0.1333  111
  4  adgjm  0.0667  111
  5  adgjm -0.1667  111
  6  adgjm      NA  111

Then lastly, to aggregate on the events column, getting the median of the score:

  motifTable3 <- aggregate.data.frame(motifTable2, by=list(motifTable2$events), FUN=median, na.rm=TRUE)
  Error in median.default(X[[1L]], ...) : need numeric data

Which gives the error as events are a factor. Can someone enlighten me to a more obvious approach?

dhs
Re: [R] aggregate by factor
On 30 Jan 2010, at 4:20 PM, David Winsemius wrote:

> On Jan 30, 2010, at 4:09 PM, david hilton shanabrook wrote:
>
>> I have a data frame with two columns, a factor and a numeric. I want to create data frame with the factor, its frequency and the median of the numeric column
>>
>>   head(motifList)
>>     events   score
>>   1  aeijm -0.2500
>>   2  begjm -0.2500
>>   3  afgjm -0.2500
>>   4  afhjm -0.2500
>>   5  aeijm -0.2500
>>   6  aehjm  0.0833
>>
>> To get the frequency table of events:
>>
>>   motifTable <- as.data.frame(table(motifList$events))
>>   head(motifTable)
>>      Var1 Freq
>>   1 aeijm  110
>>   2 begjm   46
>>   3 afgjm  337
>>   4 afhjm  102
>>   5 aehjm  190
>>   6 adijm   18
>>
>> Now get the score column back in.
>>
>>   motifTable2 <- merge(motifList, motifTable, by="events")
>>   head(motifTable2)
>>     events percent freq
>>   1  adgjm      0.  111
>>   2  adgjm      NA  111
>>   3  adgjm  0.1333  111
>>   4  adgjm  0.0667  111
>>   5  adgjm -0.1667  111
>>   6  adgjm      NA  111
>>
>> Then lastly to aggregate on the events column getting the median of the score
>>
>>   motifTable3 <- aggregate.data.frame(motifTable2, by=list(motifTable2$events), FUN=median, na.rm=TRUE)
>>   Error in median.default(X[[1L]], ...) : need numeric data
>>
>> Which gives the error as events are a factor. Can someone enlighten me to a more obvious approach?
>
> I don't think grouping on a factor is the source of your error. You have NA's in your data and median will choke on those unless you specify na.rm=TRUE.

I thought the na.rm=TRUE in the aggregate function would do this (see above). I also tried it with

  medianRmNa <- function(data) {
    return(median(data, na.rm=TRUE))}
  motifTable3 <- aggregate.data.frame(motifTable2, by=list(motifTable2$events), FUN=medianRmNa)
  Error in median.default(data, na.rm = TRUE) : need numeric data

same error.

I did leave a line out of the above script,

  names(motifTable) <- c("events", "freq")

which helps explain why the merge works.

dhs
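A sketch of one way to get the frequency and the median per event, for anyone following the thread. The "need numeric data" error most likely arises because aggregate.data.frame applies FUN to every column of motifTable2, including the events factor itself, so restricting the aggregation to the score column avoids it. The six rows below are the head(motifList) rows from the post, not the full data; the names freqTable, medTable and medianScore are made up for this example.

```r
# Only the six rows shown in the post, standing in for the full motifList
motifList <- data.frame(
  events = factor(c("aeijm", "begjm", "afgjm", "afhjm", "aeijm", "aehjm")),
  score  = c(-0.25, -0.25, -0.25, -0.25, -0.25, 0.0833)
)

# Frequency of each event, with the columns named directly
freqTable <- as.data.frame(table(events = motifList$events),
                           responseName = "freq")

# Median of score only; the formula keeps the factor out of FUN
medTable <- aggregate(score ~ events, data = motifList,
                      FUN = median, na.rm = TRUE)
names(medTable)[2] <- "medianScore"

# all.x = TRUE keeps events whose scores are all NA (median becomes NA)
motifTable3 <- merge(freqTable, medTable, by = "events", all.x = TRUE)
motifTable3
```

The formula interface drops rows with NA scores before grouping (its default na.action is na.omit), so an event with no non-missing score would vanish from medTable; that is why the merge uses all.x = TRUE.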
[R] function in aggregate applied to specific columns only
I want to use aggregate with the mean function on specific columns

  gender <- factor(c("m", "m", "f", "f", "m"))
  student <- c(0001, 0002, 0003, 0003, 0001)
  score <- c(50, 60, 70, 65, 60)
  basicSub <- data.frame(student, gender, score)
  basicSubMean <- aggregate(basicSub, by=list(basicSub$student), FUN=mean, na.rm=TRUE)

This doesn't work, one cannot take the mean of a factor (gender). Is there any way of specifying which columns to use for the mean? I want to aggregate by student, obtaining mean scores, and assume any other factors are unchanging in a specific student, ie. gender.

Thanks
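Two possible ways to limit the mean to the score column, sketched under the post's own assumption that gender never changes within a student. The first passes only the score column to aggregate(); the second uses the formula interface and treats gender as an extra grouping variable so that it is carried through to the result. The name scoreOnly is hypothetical.

```r
gender  <- factor(c("m", "m", "f", "f", "m"))
student <- c(1, 2, 3, 3, 1)   # 0001 etc. in the post parse as these numbers
score   <- c(50, 60, 70, 65, 60)
basicSub <- data.frame(student, gender, score)

# Option 1: aggregate only the numeric column, grouped by student
scoreOnly <- aggregate(basicSub["score"],
                       by = list(student = basicSub$student),
                       FUN = mean, na.rm = TRUE)

# Option 2: group by student and gender so gender stays in the output;
# this gives one row per student only if gender is constant per student
basicSubMean <- aggregate(score ~ student + gender, data = basicSub,
                          FUN = mean, na.rm = TRUE)
basicSubMean
```

If gender did vary within a student, option 2 would return one row per student/gender combination instead of one per student, so it doubles as a check on that assumption.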