[R] Checking for duplicate rows in data frame efficiently

2010-07-13 Thread david hilton shanabrook
I wrote something to check for duplicate rows in a data frame, but it is too 
inefficient.  Is there a way to do this without the nested loops?  

This code correctly indicates rows 1-7, 1-8, 2-9 and 7-8 are duplicates.

> m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4, 
+ 5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7), ncol=5, byrow=TRUE)
> df <- data.frame(m)
> df
   X1 X2 X3 X4 X5
1   1  1  1  1  1
2   2  2  2  2  2
3   6  6  6  6  6
4   3  3  3  3  3
5   4  4  4  4  4
6   5  5  5  5  5
7   1  1  1  1  1
8   1  1  1  1  1
9   2  2  2  2  2
10  7  7  7  7  7
 
> compareTwoRows <- function(row1, row2){
+   numCol <- 5 
+   logicalRow <- row1==row2
+   duplicate <- sum(logicalRow)==numCol
+   return(as.numeric(duplicate))}
> 
> same <- matrix(0, byrow=TRUE, ncol=10, nrow=10)
> 
> for (j in 1:9)
+   for (k in (j+1):10)
+   same[j,k] <- compareTwoRows(df[j,],df[k,])
> 
> same
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    0    0    0    0    0    1    1    0     0
 [2,]    0    0    0    0    0    0    0    0    1     0
 [3,]    0    0    0    0    0    0    0    0    0     0
 [4,]    0    0    0    0    0    0    0    0    0     0
 [5,]    0    0    0    0    0    0    0    0    0     0
 [6,]    0    0    0    0    0    0    0    0    0     0
 [7,]    0    0    0    0    0    0    0    1    0     0
 [8,]    0    0    0    0    0    0    0    0    0     0
 [9,]    0    0    0    0    0    0    0    0    0     0
[10,]    0    0    0    0    0    0    0    0    0     0
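
One loop-free way to approach this, offered as a sketch rather than a definitive answer, is base R's duplicated(), which has a data frame method. The names dupLater, dupAny, key and groups below are illustrative and not from the original post. The grouping step pastes each row into a single string key, which assumes the separator character never occurs in the data.

```r
m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4,
              5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7),
            ncol = 5, byrow = TRUE)
df <- data.frame(m)

# TRUE for each row that repeats an earlier row
dupLater <- duplicated(df)

# TRUE for each row that has an identical row anywhere in the data frame
dupAny <- dupLater | duplicated(df, fromLast = TRUE)

# Collect the row numbers that share the same content.
# Assumption: "\r" never appears inside a cell value.
key <- do.call(paste, c(df, sep = "\r"))
groups <- split(seq_len(nrow(df)), key)
groups <- unname(groups[lengths(groups) > 1])
```

Here groups holds one integer vector per set of identical rows (rows 1, 7, 8 and rows 2, 9 for this data), from which the pairs 1-7, 1-8, 7-8 and 2-9 follow.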

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] shifted window of string

2010-06-14 Thread david hilton shanabrook
Basically, I need to create a sliding window over a string (stored as a vector 
of characters).  One way to explain this is:

> v <- 
+ c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y")
> window <- 5
> shift <- 2

I want a character matrix with window columns, filled from v one row at a 
time: fill a row, shift over by shift elements, and continue on the next row 
until v is exhausted.  You can assume v fits m evenly.

so the result needs to look like this matrix where each row is shifted 2 (in 
this case):

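> m
      [,1] [,2] [,3] [,4] [,5]
 [1,] "a"  "b"  "c"  "d"  "e" 
 [2,] "c"  "d"  "e"  "f"  "g" 
 [3,] "e"  "f"  "g"  "h"  "i" 
 [4,] "g"  "h"  "i"  "j"  "k" 
 [5,] "i"  "j"  "k"  "l"  "m" 
 [6,] "k"  "l"  "m"  "n"  "o" 
 [7,] "m"  "n"  "o"  "p"  "q" 
 [8,] "o"  "p"  "q"  "r"  "s" 
 [9,] "q"  "r"  "s"  "t"  "u" 
[10,] "s"  "t"  "u"  "v"  "w" 
[11,] "u"  "v"  "w"  "x"  "y" 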

This needs to be very efficient as my data is large, loops would be too slow.  
Any ideas?  It could also be done in a string and then put into the matrix but 
I don't think this would be easier.

I will want to put this in a function:

shiftedMatrix <- function(v, window=5, shift=2){...

return(m)}

thanks

dhs
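
A possible loop-free sketch of the function described above: build a matrix of indices with outer() and subset v once. The function name and arguments come from the post, but the body is only one assumed way to fill it in. It relies on v fitting evenly, as the post allows.

```r
shiftedMatrix <- function(v, window = 5, shift = 2) {
  # First index of each row: 1, 1 + shift, 1 + 2 * shift, ...
  starts <- seq(1, length(v) - window + 1, by = shift)
  # Row i, column j holds the index starts[i] + (j - 1)
  idx <- outer(starts, seq_len(window) - 1, "+")
  # v[idx] comes back column-major, so reshape with one row per start
  matrix(v[idx], nrow = length(starts))
}

v <- letters[1:25]   # "a" through "y", as in the example
m <- shiftedMatrix(v, window = 5, shift = 2)
```

The only vectorised work is one outer() call and one subsetting operation, so the cost grows with the number of cells in the result rather than with an interpreted loop over rows.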


[R] color2D.matplot not giving colors

2010-03-06 Thread david hilton shanabrook
I am using color2D.matplot to plot a matrix about 400 by 200.  

The values in the matrix are 0:5 and NA.  The resulting plot is not in color, 
but shaded black and white.  I tried to figure out how to add colors; I would 
like something like c("blue", "green", "red", "cyan", "green")

#example
library(plotrix)  # provides color2D.matplot
motifx <- matrix(NA, nrow=100,ncol=20)
motifx[,1:5] <- 1
motifx[,6:10] <- 2
motifx[,11:15] <- 3
motifx[,15:19] <- 4
motifx
color2D.matplot(motifx, na.color="white",show.legend=TRUE)

or is there a better function to use than color2D.matplot?
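
As one possible alternative, base R's image() can draw a matrix with one fixed color per value when it is given explicit breaks. The sketch below reuses the example matrix. The palette valueCols and the names z and breaks are assumptions for illustration, and only the values 1 to 4 from the example are covered. (color2D.matplot itself also has a cellcolors argument that takes a matrix of colors, which may be worth a look.)

```r
motifx <- matrix(NA, nrow = 100, ncol = 20)
motifx[, 1:5]   <- 1
motifx[, 6:10]  <- 2
motifx[, 11:15] <- 3
motifx[, 15:19] <- 4

# Assumed palette: one color for each of the values 1, 2, 3, 4
valueCols <- c("blue", "green", "red", "cyan")

# One bin per value: (0.5, 1.5], (1.5, 2.5], ...
breaks <- seq(0.5, length(valueCols) + 0.5, by = 1)

# image() draws z[i, j] at (x[i], y[j]); reverse the rows and transpose
# so that matrix row 1 ends up at the top of the plot
z <- t(motifx[nrow(motifx):1, ])

# NA cells are not drawn, so they show the (white) background
image(x = seq_len(nrow(z)), y = seq_len(ncol(z)), z = z,
      col = valueCols, breaks = breaks,
      xlab = "Column", ylab = "Row")
```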


[R] aggregate by factor

2010-01-30 Thread david hilton shanabrook
I have a data frame with two columns, a factor and a numeric.  I want to create 
a data frame with the factor, its frequency and the median of the numeric column.
> head(motifList)
  events score
1  aeijm -0.2500
2  begjm -0.2500
3  afgjm -0.2500
4  afhjm -0.2500
5  aeijm -0.2500
6  aehjm  0.0833

To get the frequency table of events:

> motifTable <- as.data.frame(table(motifList$events))
> head(motifTable)
   Var1 Freq
1 aeijm  110
2 begjm   46
3 afgjm  337
4 afhjm  102
5 aehjm  190
6 adijm   18
 

Now get the score column back in.

> motifTable2 <- merge(motifList, motifTable, by="events")
> head(motifTable2)
  events percent freq
1  adgjm  0.  111
2  adgjm  NA  111
3  adgjm  0.1333  111
4  adgjm  0.0667  111
5  adgjm -0.1667  111
6  adgjm  NA  111
 

Then lastly, aggregate on the events column to get the median of the score:
> motifTable3 <- aggregate.data.frame(motifTable2, by=list(motifTable2$events), 
+ FUN=median, na.rm=TRUE)
Error in median.default(X[[1L]], ...) : need numeric data

This gives the error because events is a factor.  Can someone point me to a 
more obvious approach?

dhs
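
One possible shortcut, sketched here with made-up data (the motifList values below are invented for illustration and are not the poster's data), is to compute the counts with table() and the medians with tapply(), then bind them into one data frame. Both order their results by factor level, so the two vectors line up.

```r
# Toy stand-in for motifList; values are assumptions for illustration
motifList <- data.frame(
  events = factor(c("aeijm", "begjm", "afgjm", "aeijm", "aeijm", "begjm")),
  score  = c(-0.25, -0.25, 0.1, 0.0833, 0.5, NA)
)

# Count of each level, and median score per level ignoring NAs
freq <- table(motifList$events)
med  <- tapply(motifList$score, motifList$events, median, na.rm = TRUE)

motifSummary <- data.frame(
  events = names(freq),
  freq   = as.vector(freq),
  median = as.vector(med[names(freq)])  # index by name to guarantee alignment
)
```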


Re: [R] aggregate by factor

2010-01-30 Thread david hilton shanabrook

On 30 Jan 2010, at 4:20 PM, David Winsemius wrote:

 
> On Jan 30, 2010, at 4:09 PM, david hilton shanabrook wrote:
>
>> I have a data frame with two columns, a factor and a numeric.  I want to 
>> create data frame with the factor, its frequency and the median of the 
>> numeric column
>> > head(motifList)
>>   events score
>> 1  aeijm -0.2500
>> 2  begjm -0.2500
>> 3  afgjm -0.2500
>> 4  afhjm -0.2500
>> 5  aeijm -0.2500
>> 6  aehjm  0.0833
>>
>> To get the frequency table of events:
>>
>> > motifTable <- as.data.frame(table(motifList$events))
>> > head(motifTable)
>>    Var1 Freq
>> 1 aeijm  110
>> 2 begjm   46
>> 3 afgjm  337
>> 4 afhjm  102
>> 5 aehjm  190
>> 6 adijm   18
>>
>>
>> Now get the score column back in.
>>
>> > motifTable2 <- merge(motifList, motifTable, by="events")
>> > head(motifTable2)
>>   events percent freq
>> 1  adgjm  0.  111
>> 2  adgjm  NA  111
>> 3  adgjm  0.1333  111
>> 4  adgjm  0.0667  111
>> 5  adgjm -0.1667  111
>> 6  adgjm  NA  111
>>
>>
>> Then lastly to aggregate on the events column getting the median of the score
>> > motifTable3 <- aggregate.data.frame(motifTable2, 
>> + by=list(motifTable2$events), FUN=median, na.rm=TRUE)
>> Error in median.default(X[[1L]], ...) : need numeric data
>>
>> Which gives the error as events are a factor.  Can someone enlighten me to a 
>> more obvious approach?
>
> I don't think grouping on a factor is the source of your error. You have NA's 
> in your data and median will choke on those unless you specify na.rm=TRUE.
>
> -- 

I thought the na.rm=TRUE in the aggregate function would do this (see above).  
I also tried it with 

> medianRmNa <- function(data) {
+ return(median(data, na.rm=TRUE))}

> motifTable3 <- aggregate.data.frame(motifTable2, by=list(motifTable2$events), 
+ FUN=medianRmNa)
Error in median.default(data, na.rm = TRUE) : need numeric data

same error.

I did leave a line out of the above script, 

names(motifTable) <- c("events", "freq")
which helps explain why the merge works

dhs
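
For what it is worth, aggregate.data.frame applies FUN to every column of its first argument, so passing the whole data frame also hands the events factor to median, which would account for the "need numeric data" message regardless of na.rm. A sketch that passes only the numeric column and then merges the frequencies back in (the toy motifList values and the names motifMedian and motifSummary are assumptions for illustration):

```r
# Toy stand-in for motifList; values are assumptions for illustration
motifList <- data.frame(
  events = factor(c("aeijm", "begjm", "aeijm", "begjm", "aeijm")),
  score  = c(-0.25, NA, 0.0833, 0.5, 0.25)
)

# Aggregate only the numeric column; na.rm is passed through to median
motifMedian <- aggregate(motifList["score"],
                         by = list(events = motifList$events),
                         FUN = median, na.rm = TRUE)

# Frequency table with matching column names, then join on events
motifTable <- as.data.frame(table(motifList$events))
names(motifTable) <- c("events", "freq")
motifSummary <- merge(motifTable, motifMedian, by = "events")
```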




[R] function in aggregate applied to specific columns only

2010-01-03 Thread david hilton shanabrook
I want to use aggregate with the mean function on specific columns

gender <- factor(c("m", "m", "f", "f", "m"))
student <- c(0001, 0002, 0003, 0003, 0001)
score <- c(50, 60, 70, 65, 60)
basicSub <- data.frame(student, gender, score)
basicSubMean <- aggregate(basicSub, by=list(basicSub$student), FUN=mean, 
na.rm=TRUE)

This doesn't work, since one cannot take the mean of a factor (gender).  Is 
there any way of specifying which columns to use for the mean?  I want to 
aggregate by student, obtaining mean scores, and assume any other factors 
(i.e. gender) are unchanging within a specific student.

Thanks
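
One way this might be done, given here as a sketch, is the formula interface of aggregate(): the left-hand side names the column to average, and the right-hand side names the grouping columns, so gender is carried along as a group rather than averaged. This assumes gender really is constant within a student. Note that the formula method drops rows with an NA in any variable it uses, by default.

```r
gender  <- factor(c("m", "m", "f", "f", "m"))
student <- c(1, 2, 3, 3, 1)
score   <- c(50, 60, 70, 65, 60)
basicSub <- data.frame(student, gender, score)

# Mean of score only, grouped by student and gender
basicSubMean <- aggregate(score ~ student + gender, data = basicSub, FUN = mean)
```

With more than one numeric column, cbind(score, otherScore) ~ student + gender follows the same pattern, where otherScore is a hypothetical second column.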