Re: [R] missing handling

2005-10-04 Thread Weiwei Shi
Hi, Jim:
I tried your code and got the following error:
trn1 <- read.table('trn1.svm', header=F, na.string='.', sep='|')
Med <- apply(trn1, 2, median, na.rm=T)
Ind <- which(is.na(trn1), arr.ind=T)
trn1[Ind] <- Med[Ind[,'col']]
Error in [<-.data.frame(`*tmp*`, Ind, value = c(1.00802124455,
1.00802124455, :
only logical matrix subscripts are allowed in replacement


I cannot figure out why.

Thanks for help,

On 9/27/05, jim holtman [EMAIL PROTECTED] wrote:

 Use 'which(...arr.ind=T)'
 x.1
       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
  [1,]    6   10    3    4   10    7    9    8    4    10
  [2,]    8    7    4    7    4    8    3   NA    3     4
  [3,]    7    7   10   10    3    5    3    2    2     2
  [4,]    3    4    5   10   10    2    6    9    4     5
  [5,]    3    5    9    5    6   NA    3   NA    6     7
  [6,]    9    6   10    5   10    4    2   10   NA     5
  [7,]    5    2    5   10    3    7    6    4    6     8
  [8,]    2    6    1    8    9    2    7    8    3     8
  [9,]    9    1    4    9    8   10    2   NA    1     7
 [10,]    2    4    8    7   NA    4    3   NA    5     5
 x.4
 [1] 5.5 5.5 5.0 7.5 8.0 5.0 3.0 8.0 4.0 6.0
 Med <- apply(x.1, 2, median, na.rm=T) # get median
 Ind <- which(is.na(x.1), arr.ind=T) # determine which are NA
 x.1[Ind] <- Med[Ind[,'col']] # replace with median
 x.1
       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
  [1,]    6   10    3    4   10    7    9    8    4    10
  [2,]    8    7    4    7    4    8    3    8    3     4
  [3,]    7    7   10   10    3    5    3    2    2     2
  [4,]    3    4    5   10   10    2    6    9    4     5
  [5,]    3    5    9    5    6    5    3    8    6     7
  [6,]    9    6   10    5   10    4    2   10    4     5
  [7,]    5    2    5   10    3    7    6    4    6     8
  [8,]    2    6    1    8    9    2    7    8    3     8
  [9,]    9    1    4    9    8   10    2    8    1     7
 [10,]    2    4    8    7    8    4    3    8    5     5
 


  On 9/27/05, Weiwei Shi [EMAIL PROTECTED] wrote:

  Hi,
  I have the following code to replace missing values with the median,
  assuming missing values only occur in continuous variables:
 
  trn1 <- read.table('trn1.fv', header=F, na.string='.', sep='|')
 
  # median
  m.trn1 <- sapply(1:ncol(trn1), function(i) median(trn1[,i], na.rm=T))
 
  # replace
  trn2 <- trn1
  for (each in 1:nrow(trn1)){
    index.missing = which(is.na(trn1[each,]))
    trn2[each,] <- replace(trn1[each,], index.missing, m.trn1[index.missing])
  }
 
 
  Can anyone suggest ways to improve it, since replacing 10 rows takes 1.5 sec:
  system.time(for (each in 1:10){index.missing=which(is.na(trn1[each,]));
  trn2[each,] <- replace(trn1[each,], index.missing, m.trn1[index.missing]);})
  [1] 1.53 0.00 1.53 0.00 0.00
 
 
  Another general question:
  are there any packages in R for handling missing values?
 
  Thanks,
 
  --
  Weiwei Shi, Ph.D
 
  Did you always know?
  No, I did not. But I believed...
  ---Matrix III
 
  [[alternative HTML version deleted]]
 
  __
  R-help@stat.math.ethz.ch mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide!
  http://www.R-project.org/posting-guide.html
 



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 247 0281

 What is the problem you are trying to solve?




--
Weiwei Shi, Ph.D

Did you always know?
No, I did not. But I believed...
---Matrix III



Re: [R] missing handling

2005-10-04 Thread Prof Brian Ripley
On Tue, 4 Oct 2005, Weiwei Shi wrote:

 Hi, Jim:
 I tried your code and got the following error:
 trn1 <- read.table('trn1.svm', header=F, na.string='.', sep='|')
 Med <- apply(trn1, 2, median, na.rm=T)
 Ind <- which(is.na(trn1), arr.ind=T)
 trn1[Ind] <- Med[Ind[,'col']]
 Error in [<-.data.frame(`*tmp*`, Ind, value = c(1.00802124455,
 1.00802124455, :
 only logical matrix subscripts are allowed in replacement


 I cannot figure out why.

Read the help for [<-.data.frame to be told the answer.

A data frame (as given by read.table) is not a matrix, as the example 
presumably was.  Indexing whole matrices at once is efficient, but it 
hides loops for data frames.
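
As an illustration of that distinction (a minimal sketch, not part of the original reply): if every column of the data frame is numeric, it can be converted with as.matrix(), after which the matrix-index replacement from the earlier example applies. The toy data frame `df` and its values are made up for this example; the real trn1 file is not available here.

```r
# Hypothetical stand-in for a data frame returned by read.table()
df <- data.frame(V1 = c(1, NA, 3), V2 = c(NA, 5, 7))

# An all-numeric data frame converts to a numeric matrix
m <- as.matrix(df)

Med <- apply(m, 2, median, na.rm = TRUE)   # column medians, ignoring NAs
Ind <- which(is.na(m), arr.ind = TRUE)     # (row, col) positions of the NAs
m[Ind] <- Med[Ind[, "col"]]                # replace each NA with its column median

# Convert back if a data frame is needed afterwards
df2 <- as.data.frame(m)
```

This only makes sense when all columns share one type; as.matrix() on a data frame with factor or character columns yields a character matrix, and median() would then fail.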

You will not do better than looping over columns for a data frame, but you 
certainly do not need to loop over rows which is very inefficient. 
Something like

trn2 <- trn1
for(i in names(trn2)) {
 Med <- median(trn2[[i]], na.rm = TRUE)
 trn2[i, is.na(trn2[[i]])] <- Med
}




-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] missing handling

2005-10-04 Thread Don MacQueen
At 8:35 PM +0100 10/4/05, Prof Brian Ripley wrote:
On Tue, 4 Oct 2005, Weiwei Shi wrote:

  Hi, Jim:
  I tried your code and got the following error:
  trn1 <- read.table('trn1.svm', header=F, na.string='.', sep='|')
  Med <- apply(trn1, 2, median, na.rm=T)
  Ind <- which(is.na(trn1), arr.ind=T)
  trn1[Ind] <- Med[Ind[,'col']]
  Error in [<-.data.frame(`*tmp*`, Ind, value = c(1.00802124455,
  1.00802124455, :
  only logical matrix subscripts are allowed in replacement


  I cannot figure out why.

Read the help for [<-.data.frame to be told the answer.

A data frame (as given by read.table) is not a matrix, as the example
presumably was.  Indexing whole matrices at once is efficient, but it
hides loops for data frames.

You will not do better than looping over columns for a data frame, but you
certainly do not need to loop over rows which is very inefficient.
Something like

trn2 <- trn1
for(i in names(trn2)) {
  Med <- median(trn2[[i]], na.rm = TRUE)
  trn2[i, is.na(trn2[[i]])] <- Med
}


But exchange the indices, so that the row selector comes first and the
column name second:

trn2[ is.na(trn2[[i]]) , i] <- Med
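
Putting the column loop and the exchanged indices together gives something like the following sketch. The small data frame is a made-up stand-in for trn1 (the real file is not available here), and all columns are assumed to be numeric.

```r
# Hypothetical stand-in for the data frame read from 'trn1.svm'
trn1 <- data.frame(V1 = c(1, NA, 3), V2 = c(NA, 5, 7))

trn2 <- trn1
for (i in names(trn2)) {
  Med <- median(trn2[[i]], na.rm = TRUE)   # median of column i, ignoring NAs
  trn2[is.na(trn2[[i]]), i] <- Med         # rows with NA first, column name second
}
```

The loop runs once per column rather than once per row, so its cost grows with the number of variables and not with the number of observations.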


  


-- 
--
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA



Re: [R] missing handling

2005-09-27 Thread jim holtman
Use 'which(...arr.ind=T)'
x.1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    6   10    3    4   10    7    9    8    4    10
 [2,]    8    7    4    7    4    8    3   NA    3     4
 [3,]    7    7   10   10    3    5    3    2    2     2
 [4,]    3    4    5   10   10    2    6    9    4     5
 [5,]    3    5    9    5    6   NA    3   NA    6     7
 [6,]    9    6   10    5   10    4    2   10   NA     5
 [7,]    5    2    5   10    3    7    6    4    6     8
 [8,]    2    6    1    8    9    2    7    8    3     8
 [9,]    9    1    4    9    8   10    2   NA    1     7
[10,]    2    4    8    7   NA    4    3   NA    5     5
x.4
[1] 5.5 5.5 5.0 7.5 8.0 5.0 3.0 8.0 4.0 6.0
Med <- apply(x.1, 2, median, na.rm=T) # get median
Ind <- which(is.na(x.1), arr.ind=T) # determine which are NA
x.1[Ind] <- Med[Ind[,'col']] # replace with median
x.1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    6   10    3    4   10    7    9    8    4    10
 [2,]    8    7    4    7    4    8    3    8    3     4
 [3,]    7    7   10   10    3    5    3    2    2     2
 [4,]    3    4    5   10   10    2    6    9    4     5
 [5,]    3    5    9    5    6    5    3    8    6     7
 [6,]    9    6   10    5   10    4    2   10    4     5
 [7,]    5    2    5   10    3    7    6    4    6     8
 [8,]    2    6    1    8    9    2    7    8    3     8
 [9,]    9    1    4    9    8   10    2    8    1     7
[10,]    2    4    8    7    8    4    3    8    5     5



 On 9/27/05, Weiwei Shi [EMAIL PROTECTED] wrote:

 Hi,
 I have the following code to replace missing values with the median,
 assuming missing values only occur in continuous variables:

 trn1 <- read.table('trn1.fv', header=F, na.string='.', sep='|')

 # median
 m.trn1 <- sapply(1:ncol(trn1), function(i) median(trn1[,i], na.rm=T))

 # replace
 trn2 <- trn1
 for (each in 1:nrow(trn1)){
   index.missing = which(is.na(trn1[each,]))
   trn2[each,] <- replace(trn1[each,], index.missing, m.trn1[index.missing])
 }


 Can anyone suggest ways to improve it, since replacing 10 rows takes 1.5 sec:
 system.time(for (each in 1:10){index.missing=which(is.na(trn1[each,]));
 trn2[each,] <- replace(trn1[each,], index.missing, m.trn1[index.missing]);})
 [1] 1.53 0.00 1.53 0.00 0.00


 Another general question:
 are there any packages in R for handling missing values?

 Thanks,

 --
 Weiwei Shi, Ph.D

 Did you always know?
 No, I did not. But I believed...
 ---Matrix III





--
Jim Holtman
Cincinnati, OH
+1 513 247 0281

What is the problem you are trying to solve?
