[R] Rounding and printing

2009-10-29 Thread Alan Cohen
Hello,

I am trying to print a table with numbers all rounded to the same number of 
digits (one after the decimal), but R does not print the ".0" for whole 
numbers.  I can go in and fix it one number at a time, but I'd like to 
understand the principle.  Here's an example of the code.  The problem is the 
13th element, which prints as 21 rather than 21.0:
> nvb_deaths <- round(ss[,10]/100,digits=1)
> nvb_deaths
 [1] 56.5  1.6  0.2  3.9  0.1  2.2  0.2  2.6  1.5  4.1  1.1  6.1 21.0
> nvb_dths <- paste(nvb_deaths," (",round(100*nvb_deaths/nvb_deaths[1],digits=1),"%)",sep="")
> nvb_dths
 [1] "56.5 (100%)" "1.6 (2.8%)"  "0.2 (0.4%)"  "3.9 (6.9%)"  "0.1 (0.2%)"  "2.2 (3.9%)"
 [7] "0.2 (0.4%)"  "2.6 (4.6%)"  "1.5 (2.7%)"  "4.1 (7.3%)"  "1.1 (1.9%)"  "6.1 (10.8%)"
[13] "21 (37.2%)"
> print(nvb_deaths,digits=1)
 [1] 56.5  1.6  0.2  3.9  0.1  2.2  0.2  2.6  1.5  4.1  1.1  6.1 21.0
> paste(print(nvb_deaths,digits=1)," (",round(100*nvb_deaths/nvb_deaths[1],digits=1),"%)",sep="")
 [1] 56.5  1.6  0.2  3.9  0.1  2.2  0.2  2.6  1.5  4.1  1.1  6.1 21.0
 [1] "56.5 (100%)" "1.6 (2.8%)"  "0.2 (0.4%)"  "3.9 (6.9%)"  "0.1 (0.2%)"  "2.2 (3.9%)"
 [7] "0.2 (0.4%)"  "2.6 (4.6%)"  "1.5 (2.7%)"  "4.1 (7.3%)"  "1.1 (1.9%)"  "6.1 (10.8%)"
[13] "21 (37.2%)"

I'm running R v2.8.1 on Windows.  Any help is much appreciated.

Cheers,
Alan Cohen
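
A possible way to keep the trailing zero, offered as a sketch rather than a tested fix for the poster's data: paste() converts numbers with as.character(), which drops trailing zeros, while sprintf() formats every value to a fixed number of decimals. The vector below is typed in from the printed output in the post as a stand-in for round(ss[,10]/100, digits=1), and the name pct is made up for illustration.

```r
# Stand-in for the rounded values shown in the post (ss itself is not available)
nvb_deaths <- c(56.5, 1.6, 0.2, 3.9, 0.1, 2.2, 0.2, 2.6, 1.5, 4.1, 1.1, 6.1, 21.0)

# Percentage of the first element; 'pct' is a hypothetical helper name
pct <- 100 * nvb_deaths / nvb_deaths[1]

# "%.1f" always prints one digit after the decimal point; "%%" is a literal %
nvb_dths <- sprintf("%.1f (%.1f%%)", nvb_deaths, pct)
nvb_dths[13]
# [1] "21.0 (37.2%)"
```

Note that this also turns "100%" into "100.0%"; formatC(nvb_deaths, format = "f", digits = 1) inside the original paste() call would be an alternative if only the first number should be fixed-width.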

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Drawing lines in margins

2009-07-29 Thread Alan Cohen
Hi all,

Quick question: What function can I use to draw a line in the margin of a plot?
segments() and lines() both stop at the margin.

In case the answer depends on exactly what I'm trying to do, see below.  I'm 
using R v. 2.8.1 on Windows XP.

Cheers,
Alan

I'm trying to make a horizontal barplot with a column of numbers on the right 
side.  I'd like to put a line between the column header and the numbers.  The 
following reconstructs the idea - just copy and paste it in:
aa <- 1:10
plot.mtx2<-cbind(aa,aa+1)
colnames(plot.mtx2)<-c("Male","Female")
lci2<- cbind(aa-1,aa)
uci2<- cbind(aa+1,aa+2)
par(mar=c(5,6,4,5))
cols <- c("grey79","grey41")
bplot2<-barplot(t(plot.mtx2),beside=TRUE,xlab="Malaria death rates per 100,000",
names.arg=paste("state",aa,sep=""),legend.text=F,las=1,xlim=c(0,13), horiz=T, 
col=cols,
main="Malaria death rates by state and sex")
legend(8,6,legend=c("Female","Male"),fill=cols[order(2:1)])
segments(y0=bplot2, y1=bplot2, x0=t(lci2), x1=t(uci2))
mtext(10*(aa+1),side=4,line=4,at=seq(3,3*length(aa),by=3)-0.35,padj=0.5,adj=1,las=1,cex=0.85)
mtext(10*aa,side=4,line=4,at=seq(2,3*length(aa)-1,by=3)-0.65,padj=0.5,adj=1,las=1,cex=0.85)
mtext("Estimated",side=4,line=3,at=3*length(aa)+2.75,padj=0.5,adj=0.5,las=1,cex=0.85)
mtext("Deaths",side=4,line=3,at=3*length(aa)+1.25,padj=0.5,adj=0.5,las=1,cex=0.85)
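
One approach that may help, sketched under assumptions: clipping is controlled by the graphical parameter xpd, and segments() accepts it as an argument. With xpd = NA nothing is clipped to the plot region, so a segment given in user coordinates can extend into the margin. The bar heights and the segment position below are invented for illustration and would need adjusting to sit under the "Estimated Deaths" header.

```r
# Draw to a temporary PDF so the sketch also runs non-interactively
pdf(file = tempfile(fileext = ".pdf"))
par(mar = c(5, 6, 4, 5))
bp <- barplot(1:10, horiz = TRUE, xlim = c(0, 13))  # toy bars, not the poster's data

usr <- par("usr")  # plot-region limits in user coordinates: x1, x2, y1, y2

# xpd = NA turns clipping off, so this short horizontal line lands
# to the right of the plot region, in the right-hand margin
segments(x0 = usr[2] + 0.5, x1 = usr[2] + 2.5,
         y0 = usr[4] - 1,   y1 = usr[4] - 1, xpd = NA)
dev.off()
```

Setting par(xpd = NA) before calling segments() or lines() should have the same effect for all subsequent drawing.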


[R] Long to wide format without time variable

2009-06-23 Thread Alan Cohen
Hi all,

I am trying to convert a data set of physician death codings (each individual's 
cause of death is coded by multiple physicians) from long to wide format, but 
the "reshape" function doesn't seem to work because it requires a "time" 
variable to identify the sequence among the repeated observations within 
individuals.  My data set has no order, and different numbers of physicians 
code each death, up to 23.  It is also quite large, so for-loops are very slow, 
and I'll need to repeat the procedure multiple times.  So I'm looking for a 
processor-efficient way to replicate "reshape" without a time variable.

Thanks in advance for any help you can provide.  A worked example and some code 
I've tried are below.  I'm working with R v2.8.1 on Windows XP Professional.

Cheers,
Alan Cohen

Here's what my data look like now:

> id <- rep(1:5,2)
> COD <- c("A01","A02","A03","A04","A05","B01","A02","B03","B04","A05")
> MDid <- c(1:6,3,5,7,2)
> data <- as.data.frame(cbind(id,COD,MDid))
> data
   id COD MDid
1   1 A01    1
2   2 A02    2
3   3 A03    3
4   4 A04    4
5   5 A05    5
6   1 B01    6
7   2 A02    3
8   3 B03    5
9   4 B04    7
10  5 A05    2

And here's what I'd like them to look like:

> id2 <- 1:5
> COD.1 <- c("A01","A02","A03","A04","A05")
> COD.2 <- c("B01","A02","B03","B04","A05")
> MDid.1 <- 1:5
> MDid.2 <-c(6,3,5,7,2)
> data.wide <- as.data.frame(cbind(id2,COD.1,COD.2,MDid.1,MDid.2))
> data.wide
  id2 COD.1 COD.2 MDid.1 MDid.2
1   1   A01   B01      1      6
2   2   A02   A02      2      3
3   3   A03   B03      3      5
4   4   A04   B04      4      7
5   5   A05   A05      5      2

Here's the for-loop that's very slow (with or without the if-clauses activated):

ids<-unique(data$id)
ct<-length(ids)
codes<-matrix(0,ct,11)
colnames(codes)<-c("ID","ICD1","Coder1","ICD2","Coder2","ICD3","Coder3","ICD4","Coder4","ICD5","Coder5")
j<-0
for (i in 1:ct){
  kkk <- ids[i]
  rpt<-data[data$id==kkk,]
  j<-max(j,nrow(rpt))
  codes[i,1]<-kkk
  codes[i,2]<-rpt$ICDCode[1]
  codes[i,3]<-rpt$T_Physician_ID[1]
  #if (nrow(rpt)>=2){
  codes[i,4]<-rpt$ICDCode[2]
  codes[i,5]<-rpt$T_Physician_ID[2]
  #if (nrow(rpt)>=3) {
  codes[i,6]<-rpt$ICDCode[3]
  codes[i,7]<-rpt$T_Physician_ID[3]
  #if (nrow(rpt)>=4) {
  codes[i,8]<-rpt$ICDCode[4]
  codes[i,9]<-rpt$T_Physician_ID[4]
  #if (nrow(rpt)>=5) {
  codes[i,10]<-rpt$ICDCode[5]
  codes[i,11]<-rpt$T_Physician_ID[5]
  #
}
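
A loop-free possibility, sketched on the toy data from the post: reshape() only needs some within-individual sequence number, and one can be manufactured with ave(). The column name seq is made up here, and the numbering simply follows the order in which rows appear, which is an assumption that suits data with no natural order.

```r
# Toy data from the post, built with data.frame() so MDid stays numeric
data <- data.frame(id   = rep(1:5, 2),
                   COD  = c("A01","A02","A03","A04","A05",
                            "B01","A02","B03","B04","A05"),
                   MDid = c(1:6, 3, 5, 7, 2),
                   stringsAsFactors = FALSE)

# Number the codings within each individual: 1, 2, ... in order of appearance
# ('seq' is a hypothetical helper column standing in for the missing time variable)
data$seq <- ave(seq_along(data$id), data$id, FUN = seq_along)

# Spread COD and MDid into COD.1, MDid.1, COD.2, MDid.2, ...
data.wide <- reshape(data, idvar = "id", timevar = "seq", direction = "wide")
data.wide
```

Individuals coded by fewer physicians than the maximum should end up with NA in the unused columns. Whether this is fast enough on the full data set is untested here.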


[R] Problem with "apply"

2009-04-22 Thread Alan Cohen
Hi R users,

I am trying to assign ages to age classes for a large data set (123,000 
records), and using a for-loop was too slow, so I wrote a function and used 
apply.  However, the function does not properly assign the first two classes 
(the rest are fine).  It appears that when age is one digit, it does not get 
assigned properly.  

I tried to provide a small-scale work-up (at the end of the email) but it does 
not reproduce the problem; the best I can do is to provide my code and the 
output below.  As you can see, I've confirmed that age is numeric, that all 
values are integers, and that pieces of the code work independently.  Any 
thoughts would be appreciated.  

To add to the mystery, depending which rows of my data set I select, I get 
different problems.  mds[1:100,] gives the problem above, as do mds[100:200,] , 
mds[150:250,] and mds[1:10100,].  However, with mds[200:300,], 
mds[250:350,] and mds[1000:1100,], only ages with 3 digits are correctly 
assigned - all ages <100 are returned as NA.

I'm using R v 2.8.1 on Windows XP.

Cheers,
Alan Cohen
Centre for Global Health Research, 
Toronto, ON

> ageassign <- function(x){
+   y <- NA
+   if (x[11] %in% c(0:4)) {y <- "0-4"}
+   else if (x[11] %in% c(5:14)) {y <- "5-14" }
+   else if (x[11] %in% c(15:29)) {y <- "15-29" }
+   else if (x[11] %in% c(30:69)) {y <- "30-69"}
+   else if (x[11] %in% c(70:79)) {y <- "70-79"}
+   else if (x[11] %in% c(80:125)) {y <- "80+"}
+   return(y)
+ }
> jj <- apply(mds[1:100,],1,FUN=ageassign)
> jj
      1       2       3       4       5       6       7       8       9      10      11      12      13
     NA   "80+" "30-69" "30-69"   "80+"      NA "30-69" "30-69" "70-79" "15-29" "15-29" "30-69" "70-79"
     14      15      16      17      18      19      20      21      22      23      24      25      26
  "80+"      NA "30-69" "30-69" "30-69"   "80+"   "80+" "15-29" "70-79" "30-69" "70-79" "70-79" "30-69"
     27      28      29      30      31      32      33      34      35      36      37      38      39
"70-79"   "80+"      NA   "80+" "70-79"      NA "15-29" "15-29"      NA      NA "70-79" "30-69" "30-69"
     40      41      42      43      44      45      46      47      48      49      50      51      52
"70-79" "30-69" "30-69" "30-69" "70-79" "30-69" "30-69" "70-79" "15-29" "30-69"      NA "15-29" "30-69"
     53      54      55      56      57      58      59      60      61      62      63      64      65
"30-69"      NA "70-79" "30-69" "30-69" "30-69" "30-69" "15-29" "30-69" "30-69" "70-79" "30-69"      NA
     66      67      68      69      70      71      72      73      74      75      76      77      78
"30-69" "30-69" "30-69" "30-69" "30-69"   "80+" "30-69"   "80+" "70-79" "30-69" "30-69" "30-69"      NA
     79      80      81      82      83      84      85      86      87      88      89      90      91
"30-69" "30-69" "30-69"      NA   "80+" "30-69" "30-69" "30-69"      NA "15-29" "30-69" "30-69" "30-69"
     92      93      94      95      96      97      98      99     100
"30-69" "30-69" "30-69" "30-69" "70-79" "30-69" "30-69" "30-69" "30-69"
> mds[1:100,11]
  [1]  3 82 40 35 82  1 37 57 71 22 21 52 73 86  1 43 60 63 84 88 29 73 69 75 73 43 75 83  4 83 77  1 27
 [34] 15  1  6 76 51 45 71 54 64 69 70 48 38 74 26 37  4 18 63 59  8 78 63 67 62 50 21 66 69 75 57  4 50
 [67] 58 60 61 62 83 69 92 75 30 49 69  1 69 63 69  0 93 64 59 69  2 25 32 60 66 67 54 53 64 79 59 49 59
[100] 64
> table(mds[,11])

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
3123 6441 3856 2884 1968 1615 1386 1088 1098  721  943  681  511  380  426  835  571  555  719  653
  20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39
 879  715  672  631  655  773  680  713  769  538  685  566  729  702  652  766  683  723  821  675
  40   41   42   43   44   45   46   47   48   49   50   51   52   53   54   55   56   57
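
A likely cause, stated as an assumption because mds itself is not shown: apply() first converts the data frame to a matrix, and if any column of mds is non-numeric the whole matrix becomes character, with numbers padded to a common width. An age of 3 then arrives in the function as " 3", which never matches 0:4, and when a three-digit age is present in the selected rows even two-digit ages are padded. The sketch below reproduces the padding on a made-up data frame (df, sex and age are illustrative names) and shows cut() as a vectorized alternative that avoids apply() altogether.

```r
# Made-up data frame with one character column, to show the conversion
df <- data.frame(sex = c("M", "F", "F"), age = c(3, 82, 40))
padded <- as.matrix(df)[, "age"]
padded                   # character values; the one-digit age is padded to " 3"
padded[1] %in% 0:4       # FALSE: " 3" does not match "3"

# Vectorized classification on the numeric column itself
age <- c(3, 82, 40, 35, 82, 1, 37, 57, 71, 22)   # first ten ages from the post
ageclass <- cut(age,
                breaks = c(0, 5, 15, 30, 70, 80, Inf),
                right  = FALSE,   # intervals are [0,5), [5,15), ...
                labels = c("0-4", "5-14", "15-29", "30-69", "70-79", "80+"))
as.character(ageclass)
```

One difference from the original function: with Inf as the last break, ages above 125 fall into "80+" instead of NA. Replacing Inf with 126 would restore the original upper bound.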

[R] Weighted principal components analysis?

2009-04-03 Thread Alan Cohen
Hello R-ers,

I'm trying to do a weighted principal components analysis.  I couldn't find any 
such option with princomp or prcomp.  Does anyone know of a package or way to 
do this?

More specifically, the observations I'm working with are averages from 
populations of varying sizes.  I thus need to weight the observations by sample 
size.  Ideally I could apply these weights at the cell level (i.e., allowing 
sample size to vary within observations across variables), but even applying 
them just to the observations would get me most of the way there.

I'm using R v2.8.1 on Windows XP.  I've searched Help and the R site and had no 
luck.  Thanks for any help you can provide.

Cheers,
Alan Cohen
Centre for Global Health Research
Toronto, Ontario
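
For weights at the observation level only, one option is to compute a weighted covariance matrix with cov.wt() and take its eigen-decomposition. This is a minimal sketch on simulated data, not an established weighted-PCA routine; x, n, w and scores are made-up names, and cell-level weights are not handled.

```r
set.seed(1)
x <- matrix(rnorm(50 * 3), ncol = 3)      # simulated: 50 population averages, 3 variables
n <- sample(10:200, 50, replace = TRUE)   # simulated sample sizes used as weights
w <- n / sum(n)                           # weights scaled to sum to 1

cw <- cov.wt(x, wt = w)                   # weighted centre and covariance matrix
e  <- eigen(cw$cov, symmetric = TRUE)     # eigenvalues = component variances

# Component scores: centre with the weighted means, then rotate by the loadings
scores <- sweep(x, 2, cw$center) %*% e$vectors

e$values / sum(e$values)                  # proportion of weighted variance per component
```

princomp() also accepts the list returned by cov.wt() through its covmat argument, which gives the loadings and variances but no scores.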


[R] Using apply to get group means

2009-03-31 Thread Alan Cohen
Hi all,

I'm trying to improve my R skills and make my programming more efficient and 
succinct.  I can solve the following question, but wonder if there's a better 
way to do it:

I'm trying to calculate mean by several variables and then put this back into 
the original data set as a new variable.  For example, if I were measuring 
weight, I might want to have each individual's weight, and also the group mean 
by, say, race, sex, and geographic region.  The following code works:

> x1<-rep(c("A","B","C"),3)
> x2<-c(rep(1,3),rep(2,3),1,2,1)
> x3<-c(1,2,3,4,5,6,2,6,4)
> x<-as.data.frame(cbind(x1,x2,x3))
> x3.mean<-rep(0,nrow(x))
> for (i in 1:nrow(x)){
+   x3.mean[i]<-mean(as.numeric(x[,3][x[,1]==x[,1][i]&x[,2]==x[,2][i]]))
+   }  
> cbind(x,x3.mean)
  x1 x2 x3 x3.mean
1  A  1  1 1.5
2  B  1  2 2.0
3  C  1  3 3.5
4  A  2  4 4.0
5  B  2  5 5.5
6  C  2  6 6.0
7  A  1  2 1.5
8  B  2  6 5.5
9  C  1  4 3.5

However, I'd love to be able to do this with "apply" rather than a for-loop.  
Or is there a built-in function? Any suggestions?

Also, any way to avoid the hassles with having to convert to a data frame and 
then again to numeric when one variable is character?

Cheers,
Alan Cohen
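
A built-in function that appears to fit is ave(), which returns group means (its default summary) aligned with the original rows. The sketch below reuses the toy data from the post; building it with data.frame() rather than as.data.frame(cbind(...)) keeps x3 numeric, which sidesteps the conversion hassle mentioned above.

```r
# Same toy data as in the post, but each column keeps its own type
x <- data.frame(x1 = rep(c("A", "B", "C"), 3),
                x2 = c(rep(1, 3), rep(2, 3), 1, 2, 1),
                x3 = c(1, 2, 3, 4, 5, 6, 2, 6, 4))

# Mean of x3 within each combination of x1 and x2, one value per row
x$x3.mean <- ave(x$x3, x$x1, x$x2)
x$x3.mean
# [1] 1.5 2.0 3.5 4.0 5.5 6.0 1.5 5.5 3.5
```

A different summary can be requested by name, for example ave(x$x3, x$x1, x$x2, FUN = median).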


[R] Failure to subset in R v 2.8.0

2008-12-01 Thread Alan Cohen
Hello,

I've been using a pre-release version of R v 2.8.0 for Windows for the last 
couple months.  I think that there have been consistent problems with 
subsetting data sets, but I had usually been able to find work-arounds or was 
unable to confirm this as a bug.  I think now I have, and would love advice on 
what to do if I've made some error.

The data set in question ("c") has 500,000 observations and 44 variables.  The 
problematic variable, "month," takes integer values 1:12, and all are present 
in the data set:

> unique(c$month)
 [1] 11 10  9  8 12  1  7  4  6  2  5  3

However, I can't select observations of c for certain values of month:

> c[c$month==11,]
 [1] STATE        DISTRICT     TALUK        VILLAGE      TYPE         SERIALNO     INTDATE      QH101P
 [9] QH114        QH115A1      QH115B1      QH115C1      QH115A2      QH115B2      QH115C2      QH115A3
[17] QH115B3      QH115C3      QH115A4      QH115B4      QH115C4      QH115A5      QH115B5      QH115C5
[25] QH116        QH117A1      QH117B1      QH117C1      QH117A2      QH117B2      QH117C2      QH117A3
[33] QH117B3      QH117C3      QH117A4      QH117B4      QH117C4      QH117A5      QH117B5      QH117C5
[41] phase        year         month        stdistid.rch
<0 rows> (or 0-length row.names)

I get the same result for c[c[,43]==11,], and 

> length(c$month[c$month==11])
[1] 0

This is true for most values of month (1,2,4,5,7,8,10,11), but the multiples of 
3 work, apparently correctly.

Other variables do not have this problem (the columns shift in the email, but 
these three observations have month=11):

> c[c$STATE==11,][1:3,]
  STATE DISTRICT TALUK VILLAGE TYPE SERIALNO INTDATE QH101P QH114 QH115A1 
QH115B1 QH115C1 QH115A2 QH115B2 QH115C2 QH115A3 QH115B3
87556112 1   1151187  6 0   0   
0   0   0   0   0   0   0
87557112 1   11   101187  3 0   0   
0   0   0   0   0   0   0
87558112 1   11   141187  5 0   0   
0   0   0   0   0   0   0
      QH115C3 QH115A4 QH115B4 QH115C4 QH115A5 QH115B5 QH115C5 QH116 QH117A1
87556       0       0       0       0       0       0       0     0       0
87557       0       0       0       0       0       0       0     0       0
87558       0       0       0       0       0       0       0     0       0
      QH117B1 QH117C1 QH117A2 QH117B2 QH117C2 QH117A3 QH117B3 QH117C3
87556       0       0       0       0       0       0       0       0
87557       0       0       0       0       0       0       0       0
87558       0       0       0       0       0       0       0       0
      QH117A4 QH117B4 QH117C4 QH117A5 QH117B5 QH117C5 phase year month stdistid.rch
87556       0       0       0       0       0       0     1 1998    11         1102
87557       0       0       0       0       0       0     1 1998    11         1102
87558       0       0       0       0       0       0     1 1998    11         1102

The data set is called directly from a csv file, where all variables should be 
stored in the same way, and using as.numeric(as.character(c$month)) does not 
help.  Nor does restarting R, restarting the computer, or trying the operation 
on smaller subsets of c.  I'd appreciate any help you can provide.

Sincerely,
Alan Cohen


[R] Memory limits for large data sets

2008-11-05 Thread Alan Cohen
Hello,

I have several very large data sets (1-7 million observations, sometimes 
hundreds of variables) that I'm trying to work with in R, and memory seems to 
be a big issue.  I'm currently using a 2 GB Windows setup, but might have the 
option to run R on a server remotely.  Windows R seems basically limited to 2 
GB memory if I'm right; is there the possibility to go much beyond that with 
server-based R?  In other words, am I limited by R or by my hardware, and how 
much might R be able to handle if I get the hardware necessary?

Also, any possibility of using web-based R for this kind of thing?

Cheers,
Alan Cohen

Alan Cohen
Post-doctoral Fellow
Centre for Global Health Research
70 Richmond St. East, Suite 202A
Toronto, ON M5C 1N8
Canada
(416) 854-3121 (cell)
(416) 864-6060 ext. 3156 (office)
