Re: [R] Dataframe of factors transform speed?

2007-07-21 Thread jim holtman
The problem is in the way that 'as.data.frame' works.  Use Rprof on a
small list and you will see where it is spending its time.
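
A minimal sketch of that profiling step, assuming a small made-up list (the size n_small and the object names are illustrative only, not from the original data). Rprof() and summaryRprof() are the standard tools; the output depends on your machine and R version:

```r
# Sketch: profile as.data.frame() on a small list of factors.
set.seed(123)
n_small <- 2000   # hypothetical "small" size, chosen only for illustration
small <- lapply(seq_len(n_small), function(i)
  factor(sample(c("AA", "AB", "BB"), 100, replace = TRUE)))
names(small) <- paste("snp", seq_len(n_small), sep = "")

prof_file <- tempfile()
Rprof(prof_file)                  # start the sampling profiler
small_df <- as.data.frame(small)  # the call under investigation
Rprof(NULL)                       # stop profiling

# summaryRprof() can fail if the run was too short to collect any samples,
# so guard the call; increase n_small if this yields NULL
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

The by.self table lists the functions in which the sampler most often found R executing, which is usually enough to see where the conversion spends its time.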

Now, if you are really sure that all your data is consistent with being
a data frame, you can create the data frame structure yourself.  Not
that I would advocate it, but if you look at the output of 'dput' on a
data frame, you can see how to construct your own.

Here it took 20 seconds to create the test data with a list of 50,000
and only 2 seconds to create the data frame from that.

> set.seed(123)
> n <- 50000
> system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA",
+ "AB", "BB"), 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  20.85    0.12   22.83
> names(genoT) = paste("snp", 1:n, sep="")
>
> # create your own data frame structure -- if you are really sure of your data
>
> system.time(genoTz <- structure(genoT, .Names=names(genoT),
+ row.names=c(NA, -length(genoT[[1]])), class='data.frame'))
   user  system elapsed
   2.00    0.08    2.11
> str(genoTz)
'data.frame':   1000 obs. of  50000 variables:
 $ snp1: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp2: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp3: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp4: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp5: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp6: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp7: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp8: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp9: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp10   : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp11   : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
>
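
Before relying on the hand-built structure, it may be worth checking it against as.data.frame on a small sample. This is only a sketch under the assumption that every list element is a factor of the same length; the names by_hand and reference are made up for illustration:

```r
set.seed(123)
n <- 50   # deliberately small so that as.data.frame() is cheap
genoT <- lapply(1:n, function(i)
  factor(sample(c("AA", "AB", "BB"), 1000, prob = c(1000, 1, 1), replace = TRUE)))
names(genoT) <- paste("snp", 1:n, sep = "")

# the shortcut is only safe if all columns have the same length
stopifnot(length(unique(sapply(genoT, length))) == 1)

# build the data frame by hand: a named list plus row.names and class
by_hand <- structure(genoT,
                     row.names = c(NA, -length(genoT[[1]])),
                     class = "data.frame")

# the slow but safe route, for comparison
reference <- as.data.frame(genoT)
all.equal(by_hand, reference)
```

If the comparison holds on a sample, the same construction can be applied to the full list, with the caveat that nothing is validated for you.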



Re: [R] Dataframe of factors transform speed?

2007-07-21 Thread Latchezar Dimitrov
Jim,

No, this is _not_ the problem. If you go to my 1st mail, I have a monster
(at least it was when I purchased it) with 32GB (sic :-) of RAM and 4 dual
core AMD64 285 (the fastest at that time and still pretty fast now :-)

The machine starts paging when I run 2 copies of R working on two things
like that :-). If you look at my last e-mail, I found a solution, but I
still have no clue why the heck x<-as.data.frame(y), where y is a list
of the same columns, takes forever, and this is the thing that killed me
before.

Thanks,
Latchezar

Re: [R] Dataframe of factors transform speed?

2007-07-21 Thread jim holtman
One of the problems is that you are probably paging on your system
with an object that size (240,000 x 1000).  This is about 1GB for a
single object:

> set.seed(123)
> n <- 240000
> system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA",
+ "AB", "BB"), 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  95.00    0.61  104.71
> names(genoT) = paste("snp", 1:n, sep="")
>
> object.size(genoT)
[1] 1045258752
>

I can create it on my 2GB machine as a list, but have problems
converting it to a dataframe because I don't have enough memory.

So unless you have at least 4GB on your system, it might take a long
time.  Look at your performance measurements on your system and see if
you have run out of physical memory and are paging.
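
A rough way to estimate the size without building the whole object is to measure one column and scale up. This is a sketch only: the column count of 240,000 is an assumption based on the sizes discussed in this thread, and the exact byte counts depend on the R version and platform:

```r
set.seed(123)
# one simulated SNP column: a factor of 1000 genotype calls
one_col <- factor(sample(c("AA", "AB", "BB"), 1000,
                         prob = c(1000, 1, 1), replace = TRUE))

bytes_per_col <- as.numeric(object.size(one_col))
n_cols <- 240000   # assumption: roughly the number of columns in question

# estimated size of the full list of columns, in GB
est_gb <- bytes_per_col * n_cols / 1024^3
round(est_gb, 2)
```

Any operation that copies the object needs at least that much again, which is why comparing the estimate with physical RAM is a useful first check.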


Re: [R] Dataframe of factors transform speed?

2007-07-20 Thread Latchezar Dimitrov
Hi,

Thanks for the help. My 1st question is still unanswered though :-) Please
see below

> -Original Message-
> From: Benilton Carvalho [mailto:[EMAIL PROTECTED] 
> Sent: Friday, July 20, 2007 3:30 AM
> To: Latchezar Dimitrov
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] Dataframe of factors transform speed?
> 
> set.seed(123)
> genoT = lapply(1:240000, function(i) factor(sample(c("AA",
> "AB", "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
> names(genoT) = paste("snp", 1:240000, sep="")
> genoT = as.data.frame(genoT)

Now this _is_ the problem. Everything before converting to data.frame
worked almost instantaneously; however, as.data.frame runs forever.
Obviously there is some scalability or memory management issue. When I
tried my own method, but creating a new result dataframe (instead of
modifying the old one), it worked like a charm for the 1st 100 cols: ~0.3s.
I figured 300,000 cols should take ~1000s. Nope! It ran for about
50,000(!)s to finish only about 42,000 cols.

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
geno <- data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),
                   row.names=c(rownames(geno.GASP),rownames(geno.JAG)))

#-- main code; I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000,
#-- i.e., adding 1000 cols to geno each time

system.time(
#   for(j in 1:(ncol(geno.GASP  ))){
for(j in 3001:(4000  )){
   gt.GASP<-geno.GASP[[j]]
   for(l in 1:length(levels(gt.GASP))){
      levels(gt.GASP)[l] <- switch(levels(gt.GASP)[l],AA="0",AB="1",BB="2")
   }
   gt.JAG <-geno.JAG [[j]]
#  for(l in 1:length(levels(gt.JAG))){
#     levels(gt.JAG)[l] <- switch(levels(gt.JAG)[l],AA="0",AB="1",BB="2")
#  }
   geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###   factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
                      ,as.numeric(factor(gt.JAG, levels=0:2))-1
                      )
                    ,levels=0:2
                    )
}
)

Times (each one is for a 1000 cols!):
[1] 26.673  0.032 26.705  0.000  0.000
[1] 77.186  0.037 77.225  0.000  0.000
[1] 128.165   0.042 128.209   0.000   0.000
[1] 180.940   0.047 180.989   0.000   0.000

See the big diff and the scaling I mentioned above?

Furthermore, I removed the geno[[j]] assignment while leaving the operation
itself, i.e., replaced it with the ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have to somehow
get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"
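
The contrast between the two approaches can be sketched on toy data. This is an illustration only, with made-up sizes and the hypothetical helper names grow_df and list_to_df; it shows that both routes give the same result, so the cost difference comes from how the columns are stored along the way, not from the factor conversion itself:

```r
set.seed(1)
n_cols <- 500   # toy size; the real data has far more columns
cols <- lapply(seq_len(n_cols), function(j)
  factor(sample(0:2, 50, replace = TRUE), levels = 0:2))
names(cols) <- paste("snp", seq_len(n_cols), sep = "")

# Route 1: grow a data frame one column at a time (the slow pattern above)
grow_df <- function(cols) {
  df <- data.frame(cols[[1]])
  names(df) <- names(cols)[1]
  for (j in 2:length(cols)) df[[names(cols)[j]]] <- cols[[j]]
  df
}

# Route 2: keep everything in a plain list, convert once at the end
list_to_df <- function(cols) {
  structure(cols,
            row.names = c(NA, -length(cols[[1]])),
            class = "data.frame")
}

t_grow <- system.time(a <- grow_df(cols))
t_list <- system.time(b <- list_to_df(cols))
all.equal(a, b)
```

Comparing t_grow and t_list at a few values of n_cols is a cheap way to see how each route scales on a given R version.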

Here is my final solution:

> system.time({
+ geno.GASP.L<-lapply(geno.GASP
+                    ,function(x){
+                       for(l in 1:length(levels(x))){levels(x)[l] <- switch(levels(x)[l],AA="0",AB="1",BB="2")}
+                       factor(x,levels=0:2)
+                     }
+                    )
+ geno.JAG.L <-lapply(geno.JAG
+                    ,function(x){
+ #                     for(l in 1:length(levels(x))){levels(x)[l] <- switch(levels(x)[l],AA="0",AB="1",BB="2")}
+                       factor(x,levels=0:2)
+                     }
+                    )
+ })
[1] 192.800   1.566 194.413   0.000   0.000   ! :-)
> system.time({
+ class(geno.GASP.L)<-"data.frame"
+ row.names(geno.GASP.L)<-row.names(geno.GASP)
+ class(geno.JAG.L )<-"data.frame"
+ row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156  0.001 12.155  0.000  0.000
> system.time({
+ geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340    9.072 2066.310    0.000    0.000

I logged my notes here as I was trying various things. Part of the reason
is my two questions:

"What was wrong with me?" and
"What the heck?!" (remember above? :-)))

which still remain unanswered :-(

I would have had a lot of fun if I did not have to have this done by ...
yesterday :-))

Thanks a lot for the help

Latchezar  


Re: [R] Dataframe of factors transform speed?

2007-07-20 Thread Charles C. Berry
On Thu, 19 Jul 2007, Latchezar Dimitrov wrote:

> Hello,
>
> This is a speed question. I have a dataframe genoT:
>
>> dim(genoT)
> [1]   1002 238304

It looks like these are all numeric originally. Handling these as a
vector or matrix will speed things up a bit. You can then stitch
together a data.frame:

# simulate:
#   genoT.names <- scan('data.file', what='a', nlines=1)
#   genoT <- scan('data.file',skip=1)
#
>
> genoT <- sample(0:2, 24*1002, repl=T)
> t1 <- proc.time()
> genoT <- factor(genoT,0:2,c("AA","AB","BB"))
> dim(genoT) <- c(1002,24)
> genoT.list <- lapply(1:24, function(x) genoT[,x])
> # simulate: names(genoT.list) <- genoT.names :
> names(genoT.list) <- make.names(1:24)
> class(genoT.list) <- "data.frame"
> row.names(genoT.list) <- 1:1002
> proc.time()-t1
   user  system elapsed
 20.978   2.036  49.714
>

Most of the _elapsed_ time is due to lags in copying and pasting the
commands.

HTH,

Chuck
>
>> str(genoT)
> 'data.frame':   1002 obs. of  238304 variables:
> $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
>
> Its columns are factors with different numbers of levels (from 1 to 3;
> that's what I got from read.table, i.e., it dropped missing levels). I
> want to convert them to uniform factors with 3 levels. The 1st 10 rows
> above show already converted columns and the rest are not yet converted.
> Here's my attempt, which is a complete failure speed-wise:
>
>> system.time(
> + for(j in 1:(10 )){ #-- this is to try 1st 10 cols and measure the time; it otherwise is ncol(genoT) instead of 10
> +    gt<-genoT[[j]]  #-- this is to avoid 2D indices
> +    for(l in 1:length(levels(gt))){
> +      levels(gt)[l] <- switch(levels(gt)[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> +      genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
> +    }
> + }
> + )
> [1] 785.085   4.358 789.454   0.000   0.000
>
> 789s for 10 columns only!
>
> To me it seems like replacing 10 x 3 levels and then making a factor of
> 1002 element vector x 10 is a "negligible" amount of operations needed.
>
> So, what's wrong with me? Any idea how to accelerate significantly the
> transformation or (to go to the very beginning) to make read.table use a
> fixed set of levels ("AA","AB", and "BB") and not to drop any (missing)
> level?
>
> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
>
> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
> it.
>
> Thank you very much for the help,
>
> Latchezar Dimitrov,
> Analyst/Programmer IV,
> Wake Forest University School of Medicine,
> Winston-Salem, North Carolina, USA
>
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                        (858) 534-2098
 Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]  UC San Diego
http

Re: [R] Dataframe of factors transform speed?

2007-07-20 Thread Benilton Carvalho
set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB",
"BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="")
genoT = as.data.frame(genoT)
dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB",
"BB"))-1))
##
##
   user  system elapsed
119.288   0.004 119.339

(for all 240K)

best,
b

ps: note that "out" is a list.
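
A small illustration of why match() against a fixed vector of genotype labels is safer than the factors' internal codes when levels have been dropped. The two toy factors below are made up for the example:

```r
codes <- c("AA", "AB", "BB")

# two columns that each lost a different level when they were read in
a <- factor(c("AA", "BB", "BB"))   # levels: AA, BB
b <- factor(c("AB", "BB", "AB"))   # levels: AB, BB

# internal codes depend on which levels happen to be present
int_a <- as.integer(a)
int_b <- as.integer(b)

# match() compares the labels themselves, so the coding is uniform
num_a <- match(a, codes) - 1
num_b <- match(b, codes) - 1
```

Here int_a and int_b both use the code 1 for different genotypes, while num_a and num_b always map AA, AB, BB to 0, 1, 2.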


Re: [R] Dataframe of factors transform speed?

2007-07-19 Thread Latchezar Dimitrov
Hi,

> -Original Message-
> From: Benilton Carvalho [mailto:[EMAIL PROTECTED] 
> Sent: Friday, July 20, 2007 12:25 AM
> To: Latchezar Dimitrov
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] Dataframe of factors transform speed?
> 
> it looks like that whatever method you used to genotype the 
> 1002 samples on the STY array gave you a transposed matrix of 
> genotype calls. :-)

It only looks like it :-)

Otherwise it is a correctly created dataframe of 1002 samples X (big
number) of columns (SNP genotypes). It worked perfectly until I decided
to put together two cohorts independently processed in R already. I got
stuck because of my lack of foresight. Otherwise I would have put 3 dummy
lines w/ AA, AB, and BB on each one to make sure all 3 genotypes are
present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately here but the
dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-) - as
str says 1002 observ. of (big number) vars.

> 
> i'd use:
> 
> genoT = read.table(yourFile, stringsAsFactors = FALSE)
> 
> as a starting point... but I don't think that would be 
> efficient (as you'd need to fix one column at a time - lapply).

No, it was not efficient at all. As a matter of fact, nothing is more
efficient than loading already read data, alas :-(

> 
> i'd preprocess yourFile before trying to load it:
> 
> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
> 
> and, now, in R:
> 
> genoT = read.table(outFile, header=TRUE)

... Too late ;-) As must be clear by now, I have two dataframes I want to
put together with rbind(geno1,geno2). The issue again is
"uniformization" of factor variables w/ missing levels: they ended up
with, e.g., levels AA,BB on one of them and levels AB,BB on the other, which
means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on the
second: a complete mess. That's why I tried to make both uniform, i.e.,
levels "AA","AB", and "BB" for every SNP, and then rbind works.
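
On toy data the idea looks like the sketch below. The two one-column frames and the helper name fix_levels are hypothetical and only stand in for the real cohorts:

```r
codes <- c("AA", "AB", "BB")

# two toy cohorts whose SNP column lost different levels
geno1 <- data.frame(snp1 = factor(c("AA", "BB")))   # levels: AA, BB
geno2 <- data.frame(snp1 = factor(c("AB", "BB")))   # levels: AB, BB

# give every column the same full set of levels, matching by label
fix_levels <- function(df) {
  df[] <- lapply(df, factor, levels = codes)
  df
}

both <- rbind(fix_levels(geno1), fix_levels(geno2))
levels(both$snp1)
as.integer(both$snp1)
```

After the fix the integer codes mean the same thing in both cohorts, so downstream as.numeric() conversions are consistent.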

In any case, my 1st question remains: "What's wrong with me?" :-)

Thanks,
Latchezar


Re: [R] Dataframe of factors transform speed?

2007-07-19 Thread jim holtman
Is this what you want?  It took 0.01 seconds to convert 20 columns of the
test data:

> # create some data (1000 rows with 20 columns)
> n <- 20
> result <- list()
> vals <- c("AA", "AB", "BB")
> for (i in 1:n){
+ result[[as.character(i)]] <- sample(vals,1000, replace=TRUE,
+ prob=c(9000,1,1))
+ }
> result.df <- do.call('data.frame', result)
>
>
> str(result.df)
'data.frame':   1000 obs. of  20 variables:
 $ X1 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X2 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X3 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X4 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X5 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ X6 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X7 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X8 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X9 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X10: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X11: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X12: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X13: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X14: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X15: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X16: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X17: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X18: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X19: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X20: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
>
> # go through each column and convert the factors according to 'vals' above
> system.time({  # time to convert 20 columns
+ x <- lapply(result.df, function(facts){
+ factor(match(as.character(facts), vals) - 1, levels=0:2)
+ })
+ result.df <- do.call('data.frame', x)
+ })
   user  system elapsed
   0.01    0.00    0.01
>
> str(result.df)
'data.frame':   1000 obs. of  20 variables:
 $ X1 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X2 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X3 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X4 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X5 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X6 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X7 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X8 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X9 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X10: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X11: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X12: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X13: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X14: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X15: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X16: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X17: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X18: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X19: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X20: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
>
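
For what it's worth, the match() step above can probably be folded into the factor() call itself by using its levels= and labels= arguments. This is only a minimal sketch on toy data: it assumes 'vals' is c("AA", "AB", "BB") as in the session above, and 'toy' is a made-up stand-in for result.df. Note that a column that already holds "0","1","2" would come out as all NA here, so it only applies to the unconverted columns.

```r
# Assumption: 'vals' holds the three genotype calls in this order,
# matching its use in the session above.
vals <- c("AA", "AB", "BB")

# Toy stand-in for result.df: columns with 1, 2 and 3 observed levels.
toy <- data.frame(
  X1 = factor(c("AA", "AA", "AA")),
  X2 = factor(c("AA", "AB", NA)),
  X3 = factor(c("BB", "AB", "AA"))
)

# levels= fixes the full set of calls; labels= renames them to "0","1","2".
# Assigning into toy[] keeps the data.frame instead of rebuilding it.
toy[] <- lapply(toy, function(facts) {
  factor(as.character(facts), levels = vals, labels = 0:2)
})

str(toy)
```

Every column should end up with the same three levels, and missing calls stay NA.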


On 7/19/07, Latchezar Dimitrov <[EMAIL PROTECTED]> wrote:
> Hello,
>
> This is a speed question. I have a dataframe genoT:
>
> > dim(genoT)
> [1]   1002 238304
>
> > str(genoT)
> 'data.frame':   1002 obs. of  238304 variables:
>  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
> ...
>  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2
> ...
>  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1
> ...
>  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
> ...
>  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1
> ...
>  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1
> ...
>  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
> ...
>  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2
> ...
>  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
> ...
>  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3
> ...
>  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3
> ...
>  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3
> ...
>  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2
> ...
>  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2
> ...
>  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1
> ...
>  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1
> ...
>  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2
> ...
>  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB"

Re: [R] Dataframe of factors transform speed?

2007-07-19 Thread Benilton Carvalho
it looks like whatever method you used to genotype the 1002
samples on the STY array gave you a transposed matrix of genotype
calls. :-)

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be efficient (as  
you'd need to fix one column at a time - lapply).
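
A rough sketch of that read-as-character route, in case it helps. The file contents below are made up, and 'calls', 'raw' and 'tmp' are just illustrative names. It reads everything as character via colClasses and then builds each factor with a fixed set of levels, so no level gets dropped. I have not timed it on anything close to 238304 columns.

```r
# Toy stand-in for the genotype file (hypothetical contents).
tmp <- tempfile(fileext = ".txt")
writeLines(c("snp1 snp2 snp3",
             "AA AB AA",
             "AA BB AA",
             "AB NA AA"), tmp)

# Read every column as character so read.table builds no factors at all.
raw <- read.table(tmp, header = TRUE, colClasses = "character")

# Fixed set of levels, applied to each column in one pass.
calls <- c("AA", "AB", "BB")
genoT <- raw
genoT[] <- lapply(raw, factor, levels = calls)

str(genoT)
unlink(tmp)
```

Here snp3 only ever contains "AA" but still ends up as a 3-level factor.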

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)

b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

> Hello,
>
> This is a speed question. I have a dataframe genoT:
>
>> dim(genoT)
> [1]   1002 238304
>
>> str(genoT)
> 'data.frame':   1002 obs. of  238304 variables:
>  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
> ...
>  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2
> ...
>  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1
> ...
>  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
> ...
>  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1
> ...
>  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1  
> 2 1
> ...
>  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
> ...
>  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2
> ...
>  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
> ...
>  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3
> ...
>  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1  
> 2 2 3
> ...
>  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3  
> 3 3 3
> ...
>  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2  
> 2 2 2
> ...
>  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1  
> 1 ...
>  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2  
> 1 1 2
> ...
>  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1  
> 1 1 1
> ...
>  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1  
> 1 1 1
> ...
>  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1  
> 1 ...
>  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1  
> 1 1 2
> ...
>  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1  
> NA 2
> 1 ...
>  $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3  
> 1 1 1
> ...
>  $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2  
> 2 ...
>  $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1  
> 1 ...
>  $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1  
> 2 2 1
> ...
>
> Its columns are factors with different number of levels (from 1 to 3 -
> that's what I got from read.table, i.e., it dropped missing levels). I
> want to convert it to uniform factors with 3 levels. The 1st 10 rows
> above show already converted columns and the rest are not yet  
> converted.
> Here's my attempt, which is a complete failure in terms of speed:
>
>> system.time(
> + for(j in 1:(10 )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
> +   gt<-genoT[[j]]  #-- this is to avoid 2D indices
> +   for(l in 1:length([EMAIL PROTECTED])){
> +     levels(gt)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> +     genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
> +   }
> + }
> + )
> [1] 785.085   4.358 789.454   0.000   0.000
>
> 789s for 10 columns only!
>
> To me it seems like replacing 10 x 3 levels and then making a  
> factor of
> 1002 element vector x 10 is a "negligible" amount of operations  
> needed.
>
> So, what's wrong with me? Any idea how to accelerate significantly the
> transformation or (to go to the very beginning) to make read.table  
> use a
> fixed set of levels ("AA","AB", and "BB") and not to drop any  
> (missing)
> level?
>
> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
>
> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
> it.
>
> Thank you very much for the help,
>
> Latchezar Dimitrov,
> Analyst/Programmer IV,
> Wake Forest University School of Medicine,
> Winston-Salem, North Carolina, USA
>
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



[R] Dataframe of factors transform speed?

2007-07-19 Thread Latchezar Dimitrov
Hello,

This is a speed question. I have a dataframe genoT: 

> dim(genoT)
[1]   1002 238304

> str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1
...

Its columns are factors with different number of levels (from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing levels). I
want to convert it to uniform factors with 3 levels. The 1st 10 rows
above show already converted columns and the rest are not yet converted.
Here's my attempt, which is a complete failure in terms of speed:

> system.time(
+ for(j in 1:(10 )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
+   gt<-genoT[[j]]  #-- this is to avoid 2D indices
+   for(l in 1:length([EMAIL PROTECTED])){
+     levels(gt)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
+     genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
+   }
+ }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making a factor of
1002 element vector x 10 is a "negligible" amount of operations needed.
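
To make that operation count concrete, here is a minimal sketch of the same relabelling done once per column, with no inner loop over levels and a single assignment back into the data frame. The names 'code_map', 'recode_levels' and the toy 'df' are made up for illustration, and I am assuming the only calls present are "AA", "AB", "BB" (or the already converted "0", "1", "2").

```r
# Hypothetical lookup table from genotype call to code.
code_map <- c(AA = "0", AB = "1", BB = "2")

recode_levels <- function(f) {
  lv <- levels(f)
  known <- lv %in% names(code_map)
  # Rename only the levels that are genotype calls; "0","1","2" stay as they are.
  lv[known] <- unname(code_map[lv[known]])
  levels(f) <- lv
  # Pad to the full, uniform set of three levels.
  factor(f, levels = c("0", "1", "2"))
}

# Toy data: one column per case (2 levels, 1 level plus NA, already converted).
df <- data.frame(
  s1 = factor(c("AA", "AB", "AA")),
  s2 = factor(c("BB", "BB", NA)),
  s3 = factor(c("0", "2", "1"))
)

# One assignment for all columns instead of one genoT[[j]] <- per level.
df[] <- lapply(df, recode_levels)
str(df)
```

Only the (at most three) level strings per column are touched, not the 1002 values, which is the "negligible" amount of work described above.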

So, what's wrong with me? Any idea how to accelerate significantly the
transformation or (to go to the very beginning) to make read.table use a
fixed set of levels ("AA","AB", and "BB") and not to drop any (missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North Carolina, USA
