Re: [R] How to make this for() loop memory efficient?

2012-01-11 Thread iliketurtles
Ray, your solution works and is indeed faster than mine!

It looks like it's still going to take a few days to do 400,000 rows, which
is unfortunate.

Steve, thanks for your help, I'll definitely self-teach plyr and data.table. 

-


Isaac
Research Assistant
Quantitative Finance Faculty, UTS
--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-make-this-for-loop-memory-efficient-tp4283594p4284716.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to make this for() loop memory efficient?

2012-01-11 Thread Martin Morgan

On 01/11/2012 12:09 AM, iliketurtles wrote:
> Ray, your solution works and is indeed faster than mine!
>
> It looks like it's going to take a few days to do 400,000 rows, still, which
> is unfortunate.
>
> Steve, thanks for your help, I'll definitely self-teach plyr and data.table.


I added a column with the first two digits of the module

  data$XX <- substr(L[,2], 1, 2)

then created a data frame that summarized the first module of each call
and the length of the phone call (the number of modules it went through)

  df <- with(data,
             data.frame(FirstModule=tapply(XX, `phone calls`, `[[`, 1),
                        Length=tapply(XX, `phone calls`, length)))

then summarized the length of the phone calls associated with each module

  with(df, tapply(Length, FirstModule, mean))

resulting in

  > with(df, tapply(Length, FirstModule, mean))
    82   84   92   93   94   96   97
  1.00 2.00 1.75 1.67 1.00 1.22 1.67

Martin
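
A minimal, self-contained sketch of the three steps above, run on a small made-up data set so that it can be pasted as-is. The seven rows and the call IDs c1 to c4 are illustrative assumptions, not part of the original data; the module codes are stored as character strings so that substr() applies directly.

```r
# Toy stand-in for the real data (hypothetical values): one row per module
# visited, with all rows of the same call sharing a call ID.
data <- data.frame(
  calls   = c("c1", "c1", "c2", "c3", "c3", "c3", "c4"),
  modules = c("920010", "960010", "820009", "970050", "930010", "920009", "920010"),
  stringsAsFactors = FALSE
)
colnames(data) <- c("phone calls", "modules")

# Step 1: two-digit prefix of every module code.
data$XX <- substr(data$modules, 1, 2)

# Step 2: per call, the prefix of its first module and its number of rows.
df <- with(data,
           data.frame(FirstModule = tapply(XX, `phone calls`, `[[`, 1),
                      Length      = tapply(XX, `phone calls`, length)))

# Step 3: mean number of modules per call, grouped by first-module prefix.
res <- with(df, tapply(Length, FirstModule, mean))
print(res)
```

Note that tapply() groups rows by call ID wherever they occur, so this relies on each ID belonging to exactly one call.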






--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



[R] How to make this for() loop memory efficient?

2012-01-10 Thread iliketurtles
## I have 2 columns of data. The first column contains event IDs, each unique
## to one phone call made to a customer.
## So, if you see 3 entries together in the first column, as follows:

matrix(c("call1a","call1a","call1a"))

## then this particular phone call (the first call that's logged in the data
## set) was transferred between 3 different modules before the call was
## terminated.

## The second column is a numerical description of the module the call
## started with and the modules it then got transferred to prior to call
## termination. Now I'll construct a representative array of the type of data
## I'm dealing with (the real data set goes on for X00,000s of rows).
## (Ignore how I construct the following array; it's completely unrelated to
## how the actual data set was constructed.)

a<-sapply(1:50,function(i){paste("call",i,sep="",collapse="")})
development.a<-seq(1,40,3)
development.a2<-seq(1,40,5)
a[development.a]<-a[development.a+1]
a[development.a2]<-a[development.a2+1]
a[1:2]<-"call2a";a[3]<-"call3a";a[4:5]<-"call5a";a[6:8]<-"call8a";a[9]<-"call9a"
b<-c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,
     930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,
     930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,
     960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,
     970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
data<-as.data.frame(cbind(a,b))
colnames(data)<-c("phone calls","modules")
dim(data)
print(data[1:10,]) #sample of 10 rows

# Note that in the real data set, data[,2] ranges from 810,000 to 999,999.
# I've been tasked with the following:
# For each phone call that BEGINS with a module denoted by 81 (i.e. of the
# form 81X,XXX), what is the expected number of modules in these calls?
# Then it's the same question for calls beginning with a module denoted by
# 82, 83, 84, ... all the way up to 99.
# I've created code that I think works for this, but I can't actually run it
# on the whole data set. I left it for 30 minutes and it had only completed
# about 5% of the task (I clicked STOP, then checked my output to see if I had
# done it properly, and it seems correct).
# I know the apply() family specializes in vector operations, but I can't
# figure out how to answer the above question in any way other than loops.

L<-data

A<-array(0,dim=c(19,2));rownames(A)<-seq(81,99,1)
A<-data.frame(A)

for(i in 1:(nrow(L)-1))
{
  if(L[(i+1),1]!=L[i,1])
  {
    #aggregate number of modules in the calls that begin with XX (not yet
    #averaged).
    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]<-
    {
      A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]+length(grep(as.character(L[i+1,1]),L[,1],value=FALSE))
    }
    #number of calls that begin with XX.
    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]<-
    {
      A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]+1
    }
  }
}

#If I can get this code to be more memory efficient such that I can run it on
#a 400,000 row data set, I can do, for example,

A[17,1]/A[17,2]

#and I'll arrive at the mean number of modules per call for calls that start
#with a module beginning with 97.

A[17,1]
#is 10, which means that all the calls that started with a module of the form
#97X,XXX went through 10 modules in total.

A[17,2]
#is 6, which means that there were 6 calls in total that began with a 97X,XXX
#module.

#Hence,

A[17,1]/A[17,2]

#is the average number of modules that were executed across all the calls that
#began with a 97X,XXX module.
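
One possible loop-free way to get the same two totals, sketched on a small made-up data set (the IDs c1 to c4 and the seven module codes are hypothetical). It assumes that the rows of a call are consecutive, as described above, and that module codes are six-digit numbers, so that integer division by 10000 yields the two-digit prefix. Unlike the loop, it also counts the very first call in the data, and it counts rows per run of identical IDs rather than via grep().

```r
# Hypothetical toy data: consecutive rows with the same ID form one call.
calls   <- c("c1", "c1", "c2", "c3", "c3", "c3", "c4")
modules <- c(920010, 960010, 820009, 970050, 930010, 920009, 920010)

runs   <- rle(calls)                            # one run per call
starts <- cumsum(c(1, head(runs$lengths, -1)))  # row where each call begins
prefix <- modules[starts] %/% 10000             # first two digits of the first module

total.modules <- tapply(runs$lengths, prefix, sum)     # plays the role of A[, 1]
n.calls       <- tapply(runs$lengths, prefix, length)  # plays the role of A[, 2]
mean.modules  <- total.modules / n.calls
print(mean.modules)
```

Because rle() and tapply() each make a single pass over the data, the work grows roughly linearly with the number of rows, whereas the loop scans the whole ID column once per call.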


-


Isaac
Research Assistant
Quantitative Finance Faculty, UTS
--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-make-this-for-loop-memory-efficient-tp4283594p4283594.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] How to make this for() loop memory efficient?

2012-01-10 Thread Ray Brownrigg

I don't see any need for you to use data frames.

If you make A and data (not a good choice of variable name) just matrices, you
get the same answers at about 10 times the speed (using your example).

Hope this helps,
Ray Brownrigg
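
A rough way to see the effect of element-by-element indexing on a data frame versus a matrix. This is only a sketch: the 2000-row alternating data and the helper count.changes() are made up for illustration, and the timings will differ from machine to machine, so no particular speed-up is claimed here.

```r
# Hypothetical data: 2000 rows whose IDs alternate, so every step is a change.
n  <- 2000
df <- data.frame(id   = rep(c("x", "y"), n / 2),
                 code = as.character(seq_len(n)),
                 stringsAsFactors = FALSE)
m  <- as.matrix(df)   # same contents, stored as a character matrix

# The same kind of element-by-element scan that the loop in this thread performs.
count.changes <- function(L) {
  k <- 0
  for (i in 1:(nrow(L) - 1)) {
    if (L[i + 1, 1] != L[i, 1]) k <- k + 1
  }
  k
}

t.df <- system.time(res.df <- count.changes(df))
t.m  <- system.time(res.m  <- count.changes(m))
print(c(data.frame = t.df[["elapsed"]], matrix = t.m[["elapsed"]]))
```

Both calls return the same count; only the time spent on each `L[i, j]` lookup differs.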



Re: [R] How to make this for() loop memory efficient?

2012-01-10 Thread Steve Lianoglou
I'm having a really difficult time understanding what you're trying to
get -- copying and pasting your code fails to run, and your question
isn't clear, i.e.:

> For each phone call that BEGINS with the module which is denoted by 81
> (i.e. of the form 81X,XXX), what is the expected number of modules in these
> calls?

How does one calculate the expected number of modules in this
module? What does that even mean?

Anyway, here's some code using your `data` data.frame that calculates the
number of unique calls and other statistics on the call id within
each module prefix. I'm using both data.table and plyr ... there are
no for loops.

You will want to do `whatever it is you really want to do` inside the
blocks below.

## R code
library(data.table)
data <- transform(data, module.prefix=substring(modules, 1, 2))

## take a look at `data` now

## calculate stuff inside each module.prefix using data.table
xx <- data.table(data, key="module.prefix")

ans <- xx[, {
  ## the columns of the particular subset of your data.table
  ## are injected into the scope for this expression block
  ## which is where the `calls` variable below comes from
  tabled <- table(as.character(calls))
  list(unique.calls=length(tabled), min=min(tabled),
       median=as.numeric(median(tabled)), max=max(tabled))
  ## you will want to return your own list of stuff
}, by='module.prefix']


## with plyr
library(plyr)
ans <- ddply(data, "module.prefix", function(x) {
  ## `x` is a data.frame whose rows all share the same module.prefix
  ## do whatever you want with it here
  tabled <- table(as.character(x$calls))
  c(unique.calls=length(tabled), min=min(tabled),
    median=median(tabled), max=max(tabled))
})

You'll have to read up on the particulars of data.table and plyr. Both
are really powerful packages ... you should get familiar with at least
one.

plyr is a bit more flexible in some ways.

data.table is a bit more strict (cf. the need for
`as.numeric(median(tabled))`), but also tends to be (much) faster when
working over large datasets.

HTH,
-steve
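
The split-apply-combine pattern that both packages build on can be sketched in base R, which may help when reading their documentation. The per-call table below is made up for illustration (a first-module prefix and a module count for each of four hypothetical calls); it is not derived from the data in this thread.

```r
# Hypothetical per-call summary: first-module prefix and number of modules.
per.call <- data.frame(prefix = c("92", "82", "97", "92"),
                       length = c(2, 1, 3, 1),
                       stringsAsFactors = FALSE)

# Split: one vector of call lengths per prefix.
pieces <- split(per.call$length, per.call$prefix)

# Apply and combine: the same statistics for each piece, stacked into a matrix.
stats <- t(sapply(pieces, function(len) {
  c(n.calls = length(len), min = min(len),
    median = median(len), max = max(len))
}))
print(stats)
```

Each row of `stats` corresponds to one prefix, which is roughly the shape of result that the grouped calls above are meant to produce.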

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] How to make this for() loop memory efficient?

2012-01-10 Thread Ray Brownrigg
On Wed, 11 Jan 2012, Ray Brownrigg wrote:
 
> I don't see any need for you to use data frames.
>
> If you make A and data (not a good use of a variable name) just matrices,
> you get the same answers at about 10 times the speed (using your example).
 
Further, you should calculate your rowname, namely:
paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse="")
only once per loop iteration, instead of 4 times. This saves another 25-30%
CPU time.

And you can combine the two updates into a single assignment.

So using the code:

L <- as.matrix(data)
A <- array(0, dim=c(19, 2)); rownames(A) <- seq(81, 99, 1)
# A <- data.frame(A)

for(i in 1:(nrow(L)-1))
{
  if(L[(i+1),1]!=L[i,1])
  {
    myrow <- paste(strsplit(as.character(L[i+1, 2]), "")[[1]][1:2], sep="",
                   collapse="")
    A[myrow, ] <- A[myrow, ] +
      c(length(grep(as.character(L[i+1, 1]), L[, 1], value=FALSE)), 1)
  }
}

is 15 times as fast as your original code.
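
A possible further step in the same direction, sketched on made-up data: the grep() call rescans the whole ID column for every call, so the row counts per ID could instead be computed once with table() and looked up inside the loop. The toy matrix, the column names, and the use of substr() in place of the strsplit()/paste() combination are assumptions for illustration. Note that table() counts exact matches, whereas grep() also matches IDs that merely contain the pattern, so results can differ where one ID is a prefix of another. As in the loop above, the first call in the data is not counted, because the test only fires on a change of ID.

```r
# Hypothetical toy data as a character matrix (IDs and codes are made up).
L <- cbind(calls   = c("c1", "c1", "c2", "c3", "c3", "c3", "c4"),
           modules = c("920010", "960010", "820009", "970050", "930010", "920009", "920010"))
A <- array(0, dim = c(19, 2),
           dimnames = list(as.character(81:99), c("modules", "calls")))

n.per.call <- table(L[, "calls"])   # rows per call ID, counted once up front

for (i in 1:(nrow(L) - 1)) {
  if (L[i + 1, 1] != L[i, 1]) {
    myrow <- substr(L[i + 1, 2], 1, 2)   # two-digit prefix of the first module
    A[myrow, ] <- A[myrow, ] + c(n.per.call[[L[i + 1, 1]]], 1)
  }
}
print(A[c("82", "92", "97"), ])
```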


Re: [R] How to make this for() loop memory efficient?

2012-01-10 Thread Ray Brownrigg
Steve:

I don't understand why you couldn't get the original code working.  You just
have to notice that one comment overflows its line.

However, I couldn't get your code to match the output of the original: almost,
but not quite!

Ray



Re: [R] How to make this for() loop memory efficient?

2012-01-10 Thread Steve Lianoglou
Let me just reply to myself.

Sorry, it's funny how much I don't get this, but it appears Ray is
following you and provides an answer -- scratch my email, it seems to
be way off

(you should still learn plyr and/or data.table if you haven't yet, tho ;-)

Apologies,
-steve




-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] How to make this for() loop memory efficient?

2012-01-10 Thread Steve Lianoglou
Yeah -- just fired off an apology email before this landed in my inbox.

Sometimes I'm better off not trying to help at all -- this was one of
those cases ;-)

Whatever I was trying to do clearly was going down the wrong trail

Thankfully, you're on top of it though.

Sorry for the spam,
-steve




-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
