[R] Combining multiple probability weights for the sample() function.

2015-06-02 Thread Benjamin Ward (ENV)
Dear R-List,

I have a set of possibilities I want to sample from:

bases <- list(c('A', 'C'), c('A', 'G'), c('C', 'T'))
possibilities <- as.matrix(expand.grid(bases))

possibilities
     Var1 Var2 Var3
[1,] A    A    C
[2,] C    A    C
[3,] A    G    C
[4,] C    G    C
[5,] A    A    T
[6,] C    A    T
[7,] A    G    T
[8,] C    G    T

Suppose I randomly sample one of these rows. If I do this, I find that it is 
25% likely that my choice will have an identical first and second letter (e.g. 
[1,] A  A  C). It is also 25% likely that my choice will have an 
identical first and third letter (e.g. [4,] C  G  C). It is not possible 
for the second and third letters of my choice to be identical.

What I would like to do, is sample one of the rows, but given the constraint 
that the probability of drawing identical letters 1 and 2 should be 50% or 0.5, 
and at the same time the probability of drawing identical letters 1 and 3 
should be 50%. I am unsure on how to do this, but I know it involves coming up 
with a modified set of weights for the sample() function. My progress is below, 
any advice is much appreciated.

Best Wishes,

Ben Ward, UEA.


So I have used the following code to come up with a matrix, which contains 
weightings according to each criterion:

possibilities <- as.matrix(expand.grid(bases))
identities <- apply(possibilities, 1, function(x) c(x[1] == x[2], x[1] == x[3], x[2] == x[3]))
prob <- matrix(rep(0, length(identities)), ncol = ncol(identities))
consProb <- apply(identities, 1, function(x){0.5 / length(which(x))})
polProb <- apply(identities, 1, function(x){0.5 / length(which(!x))})
for(i in 1:nrow(identities)){
  prob[i, which(identities[i,])] <- consProb[i]
  prob[i, which(!identities[i,])] <- polProb[i]
}
rownames(prob) <- c("1==2", "1==3", "2==3")
colnames(prob) <- apply(possibilities, 1, function(x) paste(x, collapse = ", "))

This code gives the following matrix:

     A, A, C C, A, C A, G, C C, G, C A, A, T C, A, T A, G, T C, G, T
1==2  0.2500  0.0833  0.0833  0.0833  0.2500  0.0833  0.0833  0.0833
1==3  0.0833  0.2500  0.0833  0.2500  0.0833  0.0833  0.0833  0.0833
2==3  0.0625  0.0625  0.0625  0.0625  0.0625  0.0625  0.0625  0.0625

Each column is one of the choices from 'possibilities', and each row gives a 
series of weights based on one of three criteria:

Row 1: if letter 1 == letter 2 is possible among the choices, the combined 
chance of those choices should be 50%.
Row 2: if letter 1 == letter 3 is possible among the choices, the combined 
chance of those choices should be 50%.
Row 3: if letter 2 == letter 3 is possible among the choices, the combined 
chance of those choices should be 50%.

So:

If I used sample(x = 1:nrow(possibilities), size = 1, prob = prob[1,]) 
repeatedly, I expect about half the choices to contain identical letters 1 and 
2.

If I used sample(x = 1:nrow(possibilities), size = 1, prob = prob[2,]) 
repeatedly, I expect about half the choices to contain identical letters 1 and 
3.

If I used sample(x = 1:nrow(possibilities), size = 1, prob = prob[3,]) 
repeatedly, I would expect about half the choices to contain identical letters 
2 and 3, except that in this case it is not possible, so every choice is 
equally likely.

Note that rows 1 and 2 each sum to 1. Row 3 sums to 0.5 because no choice 
satisfies its criterion, but sample() rescales the weights it is given.

What I would like to do, if it is possible, is combine these three sets of 
weights into one set, that when used with
sample(x = 1:nrow(possibilities), size = 1, prob = MAGICPROB) will give me a 
list of choices, where ~50% of them contain identical letters 1 and 2, AND ~50% 
of them contain identical letters 1 and 3, AND ~50% again contain identical 
letters 2 and 3 (except in this example as it is not possible from the choices).

Can multiple probability weightings be combined in such a manner?
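
For this particular example a combined set of weights can be written down 
directly, because no row satisfies both criteria at once and the two 50% 
targets add up to 100%. The sketch below is not from the thread; the names 
eq12, eq13 and magic_prob are made up for illustration. In the general case, 
where criteria overlap or the targets cannot all be met together, the weights 
would have to come from solving a system of linear constraints, which this 
sketch does not attempt.

```r
bases <- list(c("A", "C"), c("A", "G"), c("C", "T"))
possibilities <- as.matrix(expand.grid(bases, stringsAsFactors = FALSE))

# Which rows satisfy each criterion
eq12 <- possibilities[, 1] == possibilities[, 2]
eq13 <- possibilities[, 1] == possibilities[, 3]

# Assumption that makes this shortcut valid: the two sets of rows do not overlap
stopifnot(!any(eq12 & eq13))

# Split 0.5 evenly over each set; every other row gets weight 0
magic_prob <- numeric(nrow(possibilities))
magic_prob[eq12] <- 0.5 / sum(eq12)
magic_prob[eq13] <- 0.5 / sum(eq13)

# Draw many rows and look at the observed proportions
set.seed(1)
draws <- sample(seq_len(nrow(possibilities)), size = 10000,
                replace = TRUE, prob = magic_prob)
prop12 <- mean(eq12[draws])
prop13 <- mean(eq13[draws])
```

Both prop12 and prop13 should come out close to 0.5. The price is that rows 
satisfying neither criterion are never drawn, which may or may not be 
acceptable.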





__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Two x axes - top and bottom

2014-02-08 Thread Benjamin Ward (ENV)
Hi, fellow R users,

I've been asked to make a plot with two datasets, each with a different x axis, 
and it's been suggested one be at the top and the other at the bottom of the 
graph. I normally use ggplot2, and I know how to plot multiple datasets by 
simply adding a new geom with a different data option, but usually in these 
cases my different datasets have had the same x and y axes. Can I add a new x 
axis to the top of the plot in ggplot2 or one of the other graphics packages?
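
One possible approach, sketched here with base graphics rather than ggplot2, 
is to draw the first dataset normally, overlay the second with its axes 
suppressed, and then add an axis on side 3 (the top). The data frames d1 and d2 
are invented for illustration, and the sketch assumes both datasets share the 
same y scale.

```r
set.seed(1)
# Two made-up datasets with very different x ranges
d1 <- data.frame(x = 1:10, y = cumsum(rnorm(10)))
d2 <- data.frame(x = seq(100, 1000, length.out = 10), y = cumsum(rnorm(10)))

# Common y limits so both lines are drawn against the same left axis
ylim <- range(d1$y, d2$y)

op <- par(mar = c(5, 4, 4, 2) + 0.1)
plot(d1$x, d1$y, type = "l", col = "blue", ylim = ylim,
     xlab = "x (dataset 1)", ylab = "y")

# Overlay the second dataset in its own x coordinates, without axes
par(new = TRUE)
plot(d2$x, d2$y, type = "l", col = "red", ylim = ylim,
     axes = FALSE, xlab = "", ylab = "")

# Draw the second x axis along the top and label it
axis(side = 3)
mtext("x (dataset 2)", side = 3, line = 2.5)
par(op)
```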

Thanks,
Ben W.





[R] Change points in R

2014-01-30 Thread Benjamin Ward (ENV)
Hi R helpers,

I have a set of data best shown in the graph below.

Each coloured line represents a statistic calculated across pairs of DNA 
sequences. For each coloured line, I would like to identify breakpoints, 
that is, identify the chunks where the values are high. For example, in the 
light blue line, there is a large high segment just after x=2e+05. From 
googling how to find such points, I've read about something called change-point 
analysis, used with time series data, and I wondered if it or a variant of it 
in R might be of use here. The data is a series of % values (double), all from 
a single measurement, i.e. for each line, a 'scanner' passed over two sequences 
and at each step recorded the % value. Can change-point analysis help me here, 
and if so what package or method will allow me to do this while making as few 
assumptions about my data as possible?
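
To make the idea concrete, here is a minimal base-R sketch of the simplest 
kind of change-point search: it looks for the single position at which 
splitting the series in two gives the smallest total within-segment sum of 
squares. The function name find_single_changepoint and the simulated data are 
invented for illustration. It assumes one shift in the mean and nothing else, 
so it is a toy and not a substitute for dedicated packages (the changepoint 
package on CRAN is one that could be checked).

```r
# Return the index k that best splits y into y[1:k] and y[(k + 1):n],
# judged by the total within-segment sum of squared deviations from the mean
find_single_changepoint <- function(y) {
  n <- length(y)
  stopifnot(n >= 2)
  cost <- vapply(seq_len(n - 1), function(k) {
    left <- y[seq_len(k)]
    right <- y[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  which.min(cost)
}

# Simulated % values: a low stretch of 100 points followed by a high stretch of 50
set.seed(42)
y <- c(rnorm(100, mean = 10), rnorm(50, mean = 30))
cp <- find_single_changepoint(y)
```

Applying the same search again within each resulting segment would give a 
crude way to find several breakpoints per line.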

Thanks in advance,

Ben W.

[Graph attachment not preserved in the archive]



Re: [R] Small p from binomial probability function.

2013-10-16 Thread Benjamin Ward (ENV)
Hi,

Thanks again for your answers. Just so I can be clear about what is happening: 
with the uniroot method, I'm defining a function in which the binomial 
probability function pbinom is present, but in addition p0 is subtracted from 
the result. In this case p0 is the large P I want to plug in, so 0.05, 0.50 and 
0.95, or even just 0.05 and 0.95? Then uniroot finds the root of this function, 
and in doing so finds me the small p I need?
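
A minimal sketch of that uniroot idea follows. The values of N, x and u and the 
helper name solve_small_p are hypothetical, chosen only for illustration. It 
relies on pbinom(x, N, p) falling from 1 to 0 as p goes from 0 to 1 (for x < N), 
so the function crosses zero exactly once on the interval.

```r
# Hypothetical inputs: N sites, x observed mutations,
# u = mutation rate per site per unit time
N <- 1000
x <- 10
u <- 1e-5

# Solve pbinom(x, N, p) = p0 for the small p
solve_small_p <- function(p0, x, N) {
  f <- function(p) pbinom(x, N, p) - p0
  uniroot(f, interval = c(0, 1), tol = 1e-12)$root
}

p0 <- c(0.05, 0.50, 0.95)
small_p <- sapply(p0, solve_small_p, x = x, N = N)

# Since p = t * u, dividing by the mutation rate gives the time estimates
t_est <- small_p / u
```

Because the cumulative probability decreases as p grows, the largest p0 
corresponds to the smallest p and hence the smallest t.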

Best,
Ben.



From: Rolf Turner [rolf.tur...@vodafone.co.nz]
Sent: 11 October 2013 02:11
To: Benjamin Ward (ENV)
Cc: Stefan Evert; R-help Mailing List
Subject: Re: [R] Small p from binomial probability function.

It is mysterious to me why the procedure proposed by Stefan Evert works.
It appears to work --- once you modify the call to binom.test() to have the
correct syntax.  In a sequence of 1000  trials with random values of N, x,
and p0, the answers from Evert's procedure agreed with the answer given
by uniroot() to within +/- 3.045e-05.

However your question was (in effect) how to solve the equation

 Pr(X <= x) = p0

for p, where X ~ Binom(N,p), with N and x known.  What this has to do with
confidence intervals for p is, to my mind at least, completely opaque.
In contrast it is obvious why the procedure using uniroot() works.

I would suggest that you stick with the uniroot() procedure in that it is
readily comprehensible.

 cheers,

 Rolf Turner


Re: [R] Small p from binomial probability function.

2013-10-10 Thread Benjamin Ward (ENV)
Hi,

Thank you for your answers. I'm not completely sure whether it's binom.test I 
need or uniroot. Perhaps I should explain more about the idea behind the code 
and the actual task I'm trying to do. The idea is to calculate a confidence 
interval for the age of two DNA sequences which have diverged, where I know the 
number of mutations that happened in them, and I know the mutation rate.

The binomial probability can be used since mutations have a probability of 
occurring or being observed so many times in a sequence. This is dependent on 
the length of the DNA stretch (which equates to the number of trials, since each 
base is a possibility of observing a mutation) and the probability of a single 
mutation occurring, which is p = t * u, since more time means a higher 
probability a mutation may have occurred.

So my code, using pbinom, is supposed to calculate the probability that my DNA 
stretches contain the number of mutations observed P(X = k), given their size 
(trials) and the probability of a single mutation (p = t * u). However I'm 
interested in finding t: t is what is unknown, so the loop repeatedly evaluates 
the calculation, increasing t each time and checking P(X=k). When it is 0.05, 
0.50 and 0.95, we record t.

Ideally I'd like to rearrange this so I can get the probability of a single 
success (mutation) p, and then divide by the mutation rate to get my t. My 
supervisor gave me the loopy code but I imagine there is a way to plug in 
P(X=k) as 0.05 and 0.95 and get my upper and lower t estimates.

According to the R built in docs:

binom.test
Description:

 Performs an exact test of a simple null hypothesis about the
 probability of success in a Bernoulli experiment.

Perhaps this is the one I need rather than uniroot?

Best,
Ben.



From: Stefan Evert [stefa...@collocations.de]
Sent: 10 October 2013 09:37
To: R-help Mailing List
Cc: Benjamin Ward (ENV)
Subject: Re: [R] Small p from binomial probability function.

Sounds like you want a 95% binomial confidence interval:

binom.test(N, P)

will compute this for you, and you can get the bounds directly with

binom.test(N, P)$conf.int

Actually, binom.test computes a two-sided confidence interval, which 
corresponds roughly to 2.5 and 97.5 percentages in your approach. It doesn't 
give you the 50% point either, but I don't think that's a meaningful quantity 
with a two-sided test.

Hope this helps,
Stefan






[R] Small p from binomial probability function.

2013-10-09 Thread Benjamin Ward (ENV)
Hi,

I got given some code that uses the R function pbinom:

p <- mut * t
sumprobs <- pbinom( N, B, p ) * 100

Which gives the output of a probability as a percentage like 5, 50, 95.

What the code currently does is find me the values of t I need, by using the 
above two code lines in a loop. Each iteration it increases t by one and runs 
the two lines. When sumprobs equals 5, it records the value t, then again when 
sumprobs is equal to 50, and again when sumprobs is equal to 95, giving me 
three t values. This is not an efficient way of doing this if t is large. Is it 
possible to rearrange pbinom so it gives me the small p (made of mut*t) as the 
result of plugging in the sumprobs instead, and is there an R function that 
already does this?

Since pbinom is the binomial probability equation I suppose the question is, 
in more mathematical terminology: instead of calculating the probability of N 
successes given the number of trials and the probability of a single success, 
can I calculate the probability of a single success using the probability of N 
successes, the number of trials, and the number of successes? Can R do this for 
me, so that I plug in 5, 50, and 95, and then get the small p out?

Thanks,
Ben W.







[R] Building a Package using gWidgets

2013-02-14 Thread Benjamin Ward (ENV)
Hi,

For the past few months I've been building a simulation in R I hope to package. 
It consists of two usable functions, and many internal ones which one of the 
two usable functions call while looping, to perform the stages of simulation.

A simple conceptual example is:

# Abstract representation of 1st usable function: generates object containing 
# settings for simulation.
settings <- function(){
  matrix(c(1:4),nrow=2,ncol=2)
}

# Abstract representation of one of many internal functions: does some action 
# in simulation.
int.func <- function(x){
  out <- x*2
  return(out)
}

# Abstract representation of second usable function: takes settings, invokes 
# internal functions, generates results and saves them as R object files.
sim.func <- function(x){
  ans <- int.func(x)
  ans2 <- ans-2
  save(ans2, file="outtest")
}

With my package so far, using it is done like so after loading and attaching 
the package with library():

INPUT <- settings()
fix(INPUT) # If you want to change from the defaults.
sim.func(INPUT)

Nothing needs returning from the simulation because it gets saved as an object 
to be read in afterwards.

Now I'd like it to have a simple GUI which can be used in conjunction with the 
command line (think Rcmdr but much simpler), to allow my co-workers who have 
never touched R to use it.
The gui needs to be able to edit the settings, like with the fix command 
above, to save a settings object to file, and to read in settings from an 
object file. I've built this with gWidgets:

gui <- function(){
  INPUT <- matrix(c(1:4),nrow=2,ncol=2)

  mainwin <- gwindow("MainWindow")

  button1 <- gbutton("Edit Settings", cont=mainwin, handler=
    function(h,...){
      fix(INPUT)
      print("Settings Edited")
    })

  button2 <- gbutton("RUN", cont=mainwin, handler=
    function(h,...){
      sim.func(INPUT)
      print("The run is done")
    })

  savebutton <- gbutton("Write Settings to File", cont=mainwin, handler=
    function(h,...){
      setfilename <- ginput("Please enter the filename")
      save(INPUT, file=setfilename)
    })

  loadbutton <- gbutton("Load Settings from File", cont=mainwin,
    handler=function(h,...){
      fname <- gfile(text="Choose a file",
                     type="open",
                     action="print",
                     handler =
                       function(h,...){
                         do.call(h$action, list(h$file))
                       }
      )
      load(fname)
    })
}

Note the job of the settings function from before is now done by the first line 
of this gui function.
I add this to the same R file as the three functions above, add 'gui' to the 
namespace as an export, set gWidgets stuff as imports, and rebuild, then I 
library() the package and do gui().
The interface shows up. However I have a few issues:
The gui shows up fine, but if I click button1 to edit settings through 
fix(INPUT), then change the values, close the editor and click the button again 
to see if the changes have persisted and been stored in INPUT, they have not.
Same goes for reading in an object, it does not overwrite the INPUT object 
generated by default in the first line of function gui().

I think this has something to do with environments of functions but I'm not too 
sure. In the gui-less version of my package, the user generates the object 
containing settings, which is in workspace and feeds it to the simulation 
function as an argument. However since with the gui version, everything is run 
inside the function gui() and gWidgets handlers makes use of functions(h,...) I 
can't help but feel as if environments are the issue here. It's odd that when 
clicking on button 1, it will find INPUT from the gui() environment, but won't 
make the changes back there.
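
The scoping behaviour described above can be reproduced without any GUI. In 
the sketch below (not from the thread; make_handlers and the handler names are 
invented), a handler that uses ordinary assignment changes only its own local 
copy of INPUT, while one that uses the superassignment operator <<- changes the 
INPUT belonging to the enclosing function. As far as I know, fix() stores the 
edited object in the user's workspace and load() by default restores objects 
into the frame it is called from, so neither would update the INPUT inside 
gui(). Something like INPUT <<- edit(INPUT) in the handler might be worth 
trying, but that is an untested suggestion.

```r
make_handlers <- function() {
  INPUT <- matrix(1:4, nrow = 2, ncol = 2)

  # Ordinary assignment creates a copy local to the handler;
  # the copy is discarded when the handler returns
  edit_local <- function(h, ...) {
    INPUT[1, 1] <- 99L
    invisible(NULL)
  }

  # Superassignment modifies INPUT in the enclosing function's environment
  edit_shared <- function(h, ...) {
    INPUT[1, 1] <<- 99L
    invisible(NULL)
  }

  list(edit_local = edit_local,
       edit_shared = edit_shared,
       current = function() INPUT)
}

handlers <- make_handlers()
handlers$edit_local()
after_local <- handlers$current()[1, 1]    # still the original value
handlers$edit_shared()
after_shared <- handlers$current()[1, 1]   # now the edited value
```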

Can anybody help out with this and suggest what it is I need to do?

Apologies for a long email, but I've tried to explain clearly. Code is 
reproducible, as is the issue, just by having library(gWidgets, gWidgetstcltk)
and copying and pasting code. The abstract example I've provided faithfully 
reproduces the same issues I have with my proper simulation functions so if I 
can't get it working, I won't get the real thing working.

Thanks,

Ben W.
UEA
The Sainsbury Laboratory.



[R] Subscript out of Bounds Lapply

2013-02-05 Thread Benjamin Ward (ENV)
Hi all,

I've written a function for a simulation which, in general terms and without 
being specific to the simulation scenario, duplicates or deletes columns from 
a matrix. Two values determine how many columns, as a proportion of the 
matrix; the two values are always between 0.01 and 0.99:

Dup.Dels <- function(matrix, values){

  # Determine how many columns to be duplicated.
  p.dup <- ceiling(ncol(matrix)*as.numeric(values[1]))

  # Determine how many columns to be deleted.
  p.del <- ceiling(ncol(matrix)*as.numeric(values[2]))

  # Randomly choose columns to be duplicated.
  dups <- sample(ncol(matrix), p.dup, replace=FALSE)

  # Randomly choose columns to be deleted.
  dels <- sample(ncol(matrix), p.del, replace=FALSE)

  matrix.new <- cbind(matrix.new, matrix[,dups]) # Duplicate columns.

  matrix.new <- matrix.new[,-dels] # Delete columns.

  return(matrix.new)

}

I have a list of matrices of different column numbers and I apply the function 
like so (the values are in a matrix):
The lapply goes down the matrix list, and down the matrix with the 'values' in, 
applying the function.

New.Matrices <- lapply(seq(Matrices), function(i) Dup.Dels(Matrices[[i]], Values[i,]))

Now the weird thing is, if I test this with data and run this function. It 
appears to work exactly how I want it to, yet when it is called from the 
simulation loop I get:

Error: subscript out of bounds

So I open up all my .R files, load all my functions and open the simulation 
loop, go through each line, sending it to the R interpreter one by one, 
function by function. I get to the lapply line above, and it works! For some 
reason it works when I spoon feed my simulation loop line by line, but if I 
call the function to start up the simulation and run it, it will hit that 
function and tell me the subscript is out of bounds. Commenting out the call to 
the function in the simulation allows the rest of the simulation to execute 
perfectly.

Does anybody know why I would get this strange behaviour with this code? 
Writing it and testing it line by line with some starting data the simulation 
actually uses, it works fine; even going through the entire sim step by step, 
it works fine. Closing it all up and running the simulation by function call, 
however, it hits an error. Running through it with debug() and browser() it 
also hits this error at the same point, so something is up with this function. 
I can't figure it out: as far as I can tell my indexing in the lapply line is 
correct, both for operating on every element of the list Matrices, and going 
down the rows of the matrix storing the two values.
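
One thing that might explain the difference: inside Dup.Dels, matrix.new is 
read in the cbind() call before the function has assigned it, so R would look 
for a matrix.new outside the function, and what it finds (if anything) depends 
on what is lying around in the workspace at the time. The sketch below is one 
guess at what was intended, not a confirmed fix. The name dup_dels and the test 
data are invented, and drop = FALSE is added so that a one-column result stays 
a matrix.

```r
dup_dels <- function(m, values) {
  n <- ncol(m)
  n_dup <- ceiling(n * as.numeric(values[1]))  # how many columns to duplicate
  n_del <- ceiling(n * as.numeric(values[2]))  # how many columns to delete

  dups <- sample(n, n_dup, replace = FALSE)
  dels <- sample(n, n_del, replace = FALSE)

  # Start from the input matrix itself, not from an undefined matrix.new
  m_new <- cbind(m, m[, dups, drop = FALSE])

  # dels indexes original columns, which keep their positions after cbind()
  m_new[, -dels, drop = FALSE]
}

set.seed(1)
matrices <- list(matrix(1:24, nrow = 3), matrix(1:16, nrow = 2))
values <- rbind(c(0.25, 0.5), c(0.5, 0.25))
new_matrices <- lapply(seq_along(matrices),
                       function(i) dup_dels(matrices[[i]], values[i, ]))
```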



Thanks,
Ben W.

UEA (ENV) b.w...@uea.ac.uk
The Sainsbury Laboratory ben.w...@sainsbury-laboratory.ac.uk




[R] Remove and add to many matrices in list.

2013-02-05 Thread Benjamin Ward (ENV)
Hi all,

I realised that my last email question and code was probably going to be a bit 
of an eyesore for some people and that perhaps the best thing for me to do is 
to pose the question of what it is I want to achieve, rather than what I've 
written, if it helps people:

I'm writing a simulation, and during that simulation I have a list of 
matrices, each with a variable number of columns.
For every matrix in this list I want to randomly select a proportion of its 
columns for removal from the matrix, and a proportion for duplication. 
Proportions of columns to be duplicated or deleted are set out in an n by 2 
matrix:

Delete  Duplicate
0.99    0.43
0.34    0.32
0.54    0.56
... And so on.

So for each matrix in the list of matrices, I want to:


  *   Calculate the number of columns to be deleted or duplicated, something 
like:
     *   number.to.delete <- ncol(matrixlist[[I]])*proportionsmatrix[I,1]
     *   number.to.duplicate <- ncol(matrixlist[[I]])*proportionsmatrix[I,2]
  *   Then I want to sample the columns to be deleted, and those to be copied, 
something like:
     *   which.delete <- sample(ncol(matrixlist[[I]]), number.to.delete, replace=F)
     *   which.duplicate <- sample(ncol(matrixlist[[I]]), number.to.duplicate, replace=F)
  *   Then I want to make the new matrices, something like:
     *   new.matrix <- matrixlist[[I]][,-which.delete]
     *   new.matrix <- cbind(new.matrix, matrixlist[[I]][,which.duplicate])

From my previous email you'll see I did this by making a function which will 
do this for one matrix out of the entire list, and then applying the function 
to the entire list with lapply. This works when I copy and paste the code 
into R with usable data, but as part of the simulation it fails. This is 
strange since I do other similar operations on these matrices without problem, 
with the same method of indexing. Using debug() and running the sim step by 
step, the data does not appear to be altered in a way that would affect the 
function.

Best,
Ben.







[R] NA and Character(0) in List Element

2013-01-28 Thread Benjamin Ward (ENV)
Hi, This is probably a small query but one I'm struggling with: I have a list 
in which I had elements which were NA. I removed them by doing: 
list2 <- lapply(list, na.omit)

However this leaves the element there with 'character(0)' in place as well as 
attributes:

e.g.
[[978]]
character(0)
attr(,"na.action")
[1] 1
attr(,"class")
[1] "omit"


 I want to get rid of these elements/positions in the list, since a function is 
supposed to sample the list for elements (each element is a collection of about 
20 numbers each).
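
A minimal sketch of one way to do this, using made-up data: drop the NA values 
with plain logical indexing (which leaves no na.action attribute behind), then 
keep only the list elements that still have something in them. The names lst 
and cleaned are invented for illustration.

```r
# Made-up list: two elements consist only of NA
lst <- list(c("a", "b"), NA, c("c", NA), NA_character_)

# Remove NA values inside each element without attaching attributes
cleaned <- lapply(lst, function(x) x[!is.na(x)])

# Drop the elements that are now empty
cleaned <- cleaned[sapply(cleaned, length) > 0]
```

The same second step also works on the output of lapply(list, na.omit), 
because the empty elements there have length zero too.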

Thanks,

Ben W.

UEA (ENV) - b.w...@uea.ac.uk





Re: [R] Removal of columns from matrix where all values of the column are identical.

2013-01-26 Thread Benjamin Ward (ENV)
Hi,

I've been trying to work out how this works. I'm still not totally sure, but it 
seems to me it subsets the matrix according to whether each column returns all 
TRUE for the values not being the same when you compare the column[-1] and the 
column[-length(column)], essentially sliding the column against itself and 
making a comparison? It seems that it also removes columns with any adjacent 
repeated values, rather than only columns in which all values are the same:

testm <- matrix(nrow=5, ncol=5)
testm[,1] <- c(1,2,3,4,5)
testm[,2] <- c(3,3,3,3,3)
testm[,3] <- c(3,3,3,4,3)
testm[,4] <- c(5,4,3,2,1)
testm[,5] <- c(1,2,3,4,4)
testm

test3 <- testm[,apply(testm,2,function(x) all(c(TRUE,x[-length(x)]!=x[-1])))]
test3

test3
     [,1] [,2]
[1,]    1    5
[2,]    2    4
[3,]    3    3
[4,]    4    2
[5,]    5    1
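
For comparison, here is a sketch of a test that removes only the columns in 
which every value is identical, by counting distinct values per column. It 
reuses the testm data from above; the names varies and kept are invented for 
illustration.

```r
testm <- cbind(c(1, 2, 3, 4, 5),
               c(3, 3, 3, 3, 3),
               c(3, 3, 3, 4, 3),
               c(5, 4, 3, 2, 1),
               c(1, 2, 3, 4, 4))

# TRUE for columns holding more than one distinct value
varies <- apply(testm, 2, function(x) length(unique(x)) > 1)

# drop = FALSE keeps the result a matrix even if only one column survives
kept <- testm[, varies, drop = FALSE]
```

Here only column 2 is dropped. Indexing with a logical vector also behaves 
sensibly when no column needs removing, unlike negative indexing with an empty 
vector of positions.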

Thanks,
Ben W.


From: arun [smartpink...@yahoo.com]
Sent: 26 January 2013 02:34
To: Benjamin Ward (ENV)
Cc: R help
Subject: Re: [R] Removal of columns from matrix where all values of the column 
are identical.

Hi,

I guess this should also work:


 Matrix[,apply(Matrix,2,function(x) all(c(TRUE,x[-length(x)]!=x[-1])))]
#      [,1] [,2] [,3]
# [1,]    1    5    5
# [2,]    2    4    1
# [3,]    3    3    4
# [4,]    4    2    3
# [5,]    5    1    2
A.K.





[R] Removal of columns from matrix where all values of the column are identical.

2013-01-25 Thread Benjamin Ward (ENV)
Hi all,

I'd like to write a piece of code which will remove columns from a matrix, if 
the column contains only one value, say every value in the column is a 3:

Matrix <- matrix(NA, nrow=5, ncol=4)
Matrix[,1] <- c(1,2,3,4,5)
Matrix[,2] <- c(3,3,3,3,3)
Matrix[,3] <- c(5,4,3,2,1)
Matrix[,4] <- c(5,1,4,3,2)

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    5
[2,]    2    3    4    1
[3,]    3    3    3    4
[4,]    4    3    2    3
[5,]    5    3    1    2

What I have written so far is a loop which will see if all values are the same, 
a bit of a hack since it just checks all values are equal to the first value of 
the column, if not, by definition the column cannot contain only one 
value/variable/character:

removals <- c()
for(i in 1:ncol(Matrix)){
  if(all(Matrix[,i] == Matrix[[1,i]])){
    removals <- append(removals, i)
  }
}
new.Matrix <- Matrix[,-removals]

This works for matrices with numbers or characters.
My question is - is there a better or more efficient way of doing this, maybe 
with apply or something. My first thought was apply set to operate over all 
columns, but was unsure of the indexing and selecting columns to be deleted.

Thanks,

Ben W.

University of East Anglia (ENV): b.w...@uea.ac.uk
The Sainsbury Laboratory: ben.w...@sainsbury-laboratory.ac.uk



[R] Pairwise Comparrisons

2013-01-24 Thread Benjamin Ward (ENV)
Dear all,

I'm trying to write a function, that will take as an argument, some aligned 
genome sequences, and using a sliding window, do pairwise comparisons of 
sequence similarity. Coding the sliding window I think I can manage but what 
I'm trying to get to grips with is getting it so as every pairwise comparison 
is made, no matter how many genomes are added, from 3 to N.

So if I had four genome sequences, G1, G2, G3, G4 the comparisons would be:

G1:G1
G1:G2
G1:G3
G1:G4
G2:G2
G2:G3
G2:G4
G3:G3
G3:G4
G4:G4

I can think of a way this might be done with a very complicated loop, which 
would take the region in the window of each genome and then make all possible 
combinations/comparisons. So the loop would take G1, and then in turn compare 
it against G2, G3, G4. Then it would take G2, and start again and pair it with 
everything from G1 to G4, then it would take G3 and compare it with everything 
from G1 to G4, and then finally it would take G4 and compare it with everything 
from G1 to G4.

This is a wasteful way of doing it, however, because by the time the loop gets 
around to dealing with G4 as its first argument, i.e. the G4:GN comparisons, 
all comparisons involving G4 apart from G4:G4 have already been made: G4:G1 is 
just G1:G4 backwards. So it computes a lot that isn't necessary.

So my question is, how can someone do pairwise comparisons in R this way, and 
ensure all combinations are compared, without being as wasteful as my obvious 
shotgun approach, which computes many redundant comparisons?
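
A minimal sketch of one way to enumerate each pair exactly once, self-comparisons included, is to pair genome i only with genomes i..N. The vector genomes and the names idx and pairs are made up for illustration.

```r
genomes <- c("G1", "G2", "G3", "G4")
n <- length(genomes)

# For each i, pair it only with j >= i, so G4:G1 is never
# recomputed after G1:G4 has been done.
idx <- do.call(rbind, lapply(seq_len(n), function(i) cbind(i, i:n)))

# One label per comparison, in the same order as the list in the post.
pairs <- paste(genomes[idx[, 1]], genomes[idx[, 2]], sep = ":")
pairs
```

Each row of idx can then drive one window comparison. If the self-comparisons are not needed, combn(genomes, 2) returns the remaining pairs directly.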

Ben W.

University of East Anglia (ENV): b.w...@uea.ac.uk
The Sainsbury Lab (JIC): ben.w...@sainsbury-lab.ac.uk



Re: [R] Stuck trying to modify a function

2012-11-28 Thread Benjamin Ward (ENV)
Jean,

Thank you for your suggestion; it has really helped a lot. I ended up changing 
the expression system in my model, since I realised that if the genome has 
extensive duplication, i.e. repeated values, which my organism of interest is 
known to evolve, then because of the ux <- unique(c(x, y)) part of the function, 
duplicate genes within the genome may mutate in the same way. To test this I 
changed the gene values to a much narrower range, between 1 and 10 rather than 
between 1 and 10000. What I found was that all the 7's would mutate negatively 
by 1, all the 5's would mutate positively by 1, and so on. I also realised that 
when duplication is extensive and a duplicated value comes up in the expression 
data, it is not clear which of the duplicated genes is the one that's 
expressed, which complicates things.

I've changed it so that the expression data is the same length as the gene 
data, but consists of 1s and 0s for on and off. The function decides how many 
effectors to express, between 1 and the total number of genes, samples them, 
and makes a data series the same length as the effector genes with all values 
set to 0 (off). It then gets the indexes of the sampled genes in the genome and 
switches the 0s at the corresponding indexes in the expression data to 1.


Init_Express <- function(x, min.exp=1, max.exp=length(x)) {
  Number_Expressed <- round(runif(1, min=min.exp, max=max.exp))
  Expressed <- sample(x,Number_Expressed)
  E <- rep(0, length(x))
  pos<-match(Expressed, x)
  E[pos]<-1
  return(E)
}

Expression.State <- lapply(seq(Effectors), function(i) 
Init_Express(Effectors[[i]]))

Based on how you managed to mutate genes using operations without loops, I 
thought of how to do this with expression states:

Change.Expression <- function(x, prob.express=0.5, prob.repress=0.5) {
  old<-x
  low.old <- old[which(old==0)]
  low.new <- low.old + sample(c(0, 1), size=length(low.old), replace=TRUE, 
    prob=c(1-prob.express, 0+prob.express))
  high.old <- old[which(old==1)]
  high.new <- high.old - sample(c(0, 1), size=length(high.old), replace=TRUE, 
    prob=c(1-prob.repress, 0+prob.repress))
  all.new <- old
  all.new[pmatch(low.old, old, dup=F)] <- low.new
  all.new[pmatch(high.old, old, dup=F)] <- high.new
  return(all.new)
}

Expression.State <- lapply(seq(Expression.State), function(i) 
Change.Expression(Expression.State[[i]]))
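
The same switching rule can also be sketched with logical indexing by position, so that no value matching is needed. Flip.Expression is a hypothetical name and not part of the model code above; it assumes the state vector holds only 0s and 1s.

```r
# Hypothetical alternative: flip 0/1 expression states by position.
Flip.Expression <- function(state, prob.express = 0.5, prob.repress = 0.5) {
  off <- state == 0
  on  <- state == 1
  new <- state
  # Each "off" gene switches on with probability prob.express.
  new[off] <- rbinom(sum(off), size = 1, prob = prob.express)
  # Each "on" gene switches off with probability prob.repress.
  new[on] <- 1 - rbinom(sum(on), size = 1, prob = prob.repress)
  new
}

set.seed(1)
state <- c(0, 1, 1, 0, 0, 1)
Flip.Expression(state)
Flip.Expression(state, prob.express = 0, prob.repress = 0)  # unchanged
Flip.Expression(state, prob.express = 1, prob.repress = 1)  # every state flipped
```

It could be applied to the whole list with lapply(Expression.State, Flip.Expression).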

I've heard a lot of talk about the benefits and drawbacks of loops vs. the apply 
family of functions, be it speed or tidiness of code. I've been offered use of 
a high performance cluster in my dept. (not sure how R does on those), but I was 
still concerned about nested loops because of the number of individuals and 
the number of genes per individual. I mostly went with loops because of the 
indexing issue: controlling going through individuals, and then all the genes 
in turn. Changing the nested part of the loop in the manner you've demonstrated 
eliminates that and will be a big help. It's helped me see how I can use lapply 
with functions that do things in a more vectorised and more R-like manner; I 
was afraid of eventually having to write some processes in C++ to make them 
quicker.

Best Wishes,

Ben W.

UEA (ENV) and The Sainsbury Laboratory.







From: Jean V Adams [jvad...@usgs.gov]
Sent: 27 November 2012 22:01
To: Benjamin Ward (ENV)
Cc: r-help@r-project.org
Subject: Re: [R] Stuck trying to modify a function

Ben,

You can use the sample() function to randomly add -1, 0, or 1 to each 
observation, and control for the probability of mutation at the same time.  
Then you can use the match() function to make sure that any mutations in X are 
carried through to Y in the same way.  I wrote the function to do each list 
element separately.  So a gene in X[[1]] and Y[[1]] will be mutated in the same 
way, but the same gene in X[[2]] and Y[[2]] may be mutated in a different way.  
Not sure if that is what you want.

Mutate <- function(x, y, prob.mutate=0.9) {
  ux <- unique(c(x, y))
  new.ux <- ux + sample(c(-1, 0, 1), size=length(ux), replace=TRUE, 
    prob=c(prob.mutate/2, 1-prob.mutate, prob.mutate/2))
  new.x <- new.ux[match(x, ux)]
  new.y <- new.ux[match(y, ux)]
  list(xm=new.x, ym=new.y)
}

Effectors <- lapply(seq(X), function(i) Mutate(X[[i]], Y[[i]]))
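
The role of unique() and match() in Mutate can be seen on a tiny made-up example, with a fixed change per unique value standing in for the random draw. The values of x, y and the changes are invented for illustration.

```r
x <- c(12, 7, 7, 30)
y <- c(7, 30)                      # a subset of the values in x

ux <- unique(c(x, y))              # 12 7 30
new.ux <- ux + c(1, -1, 0)         # one fixed "mutation" per unique value

new.x <- new.ux[match(x, ux)]      # 13 6 6 30
new.y <- new.ux[match(y, ux)]      # 6 30
```

Both copies of 7 in x change in the same way, and the 7 in y follows them, because all three look up the same entry of new.ux.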

Jean



Benjamin Ward (ENV) b.w...@uea.ac.uk wrote on 11/27/2012 10:45:23 AM:

> Hi,
>
> I have the following data:
>
> Path_Number <- 5
> ID.Path <- c(1:Path_Number) # Make vector of ID's.
> No_of_X <- sample(50:550, length(ID.Path), replace=TRUE) #
> X <- split(sample(1:10000, sum(No_of_X), replace=TRUE), rep(ID.Path,No_of_X))
> Y <- lapply(X,function(x) sample(x, round(runif(1, min=10, max=50))))
>
> X and Y are both lists, and I've made the following function to work
> on that data as part of a simulation I'm building:
>
> Mutate<-function(x){
>   l<-0
>   for(i in x){
>     l2<-0
>     l<-l+1
>     for(i in x[[l]]){
>       l2<-l2+1
>       if(runif(1) < 0.9) ifelse(runif(1) < 0.5, x[[l]][l2

[R] Stuck trying to modify a function

2012-11-27 Thread Benjamin Ward (ENV)
Hi,

I have the following data:

Path_Number <- 5
ID.Path <- c(1:Path_Number) # Make vector of ID's.
No_of_X <- sample(50:550, length(ID.Path), replace=TRUE) #
X <- split(sample(1:10000, sum(No_of_X), replace=TRUE), rep(ID.Path, No_of_X))
Y <- lapply(X,function(x) sample(x, round(runif(1, min=10, max=50))))

X and Y are both lists, and I've made the following function to work on that 
data as part of a simulation I'm building:

Mutate<-function(x){
  l<-0
  for(i in x){
    l2<-0
    l<-l+1
    for(i in x[[l]]){
      l2<-l2+1
      if(runif(1) < 0.9) ifelse(runif(1) < 0.5, x[[l]][l2] <- x[[l]][l2]+1, 
      x[[l]][l2] <- x[[l]][l2]-1)
    }
  }
  return(x)
}

I call this with Effectors<-Mutate(X)
The function is designed to alter the values of each element in X by either + 
or - 1 (50:50 chance whether + or -). However Y, elements of which are a subset 
of the corresponding elements of X, needs to be consistent, i.e. if a value in X 
is changed, and that value is part of the Y subset, then the value in Y also 
needs to be changed. However, since Y is a smaller subset it will not be 
indexed the same. My idea was to include in the function an if statement that 
checks if Y contains the value to be changed, removes it, and then after the 
value in X is changed, puts the new value in Y. I attempted this with:


Mutate<-function(x,y){
  l<-0
  for(i in x){
    l2<-0
    l<-l+1
    for(i in x[[l]]){
      l2<-l2+1
      if(runif(1) < 0.9){
        if(x[[l]][l2] %in% y[[l]] == TRUE){
          y[[l]]<-[which(y[[l]]!=x[[l]][l2])]
          if(runif(1) < 0.5){
            x[[l]][l2] <- x[[l]][l2]+1
            y[[l]]<-append(x[[l]][l2])
          }else{
            x[[l]][l2] <- x[[l]][l2]-1
            y[[l]]<-append(x[[l]][l2])
          }
        }
        ifelse(runif(1) < 0.5, x[[l]][l2] <- x[[l]][l2]+1, x[[l]][l2] <- 
        x[[l]][l2]-1)
      }
    }
  }
  return(list(x,y))
}

Bit of an eyesore so I've put the altered stuff in bold. I've basically taken 
what the ifelse statement does in the first function, (which is still there and 
run if Y does not contain the X value being altered) and broken it down into an 
if and an else segment with multiple operations in curly braces to accommodate 
the extra actions needed to alter Y as well as X. This was all I could think of 
to keep changes between the two in sync, however this does not work when I 
try to load the function into workspace:

Error: unexpected '}' in }

I hope someone can point out what it is I've done that isn't working, or a 
better way to do this.
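
As the function appears here, two lines cannot work: in the line assigning [which(...)] to y[[l]] there is nothing in front of the [ to index, and append() is called with a single argument although it needs both the vector and the values to add. A minimal sketch of the intended logic for one pair of vectors follows. Mutate.Pair and prob.mutate are hypothetical names, and matching by value remains ambiguous when a value occurs more than once.

```r
# Hypothetical helper: mutate one vector x and keep its subset y in step.
Mutate.Pair <- function(x, y, prob.mutate = 0.9) {
  for (k in seq_along(x)) {
    if (runif(1) < prob.mutate) {
      old <- x[k]
      new <- old + sample(c(-1, 1), 1)   # 50:50 chance of +1 or -1
      x[k] <- new
      hit <- match(old, y)               # first position of the old value in y, or NA
      if (!is.na(hit)) y[hit] <- new     # replace in place instead of remove + append
    }
  }
  list(x = x, y = y)
}

set.seed(1)
res <- Mutate.Pair(x = c(10, 20, 30, 40), y = c(20, 40), prob.mutate = 1)
res$x   # every value moved by 1
res$y   # still equal to the 2nd and 4th values of res$x
```

For the lists in the post this could be applied element by element, for example with Map(Mutate.Pair, X, Y).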

Best Wishes,

Ben W.

UEA (ENV) & The Sainsbury Laboratory.



[R] Code works, but not as function.

2012-11-15 Thread Benjamin Ward (ENV)
Hi,

I have some values in a list format generated by the following:
Path_Number <- 0010
ID.Path <- formatC(0001:Path_Number, width=4, flag="0") # Make vector of ID's.
# Define number of effectors each individual gets.
No_of_Effectors <- sample(1:550, length(ID.Path), replace=TRUE)
# Generate effectors and dish them out.
Effectors <- split(sample(1:10000, sum(No_of_Effectors), replace=TRUE), 
rep(ID.Path, No_of_Effectors))
Effectors

And I've written a chunk which is designed to go through each element of the 
list, and then through each value in the element, and if a condition is met 
(in this case if runif(1) is < 0.3; this is simple but may be changed later to 
a more complex probability criterion based on mutation data from experiment), 
change the value by 1 in either direction, higher or lower with equal 
probability. Here it is (I've changed the 0.3 value I mentioned to 0.9, so that 
many values change and it can be easily seen):

l<-0 # Set counter 1 to 0.
for(i in Effectors){ # Begin loop on list of effectors.
  l2<-0 # Set counter 2 to 0.
  l<-l+1 # Increase counter number 1.
  for(i in Effectors[[l]]){ # Begin loop through all effector values.
    l2<-l2+1 # Increase counter number 2.
    # Line which increases or decreases the values in the list element (50/50
    # chance of increase or decrease), if the first IF statement is satisfied.
    if(runif(1) < 0.9) ifelse(runif(1) < 0.5, Effectors[[l]][l2] <- 
    Effectors[[l]][l2]+1, Effectors[[l]][l2] <- Effectors[[l]][l2]-1)
  }
}

Now I don't know if this is the best and most R-ish way of doing this, but it 
works and I understand it. However I'd like to define a function with this, my 
attempts so far have been:

Eff.Mutate<-function(){
  l<-0 # Set counter 1 to 0.
  for(i in Effectors){ # Begin loop on list of effectors.
    l2<-0 # Set counter 2 to 0.
    l<-l+1 # Increase counter number 1.
    for(i in Effectors[[l]]){ # Begin loop through all effector values.
      l2<-l2+1 # Increase counter number 2.
      # Line which increases or decreases the values in effvec, if the first
      # IF statement is satisfied.
      if(runif(1) < 0.9) ifelse(runif(1) < 0.5, Effectors[[l]][l2] <- 
      Effectors[[l]][l2]+1, Effectors[[l]][l2] <- Effectors[[l]][l2]-1)
    }
  }
}

and:

Eff.Mutate2<-function(x){
  l<-0
  for(i in x){
    l2<-0
    l<-l+1
    for(i in x[[l]]){
      l2<-l2+1
      if(runif(1) < 0.9) ifelse(runif(1) < 0.5, x[[l]][l2] <- x[[l]][l2]+1, 
      x[[l]][l2] <- x[[l]][l2]-1)
    }
  }
}

However if I do either Eff.Mutate() or Eff.Mutate2(Effectors), then neither 
seems to work; I've seen no differences in the values in the list elements, 
before and after.
I can't figure out why it works as a code chunk but if I try to make it a 
function nothing seems to happen. I'm probably going about making it a function 
wrong.
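
A likely explanation is that an R function works on its own copy of the list, and a function whose last expression is a for loop returns NULL, so the changes are lost when it exits. A minimal sketch that returns the modified copy is given below. Eff.Mutate3 and prob.mutate are hypothetical names, and the small Effectors list is made up.

```r
# Hypothetical rewrite of Eff.Mutate2(): same idea, but the modified copy is returned.
Eff.Mutate3 <- function(x, prob.mutate = 0.9) {
  for (l in seq_along(x)) {
    for (l2 in seq_along(x[[l]])) {
      if (runif(1) < prob.mutate) {
        step <- if (runif(1) < 0.5) 1 else -1   # 50/50 increase or decrease
        x[[l]][l2] <- x[[l]][l2] + step
      }
    }
  }
  x   # return the modified local copy
}

set.seed(42)
Effectors <- list("0001" = c(10, 20, 30), "0002" = c(40, 50))
Mutated <- Eff.Mutate3(Effectors)   # Effectors itself is untouched
Effectors <- Mutated                # assign back to keep the result
```

The assignment back to Effectors is what makes the change visible outside the function.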

Thanks,

Ben W.

UEA (ENV) & The Sainsbury Laboratory.



Re: [R] Extract cell of many values from dataframe cells and sample from them.

2012-11-11 Thread Benjamin Ward (ENV)
Hi,

Thank you for your suggestion, this works a treat.

For my understanding and future reference, would this also work for something 
like 2D matrices of unequal row size? As far as I understand, it would not be 
possible to make a 3D array jagged like this, because the array function needs 
every slice to have the same number of rows, yet in a list there is no such 
requirement, and operations on matrices can target elements in specific 
matrices by [[,]][,] ?
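
A small sketch of that idea, with made-up data: a list of matrices whose row counts differ, indexed with [[i]][row, column]. The helper make_ind and the column names value and expressed are assumptions for illustration only.

```r
# Each individual is a matrix: one row per effector gene, one column per attribute.
make_ind <- function(n) cbind(value = seq_len(n) * 10, expressed = 0)
pop <- list(make_ind(2), make_ind(4), make_ind(3))

sapply(pop, nrow)                  # row counts differ between individuals

pop[[2]][3, "value"]               # gene 3 of individual 2
pop[[2]][3, "expressed"] <- 1      # switch that gene on

# Duplicate the first gene of individual 3 by appending a copy of its row.
pop[[3]] <- rbind(pop[[3]], pop[[3]][1, ])
```

Deleting a gene works the same way in reverse, for example pop[[3]][-1, , drop = FALSE].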

Best Wishes,

Ben W.

UEA (ENV) & The Sainsbury Laboratory.


From: Jean V Adams [jvad...@usgs.gov]
Sent: 08 November 2012 19:59
To: r-help@r-project.org
Cc: Benjamin Ward (ENV)
Subject: Re: [R] Extract cell of many values from dataframe cells and sample 
from them.

Ben,

I think you would find lists a helpful way to arrange your data.  They do not 
require equal lengths of data in each element.  Check out the code below for a 
smaller version of the example you provided (with only 5 individuals rather 
than 500).

# An alternative way to arrange your data, as a list
# Each element of the list is an individual, with all its effector genes
ID.unique <- formatC(0001:0005, width=4, flag="0")
No_of_Effectors <- sample(1:550, length(ID.unique), replace=TRUE)
Effectors <- split(sample(1:10000, sum(No_of_Effectors), replace=TRUE), 
        rep(ID.unique, No_of_Effectors))
Effectors

# Now take a random sample of effectors from each individual
Expressed_Genes <- lapply(Effectors, function(x) sample(x, sample(1:length(x), 
        1)))
Expressed_Genes

Jean



Benjamin Ward (ENV) b.w...@uea.ac.uk wrote on 11/08/2012 10:00:57 AM:

> Hi,
>
> First my apologies for a non-working piece of code in a previous
> submission, I have corrected this error.
>
> What I'm doing is individual based modelling of a pathogen and its host.
> The way I've thought of doing this is with two dataframes, one of
> the pathogen and its genes and effector genes, and one of the host
> and its resistance genes. During the simulation, these things can
> be pulled out of the dataframes and operated on, before being stored
> again in the dataframes.
>
> Below is how I've created my dataframe and stored my effector genes.
> In this model, effector genes are numerical values between 1 and 10000.
>
> Path_Number <- 0500
> inds <- data.frame(ID=formatC
> (0001:Path_Number,width=4,flag="0"),No_of_Effectors="",No_Expressed_Effectors="")
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow
> (inds),function(x) runif(1, min=1, max=550))))
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds
> $No_of_Effectors,replace=TRUE))
> inds <- data.frame(inds,Effectors=as.character(Effectors))
> Ind_Genes <- strsplit(as.character(inds[1,4]),",")
>
> What I'm trying to do is:
> 1). For each individual (row) in my database, extract the values in
> the Effectors cell to an object.
> 2). Sample a number of those values and assign them to a new object
> called Expressed_Effectors
> 3). Storing it in the Expressed_Effectors cell, in much the same
> manner as I stored the Effectors object in the Effectors cell.
>
> My example attempt (for the first row/individual in my dataset) is below:
>
> (step by step, I didn't put this in a loop until I know it works for 1 row)
>
> Extract the values (effector genes) for the first individual, from
> the Effectors Cell in the dataframe, to Ind_Effectors object.
> Ind_Effectors <- strsplit(as.character(inds[1,4]),",")
>
> Randomly dictate how many values (effectors) will be sampled
> n<-round(runif(1, min=10, max=50))
>
> Sample n values (effector genes) from Ind_Effectors, not replacing
> Expressed_Genes <- sample(Ind_Effectors,n,replace=F)
>
> If I run this I receive the error:
> Error in sample(Ind_Effectors, n, replace = F) :
>   cannot take a sample larger than the population when 'replace = FALSE'
>
> What I think this means is rather than picking out n values from the
> whole set of values in Ind_Effectors it's trying to sample the
> whole lot n times, which it cannot do because replace=F. This is not
> what I need, what I need is n values sampled from Ind_Effectors,
> not all values from Ind_Effectors sampled n times.
>
> I hope this clears up the confusion with what I'm trying to do. It
> may very well be I'm not instructing R to sample as I require
> properly. Sadly my previous experience with R amounts to loading in
> dataframes from experiment and doing stat analysis & model fitting,
> not simulations or individual based models.
>
> Best wishes,
>
> Ben W.
> UEA (ENV) & The Sainsbury Laboratory.
>
> P.S. As an aside I've been thinking about doing this model an
> alternative way to as I described in the first bit of my email
> (based on dataframes).
> Instead I would use a multi-dimensional ragged array(s):
> The format would be a 2D layout, Where every line is an effector
> gene and every column an aspect of the effector gene(value,
> expression state, fitness contribution etc.) This 2D layout of rows
> and columns is then repeated in the 3rd dimension (the z of x,y,z)
> of the array for each individual. It is ragged in the sense each
Re: [R] sample from list

2012-11-08 Thread Benjamin Ward (ENV)
Hi,

Thanks for the reply.

I should explain more. I'll be as brief as I can; the code for generating the 
dataframe is below.

What I'm doing is individual based modelling of a pathogen and its host. The 
way I've thought of doing this is with two dataframes, one of the pathogen and 
its genes and effectors, and one of the host and its resistance genes. During 
the processes of the model these things can be pulled out of the dataframes and 
operated on, before being stored again in the dataframes.

I have generated my dataset as below; it was suggested by arun in a reply to 
a previous email I wrote with the subject "Trouble with data structures".

Path_Number <- 0500 # The number of pathogen individuals in the population.
# Create the initial dataframe, with initial number of effectors and initial 
# number of expressed effectors.
inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag="0"),No_of_Effectors="",No_Expressed_Effectors="")
# Generate the number of effector genes each individual has.
inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) 
runif(1, min=1, max=550))))
# Generate the actual effector genes.
Effectors <- lapply(1:nrow(inds),function(x) 
sample(1:10000,inds$No_of_Effectors,replace=TRUE))
# Add them to the dataframe
inds <- data.frame(inds,Effectors=as.character(Effectors))


What I'm trying to do is for each individual, extract the values in the 
Effector genes cell to an object. As far as I can tell,

Ind_Genes<-strsplit(as.character(inds2[1,4]),",")

Will do this for the first individual or I can get all of them with

All_Genes<-strsplit(as.character(inds2[,4]),",")

What I then want to do is according to a generated number for each individual...

round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))

... sample that many genes from Ind_Genes and make a new object called 
Expressed_Genes, which can be stored in the dataframe. My attempt at doing this 
is:

Expressed_Genes<-lapply(First_Ind_Genes,function(x) 
sample(First_Ind_Genes,round(runif(1, min=10, max=50)),replace=F))

to get Expressed genes for each individual, this might be part of a for loop, 
or to the whole list of every individuals genes like so:

Expressed_Genes<-lapply(All_Genes,function(x) sample(All_Genes,3,replace=F))

What usually happens however is I get errors:
Error in sample(First_Ind_Genes, round(runif(1, min = 10, max = 50)),  :
  cannot take a sample larger than the population when 'replace = FALSE'

or it will rather than sample 3 values, sample all the values, 3 times if I 
allow replacement (which I don't want).

So it's not sampling 3 values for me, but the whole lot of values 3 times.

I do not know of another way to extract these gene values and then do things 
with them.
For my model it is essential I can pull the genes or expressed genes out of the 
dataframe, work functions or operations on them and then store them back again. 
For example if an individual turns a gene on that was not before, then the 
genes would need to be pulled from the database, as would the expressed genes, 
and a random value from the genes object added to the expressed genes object, 
and then they could both be put back. A similar thing would happen when I 
wanted to mutate the genes.

In short my aim is pull genes or expressed genes out, work functions or 
operations on them and then store them back again.

Hopefully I've explained better. I have been thinking of changing my approach 
from datasets of pathogen and host, from which values are pulled into objects 
and operated on, to multi-dimensional ragged arrays. I've been told this may 
be simpler for me.

Where every line is an effector gene and there can be columns for the gene 
value, expression state (1 or 0/T or F), fitness contribution etc. This 2D 
layout of rows and columns is then repeated in the z dimension of the array for 
each individual. It is ragged in the sense each individual, each slice through 
the array in the z direction, would have different numbers of rows - different 
numbers of effectors. I can then simulate mutations by changing the gene 
values, cause duplications by adding rows of duplicated genes, or even cause 
deletions by removing rows.
Once I have this set up for the pathogen I may make a similar array for the 
host plants, then perhaps with indexing or some such thing I can write 
functions to do the interactions and immunology and such.

Best,

Ben W.

UEA (ENV) & The Sainsbury Laboratory.



From: Jean V Adams [jvad...@usgs.gov]
Sent: 07 November 2012 21:12
To: Benjamin Ward (ENV)
Cc: r-help@r-project.org
Subject: Re: [R] sample from list

Ben,

Can you provide a small example data set for
inds
so that we can run the code you have supplied?
It's difficult for me to follow what you've got and where you're trying to go.

Jean



Benjamin Ward (ENV) b.w...@uea.ac.uk wrote on 11/06/2012 03:29:52 PM:

> Hi all,
>
> I have a list of genes present in 500 individuals

[R] Extract cell of many values from dataframe cells and sample from them.

2012-11-08 Thread Benjamin Ward (ENV)
Hi,

First my apologies for a non-working piece of code in a previous submission, I 
have corrected this error.

What I'm doing is individual based modelling of a pathogen and its host. The way 
I've thought of doing this is with two dataframes, one of the pathogen and its 
genes and effector genes, and one of the host and its resistance genes. During 
the simulation, these things can be pulled out of the dataframes and operated 
on, before being stored again in the dataframes.

Below is how I've created my dataframe and stored my effector genes. In this 
model, effector genes are numerical values between 1 and 10000.

Path_Number <- 0500
inds <- 
data.frame(ID=formatC(0001:Path_Number,width=4,flag="0"),No_of_Effectors="",No_Expressed_Effectors="")
inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) 
runif(1, min=1, max=550))))
Effectors <- lapply(1:nrow(inds),function(x) 
sample(1:10000,inds$No_of_Effectors,replace=TRUE))
inds <- data.frame(inds,Effectors=as.character(Effectors))
Ind_Genes <- strsplit(as.character(inds[1,4]),",")

What I'm trying to do is:
1). For each individual (row) in my database, extract the values in the 
Effectors cell to an object.
2). Sample a number of those values and assign them to a new object called 
Expressed_Effectors
3). Storing it in the Expressed_Effectors cell, in much the same manner as I 
stored the Effectors object in the Effectors cell.

My example attempt (for the first row/individual in my dataset) is below:

(step by step, I didn't put this in a loop until I know it works for 1 row)

Extract the values (effector genes) for the first individual, from the 
Effectors Cell in the dataframe, to Ind_Effectors object.
Ind_Effectors <- strsplit(as.character(inds[1,4]),",")

Randomly dictate how many values (effectors) will be sampled
n<-round(runif(1, min=10, max=50))

Sample n values (effector genes) from Ind_Effectors, not replacing
Expressed_Genes <- sample(Ind_Effectors,n,replace=F)

If I run this I receive the error:
Error in sample(Ind_Effectors, n, replace = F) :
  cannot take a sample larger than the population when 'replace = FALSE'

What I think this means is rather than picking out n values from the whole set 
of values in Ind_Effectors it's trying to sample the whole lot n times, which 
it cannot do because replace=F. This is not what I need, what I need is n 
values sampled from Ind_Effectors, not all values from Ind_Effectors sampled 
n times.
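
A probable cause, sketched on made-up data: strsplit() returns a list with one element per input string, so the object being sampled has length 1 and any n above 1 is larger than that population. The string in cell is an invented stand-in for one Effectors cell, and genes is an illustrative name.

```r
cell <- "101,202,303,404,505"             # made-up stand-in for one Effectors cell

Ind_Effectors <- strsplit(cell, ",")
length(Ind_Effectors)                     # 1: a list holding one character vector

# Take the vector out of the list first, then sample from the values themselves.
genes <- as.numeric(Ind_Effectors[[1]])
length(genes)                             # 5

set.seed(1)
Expressed_Genes <- sample(genes, 3, replace = FALSE)
Expressed_Genes
```

Cells built with as.character(Effectors) may also carry text of the form c(...) around the numbers, which would need stripping before as.numeric() is applied.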

I hope this clears up the confusion with what I'm trying to do. It may very 
well be I'm not instructing R to sample as I require properly. Sadly my 
previous experience with R amounts to loading in dataframes from experiment and 
doing stat analysis & model fitting, not simulations or individual based models.

Best wishes,

Ben W.
UEA (ENV) & The Sainsbury Laboratory.

P.S. As an aside I've been thinking about doing this model in an alternative 
way to the one I described in the first bit of my email (based on dataframes).
Instead I would use a multi-dimensional ragged array(s):
The format would be a 2D layout, Where every line is an effector gene and every 
column an aspect of the effector gene(value, expression state, fitness 
contribution etc.) This 2D layout of rows and columns is then repeated in the 
3rd dimension (the z of x,y,z) of the array for each individual. It is ragged 
in the sense each individual, each slice through the array in the z direction, 
would have different numbers of rows - different numbers of effectors. This may 
be easier to work on, but I've not worked with multidimensional arrays, I'm 
used to data in dataframes (usually from spreadsheets from experiments).


From: Jean V Adams [jvad...@usgs.gov]
Sent: 08 November 2012 13:35
To: Benjamin Ward (ENV)
Cc: r-help@r-project.org
Subject: RE: [R] sample from list

Ben,

You have still not supplied reproducible code for me (and any other r-help 
reader) to run, which makes it very difficult to help you.  I can run your 
first 5 lines of code with no problem.

Path_Number <- 0500
inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag="0"),No_of_Effectors="",No_Expressed_Effectors="")
inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) 
runif(1, min=1, max=550))))
Effectors <- lapply(1:nrow(inds),function(x) 
sample(1:10000,inds$No_of_Effectors,replace=TRUE))
inds <- data.frame(inds,Effectors=as.character(Effectors))

But your 6th line of code doesn't work ... there is no object inds2.

Ind_Genes<-strsplit(as.character(inds2[1,4]),",")

If I use code that you provided in your earlier e-mail to create inds2, I get 
errors because inds doesn't have a variable No_of_Genes.

Genes <- lapply(1:nrow(inds),function(x) 
sample(1:10000,inds$No_of_Genes,replace=TRUE))
inds2 <- data.frame(inds, Genes=I(Genes))
inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) 
runif(1, min=10, max=50))))

So, before you hit the send button on your next e-mail.  Start a clean R 
session with none of your objects

[R] sample from list

2012-11-06 Thread Benjamin Ward (ENV)
Hi all,

I have a list of genes present in 500 individuals, the individuals are the 
elements:
Genes <- lapply(1:nrow(inds),function(x) 
sample(1:10000,inds$No_of_Genes,replace=TRUE))

(This was later written to a dataframe as well as kept as the list object: 
inds2 <- data.frame(inds,Genes=I(Genes)))

I also have a vector of how many of those genes are expressed in the 
individuals; this can also be kept as a vector object or written to a data frame:

inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) 
runif(1, min=10, max=50))))

I want to create another list which consists of each individuals expressed 
genes - essentially a subset of the total genes the individuals have in the 
Genes list, by sampling from the Genes list for each individual, the number 
of genes (values)in the Num_Expressed_Genes vector. i.e. if Num_Expressed_Genes 
= 3 then sample 3 values from the element in the Genes list. I can't quite 
figure it out though. So far I have the following:

# Defines the number of expressed genes for each individual in my data frame.
Num_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) 
runif(1, min=10, max=50))))


# My attempts to apply the sample function to every element (individual 
# organism) of the Genes list, to subset the genes expressed.
Expressed_Genes <- lapply(1:nrow(inds),function(x) 
sample(Genes,Num_Expressed_Genes, replace=FALSE))
Expressed_Genes <- lapply(Genes,function(x) sample(Genes,Num_Expressed_Genes, 
replace=FALSE))

So far though I'm getting results like this:

[[49]]
[[49]][[1]]
  [1] 3540   27 5344 7278 9758 8077 ... [217]


[[49]][[2]]
  [1]  740 3362 8588 8574 4371 1447 .. [340]


When what I need is more:

[[49]]
[1] 6070 1106 6275
In a case where Num_Expressed_Genes = 3 and the values are taken from the much 
larger set of values for element (individual) 49 in my Genes list.

I'm not sure what I'm doing wrong but it seems what is happening is instead of 
picking out a few values according to the Num_Expressed_Genes vector - as an 
example say 3 again, It's drawing a large number of values, if not all of them, 
from elements in the list, 3 times.
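
In both attempts the anonymous function samples from the whole Genes list rather than from its own argument x, and passes the whole Num_Expressed_Genes vector as the sample size. A minimal sketch that pairs each individual's genes with its own count is given below, on made-up data small enough to inspect.

```r
set.seed(1)
# Two made-up individuals and the number of genes each should express.
Genes <- list(c(3540, 27, 5344, 7278, 9758, 8077),
              c(740, 3362, 8588, 8574, 4371))
Num_Expressed_Genes <- c(3, 2)

# Map() walks both inputs in parallel: g is one individual's genes,
# n is that individual's number of expressed genes.
Expressed_Genes <- Map(function(g, n) sample(g, n, replace = FALSE),
                       Genes, Num_Expressed_Genes)
Expressed_Genes
```

Each element of Expressed_Genes is then a plain vector of the requested length, drawn only from the matching element of Genes.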

Any help is greatly appreciated,
I've thought of using loops to achieve the same task, but I'm trying to get my 
individual/genes/expressed genes data.frame set up for my individual based 
model and get it running using vectors and as little loops as possible.

Thanks,
Ben.



[R] FW: Having some Trouble Data Structures

2012-11-05 Thread Benjamin Ward (ENV)
From: Benjamin Ward (ENV)
Sent: 03 November 2012 13:29
To: Jeff Newmiller; r-help@r-project.org
Subject: RE: [R] Having some Trouble Data Structures

Hi,
Thank you very much for your reply - how you prefer, is how my supervisor 
implemented the layout in Minitab, however I was unsure of how to get R to do 
this repeating ID behaviour and how to know that in a for loop going through 
individual 1 to say 10, I want it to:

Randomly sample a number from a distribution for the number of effectors (I can 
do this but with runif),

Then put one value in a cell of the Effector column and repeat the ID for each 
effector row. I'm also then left wondering when I do for loops then that use 
ID, will it go and apply operations row by row, or ID by ID - for example in 
the immunology part I would need a loop to check individual by individual if 
any of the effectors it has means death in the host, in which case all 
instances of - say ID 1 would need to be deleted.

Would you be able to provide an example chunk of how you accomplish this with 
your preferred approach, if you have the time?

Thanks,
Ben W.


From: Jeff Newmiller [jdnew...@dcn.davis.ca.us]
Sent: 28 October 2012 15:27
To: Benjamin Ward (ENV); r-help@r-project.org
Subject: Re: [R] Having some Trouble Data Structures

Search on ragged array.

My preferred approach is to use a data frame with one row per effector that 
repeats the per-ID information. If that occupies too much memory, you can set 
up another data frame with one row per ID and refer to that information as 
needed, using lapply and subsetting the effectors data. The plyr package is 
also useful for such processing.
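
A minimal base R sketch of the one-row-per-effector layout, for illustration 
only. The column names, the 1-to-5 range for the number of effectors, and the 
host_genes lethality rule are all assumptions made for the example, not 
something stated in the thread.

```r
set.seed(42)  # only so the sketch is repeatable

n_individuals <- 10
ids <- sprintf("%04d", seq_len(n_individuals))

# Number of effectors per individual (the 1-to-5 range is an assumption)
n_effectors <- sample(1:5, n_individuals, replace = TRUE)

# One row per effector; rep() repeats each ID once per effector it owns
pathogens <- data.frame(
  ID       = rep(ids, times = n_effectors),
  Effector = sample.int(10000, sum(n_effectors), replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical lethality rule: any effector found in host_genes kills the
# individual. The two values are taken from the data so the demo has matches.
host_genes <- pathogens$Effector[c(1, nrow(pathogens))]

# Check ID by ID, then drop every row belonging to a dead individual
dead_ids  <- unique(pathogens$ID[pathogens$Effector %in% host_genes])
survivors <- pathogens[!pathogens$ID %in% dead_ids, ]
```

Because the ID is repeated on every row, removing an individual is a single 
logical subset rather than a loop over rows.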
---
Jeff Newmiller
DCN: jdnew...@dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---
Sent from my phone. Please excuse my brevity.


[R] Having some Trouble Data Structures

2012-10-28 Thread Benjamin Ward (ENV)
Hi All,

I'm trying to run a simulation of host-pathogen evolution based around 
individuals.
What I need to have is a dataframe or table of some description - describing 
all the individuals of a pathogen population (so far I've implemented this as a 
matrix):

       ID     No_of_Effectors   Effectors (Sequences)
 [1,]  0001   3                 ## 3 Random Numbers ##

There will be many such rows for many individuals. They have something called 
effectors, the number of which is randomly generated, so say you get 3 in the 
No_of_Effectors column. Then I make R generate 3 numbers between 1 and 
10,000, which gives me three numerical representations of genes. These numbers 
will be compared to a similar data structure of the host individuals, who have 
their immune genes with similar numbers.

My problem is that obviously I can't stick 3 numbers in one cell of the 
matrix (I've tried) :

Pathogen_Individuals[1,3] <- c(345, 567, 678)
Error in Pathogen_Individuals[1, 3] <- c(345, 567, 678) :
  number of items to replace is not a multiple of replacement length

In future I'm also going to have more variables such as whether a gene is 
expressed. Such information may require a matrix in itself - something like:


       Effector ID   Sequence      Expressed?
 [1,]  0001          345,567,678   1 (or 0).

Is there a way I can put more than one value in a cell, like a list of 
values, or a way to put objects in a cell of a data frame, matrix or table etc.? 
Almost an inception deal: data structures nested in a data structure? If I 
search for things like "insert list into matrix" I get results on how to turn 
one into another, which is not what I think I need to be doing.
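
One way to hold several values in a single cell is a list column in a data 
frame. The sketch below is only an illustration: the data frame name, column 
names and values are made up, reusing the example numbers from the message.

```r
# Two hypothetical individuals
pathogens <- data.frame(ID = c("0001", "0002"),
                        No_of_Effectors = c(3, 2),
                        stringsAsFactors = FALSE)

# A list column: each cell holds a whole vector rather than a single value
pathogens$Effectors <- list(c(345, 567, 678), c(12, 9001))

# [[ ]] pulls out the full vector stored for one individual
first_effectors <- pathogens$Effectors[[1]]

# lengths() counts the values held in each cell
effector_counts <- lengths(pathogens$Effectors)
```

A list column keeps one row per individual, at the cost of needing lapply() or 
sapply() for most operations on the nested values.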

I have been considering having several data structures not nested in each 
other, something like creating, for every individual, a new matrix object with 
the name Effectors_[Individual_ID] and somehow getting my simulation loops to 
operate on those objects. But I find it hard to see how to tell R that all of 
those matrices are to be included in an operation, as you can with all rows of 
a data frame, for example with for loops.
This is strange for me because this model was written in a macro-code for 
another program which handles data in a different format and layout to R.
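
Rather than many separately named objects, the per-individual data could live 
in a single named list, which lapply() and sapply() can walk in one call. This 
is a rough sketch under assumed names and sizes (four individuals, 1 to 5 
effectors each), not the model itself.

```r
set.seed(7)  # only so the sketch is repeatable

ids <- sprintf("%04d", 1:4)

# One list element per individual, named by ID, instead of separate objects
# called Effectors_0001, Effectors_0002, and so on
effectors <- lapply(ids, function(id) sample.int(10000, sample(1:5, 1)))
names(effectors) <- ids

# A single call operates on every individual
n_per_individual <- lengths(effectors)
largest <- sapply(effectors, max)

# Look up one individual by its ID
third <- effectors[["0003"]]
```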

My problem, I think, is that each individual in the model has many variables, 
in this case representations of genes. So I'm having trouble getting my head 
around this.

Hopefully someone more experienced will be able to offer advice or a solution; 
it would be much appreciated.

Many Thanks,
Ben Ward (ENV, UEA & The Sainsbury Lab, JIC).

P.S. I have searched previous queries to the list, and I'm not sure, but this 
may be useful or relevant:


Have you thought of using a list?

> a <- matrix(1:10, nrow=2)
> b <- 1:5
> x <- list(a=a, b=b)
> x
$a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

$b
[1] 1 2 3 4 5

> x$a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x$b
[1] 1 2 3 4 5

oliveoil and yarn datasets have been mentioned.







[R] RExcel

2009-10-07 Thread Benjamin Ward
Hello-

I am a Graduate Assistant for an instructor who has written programs for
statistics calculations such as binomial distributions and regressions.

The programs worked with no problem in Excel 2003. Now we are trying to
use them with Excel 2007, and we are having some trouble.

I have downloaded RandFriends and have run the binomial distribution
process in Excel 2007, and I received an error that says: Compile error
in hidden module: UFDBinomial

However, there are two demo Excel files in the RExcel folder called RDemoDens.
When I open the first RDemoDens Excel file, I can run the processes and they
work fine. When I run the second RDemoDens Excel file, or a blank Excel 2007
file, the processes do not work and I get the error message.

I am trying to figure out what is different about the first RDemoDens Excel
file that allows the calculations to process correctly. I am thinking that
something in the macro library in the demo must be different from what is in
a blank Excel document. I just cannot seem to figure out what it is.

One thing that I did notice is that there are two different RExcel files in
the RExcel folder. One is labeled RExcel and one is labeled RExcel 2007.
What is the difference between these two RExcel files? I am not sure if
this has anything to do with the problem, but perhaps the Excel demo in
which our calculations work uses the correct RExcel file while a regular
Excel 2007 document does not call the correct one.

If anyone has an idea about what might be happening here, or who else I
could ask about the situation, I would appreciate any input.

Thanks,

Ben Ward
