[R] Combining multiple probability weights for the sample() function.
Dear R-List,

I have a set of possibilities I want to sample from:

    bases <- list(c('A', 'C'), c('A', 'G'), c('C', 'T'))
    possibilities <- as.matrix(expand.grid(bases))
    possibilities
         Var1 Var2 Var3
    [1,] A    A    C
    [2,] C    A    C
    [3,] A    G    C
    [4,] C    G    C
    [5,] A    A    T
    [6,] C    A    T
    [7,] A    G    T
    [8,] C    G    T

Suppose I randomly sample one of these rows with equal weights. Then it is 25% likely that my choice will have an identical first and second letter (e.g. [1,] A A C). It is also 25% likely that my choice will have an identical first and third letter (e.g. [4,] C G C). The second and third letters of my choice can never be identical.

What I would like to do is sample one of the rows under the constraint that the probability of drawing identical letters 1 and 2 is 50% (0.5), and at the same time the probability of drawing identical letters 1 and 3 is also 50%. I am unsure how to do this, but I know it involves coming up with a modified set of weights for the sample() function. My progress is below; any advice is much appreciated.

Best Wishes, Ben Ward, UEA.
So I have used the following code to come up with a matrix which contains weightings according to each criterion:

    possibilities <- as.matrix(expand.grid(bases))
    identities <- apply(possibilities, 1, function(x) c(x[1] == x[2], x[1] == x[3], x[2] == x[3]))
    prob <- matrix(rep(0, length(identities)), ncol = ncol(identities))
    consProb <- apply(identities, 1, function(x){0.5 / length(which(x))})
    polProb <- apply(identities, 1, function(x){0.5 / length(which(!x))})
    for(i in 1:nrow(identities)){
      prob[i, which(identities[i,])] <- consProb[i]
      prob[i, which(!identities[i,])] <- polProb[i]
    }
    rownames(prob) <- c("1==2", "1==3", "2==3")
    colnames(prob) <- apply(possibilities, 1, function(x) paste(x, collapse = ", "))

This code gives the following matrix:

         A, A, C  C, A, C  A, G, C  C, G, C  A, A, T  C, A, T  A, G, T  C, G, T
    1==2  0.2500   0.0833   0.0833   0.0833   0.2500   0.0833   0.0833   0.0833
    1==3  0.0833   0.2500   0.0833   0.2500   0.0833   0.0833   0.0833   0.0833
    2==3  0.0625   0.0625   0.0625   0.0625   0.0625   0.0625   0.0625   0.0625

Each column is one of the choices from 'possibilities', and each row gives a series of weights based on one of three criteria:

Row 1: if letter 1 == letter 2 is possible among the choices, the combined chance of those choices should be 50%.
Row 2: if letter 1 == letter 3 is possible among the choices, the combined chance of those choices should be 50%.
Row 3: if letter 2 == letter 3 is possible among the choices, the combined chance of those choices should be 50%.

So:

If I used sample(x = 1:nrow(possibilities), size = 1, prob = prob[1,]) repeatedly, I expect about half the choices to contain identical letters 1 and 2.
If I used sample(x = 1:nrow(possibilities), size = 1, prob = prob[2,]) repeatedly, I expect about half the choices to contain identical letters 1 and 3.
If I used sample(x = 1:nrow(possibilities), size = 1, prob = prob[3,]) repeatedly, I would expect about half the choices to contain identical letters 2 and 3, except that in this case it is not possible.

Note that rows 1 and 2 each sum to 1; row 3 sums to 0.5, which sample() tolerates because its weights need not sum to one.
What I would like to do, if it is possible, is combine these three sets of weights into one set that, when used with sample(x = 1:nrow(possibilities), size = 1, prob = MAGICPROB), will give me a list of choices where ~50% of them contain identical letters 1 and 2, AND ~50% of them contain identical letters 1 and 3, AND ~50% again contain identical letters 2 and 3 (except in this example, as it is not possible from the choices). Can multiple probability weightings be combined in such a manner?

[[alternative HTML version deleted]]
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
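For this particular example the two achievable constraints can be met directly, because the rows with letter 1 == letter 2 and the rows with letter 1 == letter 3 do not overlap. The sketch below is one possible weighting under that assumption, not a general method; the names eq12, eq13 and w are made up for illustration. In general each constraint is a linear equation in the weights, and there may be many solutions or none.

```r
bases <- list(c("A", "C"), c("A", "G"), c("C", "T"))
possibilities <- as.matrix(expand.grid(bases))

# Which rows satisfy each criterion (hypothetical helper names)
eq12 <- possibilities[, 1] == possibilities[, 2]
eq13 <- possibilities[, 1] == possibilities[, 3]

# This shortcut only works when the two sets of rows are disjoint
stopifnot(!any(eq12 & eq13))

# Split 0.5 evenly over each set; all other rows get weight 0
w <- numeric(nrow(possibilities))
w[eq12] <- 0.5 / sum(eq12)
w[eq13] <- 0.5 / sum(eq13)

# Draw one row index with the combined weights
choice <- sample(seq_len(nrow(possibilities)), size = 1, prob = w)
possibilities[choice, ]
```

Note that this uses up all of the probability mass, so rows satisfying neither criterion are never drawn; whether that is acceptable depends on the simulation.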
[R] Two x axes - top and bottom
Hi, fellow R users,

I've been asked to make a plot with two datasets, each with a different x axis, and it's been suggested that one axis be at the top and the other at the bottom of the graph. I normally use ggplot2, and I know how to plot multiple datasets by simply adding a new geom with a different data option, but usually in these cases my different datasets have had the same x and y axes. Can I add a new x axis to the top of the plot in ggplot2 or one of the other graphics packages?

Thanks, Ben W.
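One option outside ggplot2 is base graphics, where a second plot can be overlaid with par(new = TRUE) and its axis drawn on side 3 (the top). This is a minimal sketch with made-up data (x_bottom, x_top, y1, y2 are all hypothetical), drawn to a null device so it runs anywhere.

```r
# Two made-up datasets with very different x ranges
x_bottom <- 1:10
y1 <- x_bottom^2
x_top <- seq(100, 1000, length.out = 10)
y2 <- sqrt(x_top)

pdf(NULL)                        # null device; use a real device to see the plot
par(mar = c(5, 4, 4, 4) + 0.1)   # leave room for the top and right axes

# First dataset: ordinary bottom x axis and left y axis
plot(x_bottom, y1, type = "l", xlab = "bottom x", ylab = "y1")

# Second dataset: overlay without axes, then add axes on top and right
par(new = TRUE)
plot(x_top, y2, type = "l", col = "red", axes = FALSE, xlab = "", ylab = "")
usr_top <- par("usr")            # coordinate range used by the overlaid plot
axis(side = 3)
axis(side = 4)
mtext("top x", side = 3, line = 2.5)
mtext("y2", side = 4, line = 2.5)

dev.off()
```

Because the two plots use independent coordinate systems, the lines are only comparable via their own axes, so colouring the axis labels to match the lines may help the reader.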
[R] Change points in R
Hi R helpers,

I have a set of data best shown in the graph below. Each coloured line represents a statistic calculated across pairs of DNA sequences. For each coloured line, I would like to identify breakpoints, i.e. identify the chunks where the values are high. For example, in the light blue line there is a large high segment just after x=2e+05.

From googling how to find such points, I've read about something called change-point analysis, used with time series data, and I wondered whether it or a variant of it in R might be of use here. This data is a series of % values (double), all from a single measurement: for each line, a 'scanner' passed over two sequences and at each step recorded the % value. Can change-point analysis help me here, and if so, what package or method will allow me to do this while making as few assumptions about my data as possible?

Thanks in advance, Ben W.

[Figure: attached graph, not preserved]
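To illustrate the basic idea behind change-point analysis, here is a deliberately simple base-R sketch that finds a single change in mean by least squares. It is not what any particular package does; single_changepoint and the toy series y are invented for this example, and real data with several segments would need a multiple change-point method (CRAN packages such as changepoint provide these).

```r
# Find the split position k that minimises the total within-segment
# sum of squares when the series is cut into y[1:k] and y[(k+1):n].
single_changepoint <- function(y) {
  n <- length(y)
  cost <- sapply(1:(n - 1), function(k) {
    left <- y[1:k]
    right <- y[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  which.min(cost)
}

# Toy series: 50 low values followed by 30 high values
y <- c(rep(10, 50), rep(80, 30))
k <- single_changepoint(y)
k  # last index of the low segment
```

The only assumption here is that a segment is characterised by its mean; noisy data would usually also need a penalty to decide whether a change is real.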
Re: [R] Small p from binomial probability function.
Hi,

Thanks again for your answers. Just so I can be clear about what is happening with the uniroot method: I'm defining a function in which the binomial probability function pbinom is present, but in addition p0 is subtracted from the result. In this case p0 is the large P I want to plug in, so 0.05, 0.50 and 0.95, or even just 0.05 and 0.95? Then uniroot finds the root of this function, and in doing so finds me the small p I need?

Best, Ben.

From: Rolf Turner [rolf.tur...@vodafone.co.nz]
Sent: 11 October 2013 02:11
To: Benjamin Ward (ENV)
Cc: Stefan Evert; R-help Mailing List
Subject: Re: [R] Small p from binomial probability function.

It is mysterious to me why the procedure proposed by Stefan Evert works. It appears to work --- once you modify the call to binom.test() to have the correct syntax. In a sequence of 1000 trials with random values of N, x, and p0, the answers from Evert's procedure agreed with the answer given by uniroot() to within +/- 3.045e-05.

However your question was (in effect) how to solve the equation Pr(X <= x) = p0 for p, where X ~ Binom(N,p), with N and x known. What this has to do with confidence intervals for p is, to my mind at least, completely opaque. In contrast it is obvious why the procedure using uniroot() works.

I would suggest that you stick with the uniroot() procedure in that it is readily comprehensible.

cheers, Rolf Turner
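The uniroot() idea described above can be sketched as follows. This is an illustration rather than the code exchanged in the thread: the function name solve_small_p, the argument names k (observed mutations), n (number of bases, i.e. trials) and p0 (target cumulative probability), and the numbers used are all assumptions.

```r
# Solve pbinom(k, n, p) = p0 for p by finding the root of
# f(p) = pbinom(k, n, p) - p0 on the interval [0, 1].
# f(0) = 1 - p0 > 0 and f(1) = -p0 < 0 when k < n, so a root exists.
solve_small_p <- function(k, n, p0) {
  f <- function(p) pbinom(k, n, p) - p0
  uniroot(f, interval = c(0, 1), tol = 1e-12)$root
}

k <- 3        # hypothetical number of observed mutations
n <- 1000     # hypothetical sequence length
mut <- 1e-6   # hypothetical mutation rate per base per unit time

p <- sapply(c(0.05, 0.50, 0.95), solve_small_p, k = k, n = n)
t_est <- p / mut   # since p = mut * t
```

Because pbinom(k, n, p) decreases as p increases, the p found for 0.05 is the largest of the three and the one for 0.95 the smallest.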
Re: [R] Small p from binomial probability function.
Hi,

Thank you for your answers. I'm not completely sure whether it's binom.test I need or uniroot. Perhaps I should explain more about the idea behind the code and the actual task I'm trying to do.

The idea is to calculate a confidence interval for the age of two DNA sequences which have diverged, where I know the number of mutations that happened in them, and I know the mutation rate. The binomial probability can be used since mutations have a probability of occurring or being observed so many times in a sequence. This depends on the length of the DNA stretch (which equates to the number of trials, since each base is a possibility of observing a mutation) and on the probability of a single mutation occurring, which is p = t * u, since more time means a higher probability that a mutation may have occurred.

So my code, using pbinom, is supposed to calculate the probability that my DNA stretches contain the number of mutations observed, P(X = k), given their size (trials) and the probability of a single mutation (p = t * u). However I'm interested in finding t: t is what is unknown, so the loop repeatedly evaluates the calculation, increasing t each time and checking P(X=k); when it is 0.05, 0.50 and 0.95, we record t. Ideally I'd like to rearrange this so I can get the probability of a single success (mutation) p, and then divide by the mutation rate to get my t. My supervisor gave me the loopy code, but I imagine there is a way to plug in P(X=k) as 0.05 and 0.95 and get my upper and lower t estimates.

According to the R built-in docs:

    binom.test
    Description: Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment.

Perhaps this is the one I need rather than uniroot?

Best, Ben.

From: Stefan Evert [stefa...@collocations.de]
Sent: 10 October 2013 09:37
To: R-help Mailing List
Cc: Benjamin Ward (ENV)
Subject: Re: [R] Small p from binomial probability function.

Sounds like you want a 95% binomial confidence interval: binom.test(N, P) will compute this for you, and you can get the bounds directly with binom.test(N, P)$conf.int

Actually, binom.test computes a two-sided confidence interval, which corresponds roughly to 2.5 and 97.5 percentages in your approach. It doesn't give you the 50% point either, but I don't think that's a meaningful quantity with a two-sided test.

Hope this helps, Stefan
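For reference, binom.test() takes the number of successes first and the number of trials second. The sketch below uses made-up values (k and n are hypothetical) and a 90% two-sided interval, which puts 5% in each tail. Under that assumption the upper bound is the p at which pbinom(k, n, p) equals 0.05, while the lower bound is defined through P(X >= k), so it corresponds to pbinom(k - 1, n, p) = 0.95 rather than pbinom(k, n, p) = 0.95.

```r
k <- 3      # hypothetical number of observed mutations (successes)
n <- 1000   # hypothetical sequence length (trials)

# Exact (Clopper-Pearson) two-sided interval, 5% in each tail
ci <- binom.test(k, n, conf.level = 0.90)$conf.int

upper_tail <- pbinom(k, n, ci[2])       # cumulative probability at the upper bound
lower_tail <- pbinom(k - 1, n, ci[1])   # cumulative probability at the lower bound
c(upper_tail, lower_tail)
```

The one-index shift in the lower bound is a plausible reason why an interval-based answer and a direct root of pbinom(k, n, p) - p0 agree closely but not exactly.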
[R] Small p from binomial probability function.
Hi,

I got given some code that uses the R function pbinom:

    p <- mut * t
    sumprobs <- pbinom( N, B, p ) * 1000

This gives the output of a probability as a percentage like 5, 50, 95. What the code currently does is find me the values of t I need by using the above two code lines in a loop: each iteration it increases t by one and runs the two lines. When sumprobs equals 5, it records the value t, then again when sumprobs is equal to 50, and again when sumprobs is equal to 95, giving me three t values. This is not an efficient way of doing this if t is large.

Is it possible to rearrange pbinom so it gives me the small p (made of mut*t) as the result of plugging in the sumprobs instead, and is there an R function that already does this? Since pbinom is the binomial probability equation, I suppose the question is, in more mathematical terminology: instead of calculating the probability of N successes given the number of trials and the probability of a single success, can I calculate the probability of a single success using the probability of N successes, the number of trials, and the number of successes? Can R do this for me, so that I plug in 5, 50, and 95, and then get the small p out?

Thanks, Ben W.
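One closed-form route, offered as a sketch rather than as the answer given in the thread, uses the standard identity between the binomial distribution function and the beta distribution: for k < n, pbinom(k, n, p) = 1 - pbeta(p, k + 1, n - k). Inverting it gives p directly via qbeta(). The function name small_p and the values below are assumptions for illustration.

```r
# Invert pbinom(k, n, p) = p0 for p using the beta quantile function.
# Valid for 0 <= k < n and 0 < p0 < 1.
small_p <- function(k, n, p0) {
  qbeta(1 - p0, k + 1, n - k)
}

k <- 3       # hypothetical number of observed mutations
n <- 1000    # hypothetical number of bases (trials)
p0 <- c(0.05, 0.50, 0.95)

p <- small_p(k, n, p0)
check <- pbinom(k, n, p)   # should reproduce p0
```

Dividing p by the mutation rate then gives t without any looping over candidate values.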
[R] Building a Package using gWidgets
Hi,

For the past few months I've been building a simulation in R that I hope to package. It consists of two usable functions, and many internal ones which one of the two usable functions calls while looping, to perform the stages of the simulation. A simple conceptual example is:

    # Abstract representation of the 1st usable function: generates an object containing settings for the simulation.
    settings <- function(){
      matrix(c(1:4), nrow=2, ncol=2)
    }

    # Abstract representation of one of many internal functions: does some action in the simulation.
    int.func <- function(x){
      out <- x*2
      return(out)
    }

    # Abstract representation of the second usable function: takes settings, invokes internal functions, generates results, saves them as R object files.
    sim.func <- function(x){
      ans <- int.func(x)
      ans2 <- ans-2
      save(ans2, file="outtest")
    }

With my package so far, using it is done like so after loading and attaching the package with library():

    INPUT <- settings()
    fix(INPUT) # If you want to change from the defaults.
    sim.func(INPUT)

Nothing needs returning from the simulation because it gets saved as an object to be read in afterwards.

Now I'd like it to have a simple GUI which can be used in conjunction with the command line (think Rcmdr but much simpler), to allow my co-workers who have never touched R to use it. The GUI needs to be able to edit the settings, as with the fix command above, to save a settings object to file, and to read in settings from an object file.

I've built this with gWidgets:

    gui <- function(){
      INPUT <- matrix(c(1:4), nrow=2, ncol=2)
      mainwin <- gwindow("MainWindow")
      button1 <- gbutton("Edit Settings", cont=mainwin, handler=function(h,...){
        fix(INPUT)
        print("Settings Edited")
      })
      button2 <- gbutton("RUN", cont=mainwin, handler=function(h,...){
        sim.func(INPUT)
        print("The run is done")
      })
      savebutton <- gbutton("Write Settings to File", cont=mainwin, handler=function(h,...){
        setfilename <- ginput("Please enter the filename")
        save(INPUT, file=setfilename)
      })
      loadbutton <- gbutton("Load Settings from File", cont=mainwin, handler=function(h,...){
        fname <- gfile(text="Choose a file", type="open", action="print",
                       handler=function(h,...){ do.call(h$action, list(h$file)) })
        load(fname)
      })
    }

Note the job of the settings function from before is now done by the first line of this gui function. I add this to the same R file as the three functions above, add 'gui' to the namespace as an export, set the gWidgets packages as imports, and rebuild; then I library() the package and run gui(). The interface shows up. However, I have a few issues:

The GUI shows up fine, but if I click button1 to edit settings through fix(INPUT), change the values, close the editor and click the button again to see whether the changes have persisted and been stored in INPUT, they have not. The same goes for reading in an object: it does not overwrite the INPUT object generated by default in the first line of function gui().

I think this has something to do with the environments of functions, but I'm not too sure. In the GUI-less version of my package, the user generates the object containing settings, which is in the workspace, and feeds it to the simulation function as an argument. However, since with the GUI version everything is run inside the function gui(), and gWidgets handlers make use of function(h,...), I can't help but feel that environments are the issue here. It's odd that when clicking on button 1, it will find INPUT from the gui() environment, but won't make the changes back there.

Can anybody help out with this and suggest what it is I need to do? Apologies for a long email, but I've tried to explain clearly. The code is reproducible, as is the issue, just by loading gWidgets and gWidgetstcltk with library() and copying and pasting the code. The abstract example I've provided faithfully reproduces the same issues I have with my proper simulation functions, so if I can't get it working, I won't get the real thing working.

Thanks, Ben W. UEA The Sainsbury Laboratory.
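The behaviour described is consistent with how R scoping works: a handler can read INPUT from the enclosing gui() frame, but load() by default creates objects in the frame it is called from (the handler's own), and the documentation for fix() describes assigning the edited object in the user's workspace rather than in gui(). One possible remedy, sketched here without any GUI so that it runs anywhere, is to write back explicitly with `<<-`. The names make_handlers, edit_settings, load_settings and get_settings are hypothetical; in a real handler the multiplication would be replaced by something like INPUT <<- edit(INPUT).

```r
make_handlers <- function() {
  INPUT <- matrix(1:4, nrow = 2, ncol = 2)   # default settings, as in gui()

  list(
    # Stand-in for the "Edit Settings" handler: `<<-` updates the INPUT
    # that lives in make_handlers()'s frame, not a local copy.
    edit_settings = function(h, ...) {
      INPUT <<- INPUT * 10
    },
    # Stand-in for the "Load Settings" handler: load into a scratch
    # environment, then copy the object back to the enclosing INPUT.
    load_settings = function(file) {
      tmp <- new.env()
      load(file, envir = tmp)
      INPUT <<- get("INPUT", envir = tmp)
    },
    get_settings = function() INPUT
  )
}

h <- make_handlers()
h$edit_settings(NULL)
after_edit <- h$get_settings()

# Round trip through a file, as the load button would do
INPUT <- matrix(5:8, nrow = 2, ncol = 2)
f <- tempfile(fileext = ".RData")
save(INPUT, file = f)
h$load_settings(f)
after_load <- h$get_settings()
```

An alternative design with the same effect is to keep the settings in an explicit environment object and have every handler read and write that environment.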
[R] Subscript out of Bounds Lapply
Hi all,

I've written a function for a simulation which, in general terms and without being specific to the simulation scenario, will duplicate or delete columns from a matrix based on two values which determine how many, as a proportion of the matrix. The two values are always between 0.01 and 0.99:

    Dup.Dels <- function(matrix, values){
      p.dup <- ceiling(ncol(matrix)*as.numeric(values[1])) # Determine how many columns to be duplicated.
      p.del <- ceiling(ncol(matrix)*as.numeric(values[2])) # Determine how many columns to be deleted.
      dups <- sample(ncol(matrix), p.dup, replace=FALSE) # Randomly choose columns to be duplicated.
      dels <- sample(ncol(matrix), p.del, replace=FALSE) # Randomly choose columns to be deleted.
      matrix.new <- cbind(matrix.new, matrix[,dups]) # Duplicate columns.
      matrix.new <- matrix.new[,-dels] # Delete columns.
      return(matrix.new)
    }

I have a list of matrices with different numbers of columns, and I apply the function like so (the values are in a matrix). The lapply goes down the matrix list, and down the matrix with the 'values' in, applying the function:

    New.Matrices <- lapply(seq(Matrices), function(i) Dup.Dels(Matrices[[i]], Values[i,]))

Now the weird thing is, if I test this with data and run this function, it appears to work exactly how I want it to, yet when it is called from the simulation loop I get:

    Error: subscript out of bounds

So I open up all my .R files, load all my functions and open the simulation loop, and go through each line, sending it to the R interpreter one by one, function by function. I get to the lapply line above, and it works! For some reason it works when I spoon-feed my simulation loop line by line, but if I call the function to start up the simulation and run it, it will hit that function and tell me the subscript is out of bounds. Commenting out the call to the function in the simulation allows the rest of the simulation to execute perfectly.

Does anybody know why I would get this strange behaviour with this code? Writing it and testing it line by line with some starting data the simulation actually uses, it works fine; even going through the entire sim step by step, it works fine. Closing it all up and running the simulation by function call, however, it hits an error. Running through it with debug() and browser(), it also hits this error at the same point, so something is up with this function. I can't figure it out: as far as I can tell my indexing in the lapply line is correct, both for operating on every element of the list Matrices and for going down the rows of the matrix storing the two values.

Thanks, Ben W.
UEA (ENV) b.w...@uea.ac.uk
The Sainsbury Laboratory ben.w...@sainsbury-laboratory.ac.uk
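The cause of the error is not established in this message, so the following is only a diagnostic sketch. Two things are known to go wrong in code of this shape: Values[i, ] raises "subscript out of bounds" when i exceeds nrow(Values), and a matrix reduced to one column by m[, -dels] silently becomes a plain vector unless drop = FALSE is used, which breaks later matrix indexing. The helper name check_inputs and the toy data are hypothetical.

```r
# Return the indices of list elements that are no longer usable matrices,
# after confirming that Values has a row for every element of Matrices.
check_inputs <- function(Matrices, Values) {
  stopifnot(is.list(Matrices), is.matrix(Values), ncol(Values) >= 2)
  stopifnot(nrow(Values) >= length(Matrices))
  ok <- vapply(Matrices, function(m) is.matrix(m) && ncol(m) >= 1, logical(1))
  which(!ok)
}

good <- matrix(1:6, nrow = 2)             # 2 x 3 matrix
collapsed <- matrix(1:6, nrow = 2)[, -(1:2)]  # deleting 2 of 3 columns drops to a vector
Values <- matrix(c(0.5, 0.5, 0.2, 0.3), ncol = 2)

bad <- check_inputs(list(good, collapsed), Values)
bad  # positions of elements that are not matrices
```

Calling such a check at the top of each simulation iteration would show whether the inputs differ between the interactive run and the full run.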
[R] Remove and add to many matrices in list.
Hi all,

I realised that my last email question and code was probably going to be a bit of an eyesore for some people, and that perhaps the best thing for me to do is to pose the question of what it is I want to achieve, rather than what I've written, if it helps people:

I'm writing a simulation, and during that simulation I have a list of matrices with variable numbers of columns. For every matrix in this list I want to randomly select a proportion of its columns for removal from the matrix, and a proportion for duplication. The proportions of columns to be duplicated or deleted are set out in an n by 2 matrix:

    Delete  Duplicate
    0.99    0.43
    0.34    0.32
    0.54    0.56
    ..

And so on. So for each matrix in the list of matrices, I want to:

* Calculate the number of columns to be deleted or duplicated, something like:
    number.to.delete <- ncol(matrix.list[[I]])*proportionsmatrix[I,1]
    number.to.duplicate <- ncol(matrix.list[[I]])*proportionsmatrix[I,2]
* Then sample the columns to be deleted, and those to be copied, something like:
    which.delete <- sample(ncol(matrix.list[[I]]), number.to.delete, replace=F)
    which.duplicate <- sample(ncol(matrix.list[[I]]), number.to.duplicate, replace=F)
* Then make the new matrices, something like:
    new.matrices <- matrix[,-which.delete]
    new.matrices <- cbind(new.matrices, matrix[,which.duplicate])

From my previous email you'll see I did this by making a function which will do this for one matrix out of the entire list, and then applying the function to the entire list with lapply. This works when I copy and paste the code into R with usable data, but as part of the simulation it fails. This is strange, since I do other similar operations on these matrices without problem, with the same method of indexing. Using debug() and running the sim step by step, the data does not appear to be altered in a way that would affect the function.

Best, Ben.
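The three steps listed above can be sketched as a single function applied over the list. This is an illustration under assumptions: the names dup_del, matrices and proportions are invented, ceiling() is used to turn proportions into whole column counts, and drop = FALSE is added so that a result with one (or zero) remaining columns stays a matrix.

```r
# Delete and duplicate randomly chosen columns of one matrix.
# props[1] = proportion to delete, props[2] = proportion to duplicate.
dup_del <- function(m, props) {
  props <- as.numeric(props)
  number.to.delete <- ceiling(ncol(m) * props[1])
  number.to.duplicate <- ceiling(ncol(m) * props[2])
  which.delete <- sample.int(ncol(m), number.to.delete)
  which.duplicate <- sample.int(ncol(m), number.to.duplicate)
  kept <- m[, -which.delete, drop = FALSE]        # remove the chosen columns
  cbind(kept, m[, which.duplicate, drop = FALSE]) # append copies from the original
}

# Toy inputs: matrices with 10 and 6 columns, one row of proportions each
matrices <- list(matrix(1:20, nrow = 2), matrix(1:12, nrow = 2))
proportions <- matrix(c(0.99, 0.43,
                        0.34, 0.32),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(NULL, c("Delete", "Duplicate")))

new.matrices <- lapply(seq_along(matrices),
                       function(i) dup_del(matrices[[i]], proportions[i, ]))
sapply(new.matrices, ncol)
```

seq_along() is used instead of seq() because seq() on a list of length one that holds a single number would count up to that number rather than return 1.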
[R] NA and Character(0) in List Element
Hi,

This is probably a small query but one I'm struggling with: I have a list in which some elements were NA, and I removed them by doing:

    list2 <- lapply(list, na.omit)

However, this leaves the element there with 'character(0)' in place, as well as attributes, e.g.

    [[978]]
    character(0)
    attr(,"na.action")
    [1] 1
    attr(,"class")
    [1] "omit"

I want to get rid of these elements/positions in the list, since a function is supposed to sample the list for elements (each element is a collection of about 20 numbers).

Thanks, Ben W. UEA (ENV) - b.w...@uea.ac.uk
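One way to drop the empty positions, sketched on a made-up list (lst, cleaned and kept are hypothetical names), is to keep only the elements whose length is greater than zero after na.omit() has been applied.

```r
# Toy list: the second element is a lone NA and becomes character(0)
lst <- list(c("1", "2"), NA_character_, c("3", NA))

cleaned <- lapply(lst, na.omit)

# Keep only elements that still contain something
kept <- cleaned[vapply(cleaned, length, integer(1)) > 0]
length(kept)
```

Filter(length, cleaned) is an equivalent shorter form; if the leftover na.action attributes are unwanted, as.vector() on each element removes them.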
Re: [R] Removal of columns from matrix where all values of the column are identical.
Hi,

I've been trying to work out how this works. I'm still not totally sure, but it seems to me that it subsets the matrix according to whether each column returns all TRUE for the values not being the same when you compare column[-1] with column[-length(column)], essentially sliding the column against itself and making the comparison? It seems that it also removes columns with any repeated values, rather than only columns in which all values are the same:

    testm <- matrix(nrow=5, ncol=5)
    testm[,1] <- c(1,2,3,4,5)
    testm[,2] <- c(3,3,3,3,3)
    testm[,3] <- c(3,3,3,4,3)
    testm[,4] <- c(5,4,3,2,1)
    testm[,5] <- c(1,2,3,4,4)
    testm
    test3 <- testm[,apply(testm,2,function(x) all(c(TRUE,x[-length(x)]!=x[-1])))]
    test3
         [,1] [,2]
    [1,]    1    5
    [2,]    2    4
    [3,]    3    3
    [4,]    4    2
    [5,]    5    1

Thanks, Ben W.

From: arun [smartpink...@yahoo.com]
Sent: 26 January 2013 02:34
To: Benjamin Ward (ENV)
Cc: R help
Subject: Re: [R] Removal of columns from matrix where all values of the column are identical.

Hi,

I guess this should also work:

    Matrix[,apply(Matrix,2,function(x) all(c(TRUE,x[-length(x)]!=x[-1])))]
    #     [,1] [,2] [,3]
    #[1,]    1    5    5
    #[2,]    2    4    1
    #[3,]    3    3    4
    #[4,]    4    2    3
    #[5,]    5    1    2

A.K.
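A check that removes only columns in which every value is the same, and keeps columns that merely contain some repeats, can be sketched by counting distinct values per column. The variable name keep is made up; testm is the test matrix from the reply above.

```r
testm <- matrix(nrow = 5, ncol = 5)
testm[, 1] <- c(1, 2, 3, 4, 5)
testm[, 2] <- c(3, 3, 3, 3, 3)   # constant: should be removed
testm[, 3] <- c(3, 3, 3, 4, 3)   # repeats but not constant: should stay
testm[, 4] <- c(5, 4, 3, 2, 1)
testm[, 5] <- c(1, 2, 3, 4, 4)   # repeats but not constant: should stay

# A column is kept when it has more than one distinct value
keep <- apply(testm, 2, function(x) length(unique(x)) > 1)
result <- testm[, keep, drop = FALSE]
result
```

Indexing with a logical vector also behaves sensibly when no column needs removing, and drop = FALSE keeps the result a matrix even if only one column survives.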
[R] Removal of columns from matrix where all values of the column are identical.
Hi all,

I'd like to write a piece of code which will remove columns from a matrix if the column contains only one value, say every value in the column is a 3:

    Matrix <- matrix(NA, nrow=5, ncol=4)
    Matrix[,1] <- c(1,2,3,4,5)
    Matrix[,2] <- c(3,3,3,3,3)
    Matrix[,3] <- c(5,4,3,2,1)
    Matrix[,4] <- c(5,1,4,3,2)

         [,1] [,2] [,3] [,4]
    [1,]    1    3    5    5
    [2,]    2    3    4    1
    [3,]    3    3    3    4
    [4,]    4    3    2    3
    [5,]    5    3    1    2

What I have written so far is a loop which will see if all values are the same. It is a bit of a hack, since it just checks that all values are equal to the first value of the column; if not, by definition the column cannot contain only one value/variable/character:

    removals <- c()
    for(i in 1:ncol(Matrix)){
      if(all(Matrix[,i] == Matrix[[1,i]])){
        removals <- append(removals, i)
      }
    }
    new.Matrix <- Matrix[,-removals]

This works for matrices with numbers or characters. My question is: is there a better or more efficient way of doing this, maybe with apply or something? My first thought was apply set to operate over all columns, but I was unsure of the indexing and of selecting columns to be deleted.

Thanks, Ben W.
University of East Anglia (ENV): b.w...@uea.ac.uk
The Sainsbury Laboratory: ben.w...@sainsbury-laboratory.ac.uk
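One thing worth checking in the loop version is the case where no column is constant: removals then stays empty, and negative indexing with an empty vector selects no columns at all instead of all of them. The sketch below demonstrates this on a made-up matrix with no constant columns (m, removals and keep are illustrative names) and shows the same first-value test expressed as a logical index.

```r
# A matrix in which no column is constant
m <- cbind(1:5, 5:1)

# Loop-style bookkeeping ends up empty here
removals <- c()
emptied <- m[, -removals, drop = FALSE]
ncol(emptied)   # every column has been dropped

# Same "all equal to the first value" test, used as a logical index
keep <- !apply(m, 2, function(x) all(x == x[1]))
safe <- m[, keep, drop = FALSE]
ncol(safe)      # both columns survive
```

With the logical form there is no special case to handle, because a vector of all TRUE simply keeps everything.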
[R] Pairwise Comparisons
Dear all,

I'm trying to write a function that will take as an argument some aligned genome sequences and, using a sliding window, do pairwise comparisons of sequence similarity. Coding the sliding window I think I can manage, but what I'm trying to get to grips with is making sure every pairwise comparison is made, no matter how many genomes are added, from 3 to N. So if I had four genome sequences, G1, G2, G3, G4, the comparisons would be:

    G1:G1 G1:G2 G1:G3 G1:G4
    G2:G2 G2:G3 G2:G4
    G3:G3 G3:G4
    G4:G4

I can think of a way this might be done with a very complicated loop, which would take the region in the window of each genome and then make all possible combinations/comparisons. The loop would take G1 and compare it in turn against G2, G3, G4. Then it would take G2 and start again, pairing it with everything from G1 to G4; then it would take G3 and compare it with everything from G1 to G4; and finally it would take G4 and compare it with everything from G1 to G4.

This is a wasteful way of doing it, however, because by the time the loop gets around to dealing with G4 as its first argument, i.e. the G4:GN comparisons, all comparisons involving G4 apart from G4:G4 have already been made: G4:G1 is just G1:G4 backwards. So it computes a lot that isn't necessary.

So my question is, how can someone do pairwise comparisons in R this way and ensure all combinations are compared, without being as wasteful as my obvious shotgun approach, which computes many redundant comparisons?

Ben W.
University of East Anglia (ENV): b.w...@uea.ac.uk
The Sainsbury Lab (JIC): ben.w...@sainsbury-lab.ac.uk
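The list of non-redundant comparisons can be generated up front, for example with combn() for the distinct pairs plus the self-comparisons. This is a minimal sketch; the labels and the names pairs, self and all_pairs are placeholders, and the actual similarity calculation is left out.

```r
genomes <- c("G1", "G2", "G3", "G4")

# Every unordered pair of different genomes, one pair per row
pairs <- t(combn(genomes, 2))

# Self-comparisons (G1:G1, G2:G2, ...), if they are wanted
self <- matrix(rep(genomes, 2), ncol = 2)

all_pairs <- rbind(self, pairs)
paste(all_pairs[, 1], all_pairs[, 2], sep = ":")
```

Looping (or using apply) over the rows of all_pairs then visits each comparison exactly once: n * (n + 1) / 2 rows for n genomes, rather than n * n.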
Re: [R] Stuck trying to modify a function
Jean,

Thank you for your suggestion, it has really helped a lot. I ended up changing the expression system in my model. I realised that if the genome has extensive duplication, i.e. repeated values, which my organism of interest is known to evolve, then because of the ux <- unique(c(x, y)) part of the function, duplicate genes within the genome will all mutate in the same way. To test this I changed the gene values to a much narrower range, between 1 and 10, rather than 1. What I found was that all the 7's would mutate negatively by 1, but all the 5's would mutate positively by 1, etc.

I also realised that with extensive duplication, if a duplicated value comes up in the expression data, it is not clear which of the duplicated genes is the one that's expressed, which complicates things. I've changed it so that the expression data is the same length as the genes data, but consists of 1's and 0's for on and off. The function decides how many effectors to express, between 1 and the total number of genes, samples them, and makes a data series the same length as the effector genes with all values set to 0, or off. It then gets the indexes of the genes sampled from the genome to be switched on, and switches the 0's at the corresponding indexes in the expression data to 1.
    Init_Express <- function(x, min.exp=1, max.exp=length(x)) {
      # How many genes to switch on for this individual.
      Number_Expressed <- round(runif(1, min=min.exp, max=max.exp))
      Expressed <- sample(x, Number_Expressed)
      # Start with everything off, then switch on the sampled genes.
      E <- rep(0, length(x))
      pos <- match(Expressed, x)
      E[pos] <- 1
      return(E)
    }
    Expression.State <- lapply(seq(Effectors), function(i) Init_Express(Effectors[[i]]))

Based on how you managed to mutate genes using operations without loops, I thought of how to do this with expression states:

    Change.Expression <- function(x, prob.express=0.5, prob.repress=0.5) {
      old <- x
      # Genes that are off may be switched on.
      low.old <- old[which(old==0)]
      low.new <- low.old + sample(c(0, 1), size=length(low.old), replace=TRUE, prob=c(1-prob.express, 0+prob.express))
      # Genes that are on may be switched off.
      high.old <- old[which(old==1)]
      high.new <- high.old - sample(c(0, 1), size=length(high.old), replace=TRUE, prob=c(1-prob.repress, 0+prob.repress))
      all.new <- old
      all.new[pmatch(low.old, old, dup=F)] <- low.new
      all.new[pmatch(high.old, old, dup=F)] <- high.new
      return(all.new)
    }
    Expression.State <- lapply(seq(Expression.State), function(i) Change.Expression(Expression.State[[i]]))

I've heard a lot of talk about the benefits and drawbacks of loops vs. the apply family of functions, be it speed or tidiness of code. I've been offered use of a high performance cluster in my dept. (not sure how R does on those), but I was still concerned about nested loops because of the number of individuals and then genes per individual. I mostly went with loops because of the indexing issue: controlling going through individuals, and then all the genes in turn. Changing the nested part of the loop in the manner you've demonstrated eliminates that and will be a big help. It's helped me see how I can use lapply with functions that do things in a more vectorised and more R-like manner. I was afraid of the eventuality of having to write some processes in C++ to do them quicker.

Best Wishes,
Ben W.
UEA (ENV) and The Sainsbury Laboratory.
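The two functions above could perhaps be simplified by working on positions rather than values, which avoids matching on duplicated gene values altogether. The following is a sketch only; the names `init_express` and `change_expression` and the use of `rbinom()` are choices made for the illustration, not part of the original post.

```r
# Switch on a random number of genes, chosen by position, so that
# duplicated gene values in x cannot be confused with one another.
init_express <- function(x, min.exp = 1, max.exp = length(x)) {
  n <- round(runif(1, min = min.exp, max = max.exp))
  E <- rep(0, length(x))
  E[sample.int(length(x), n)] <- 1
  E
}

# Flip states with logical indexing instead of pmatch().
change_expression <- function(state, prob.express = 0.5, prob.repress = 0.5) {
  off <- state == 0
  on  <- state == 1
  new <- state
  new[off] <- rbinom(sum(off), 1, prob.express)     # 0 -> 1 with prob.express
  new[on]  <- 1 - rbinom(sum(on), 1, prob.repress)  # 1 -> 0 with prob.repress
  new
}

state <- init_express(c(5, 5, 5, 7))
state <- change_expression(state)
```

Because both functions take a single vector, they drop straight into `lapply()` over a list of individuals.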
From: Jean V Adams [jvad...@usgs.gov]
Sent: 27 November 2012 22:01
To: Benjamin Ward (ENV)
Cc: r-help@r-project.org
Subject: Re: [R] Stuck trying to modify a function

Ben,

You can use the sample() function to randomly add -1, 0, or 1 to each observation, and control for the probability of mutation at the same time. Then you can use the match() function to make sure that any mutations in X are carried through to Y in the same way.

I wrote the function to do each list element separately. So a gene in X[[1]] and Y[[1]] will be mutated in the same way, but the same gene in X[[2]] and Y[[2]] may be mutated in a different way. Not sure if that is what you want.

    Mutate <- function(x, y, prob.mutate=0.9) {
      ux <- unique(c(x, y))
      new.ux <- ux + sample(c(-1, 0, 1), size=length(ux), replace=TRUE, prob=c(prob.mutate/2, 1-prob.mutate, prob.mutate/2))
      new.x <- new.ux[match(x, ux)]
      new.y <- new.ux[match(y, ux)]
      list(xm=new.x, ym=new.y)
    }
    Effectors <- lapply(seq(X), function(i) Mutate(X[[i]], Y[[i]]))

Jean

Benjamin Ward (ENV) b.w...@uea.ac.uk wrote on 11/27/2012 10:45:23 AM:

Hi, I have the following data:

    Path_Number <- 5
    ID.Path <- c(1:Path_Number) # Make vector of ID's.
    No_of_X <- sample(50:550, length(ID.Path), replace=TRUE) #
    X <- split(sample(1:1, sum(No_of_X), replace=TRUE), rep(ID.Path,No_of_X))
    Y <- lapply(X,function(x) sample(x, round(runif(1, min=10, max=50))))

X and Y are both lists, and I've made the following function to work on that data as part of a simulation I'm building:

    Mutate<-function(x){
      l<-0
      for(i in x){
        l2<-0
        l<-l+1
        for(i in x[[l]]){
          l2<-l2+1
          if(runif(1) < 0.9) ifelse(runif(1) < 0.5, x[[l]][l2
[R] Stuck trying to modify a function
Hi,

I have the following data:

    Path_Number <- 5
    ID.Path <- c(1:Path_Number) # Make vector of ID's.
    No_of_X <- sample(50:550, length(ID.Path), replace=TRUE) #
    X <- split(sample(1:1, sum(No_of_X), replace=TRUE), rep(ID.Path, No_of_X))
    Y <- lapply(X,function(x) sample(x, round(runif(1, min=10, max=50))))

X and Y are both lists, and I've made the following function to work on that data as part of a simulation I'm building:

    Mutate<-function(x){
      l<-0
      for(i in x){
        l2<-0
        l<-l+1
        for(i in x[[l]]){
          l2<-l2+1
          if(runif(1) < 0.9) ifelse(runif(1) < 0.5, x[[l]][l2] <- x[[l]][l2]+1, x[[l]][l2] <- x[[l]][l2]-1)
        }
      }
      return(x)
    }

I call this with Effectors<-Mutate(X)

The function is designed to alter the values of each element in X by either + or - 1 (50:50 chance whether + or -). However Y, the elements of which are subsets of the corresponding elements of X, needs to stay consistent, i.e. if a value in X is changed, and that value is part of the Y subset, then the value in Y also needs to be changed. Since Y is a smaller subset, it will not be indexed the same. My idea was to include in the function an if statement that checks whether Y contains the value to be changed, removes it, and then, after the value in X is changed, puts the new value in Y. I attempted this with:

    Mutate<-function(x,y){
      l<-0
      for(i in x){
        l2<-0
        l<-l+1
        for(i in x[[l]]){
          l2<-l2+1
          if(runif(1) < 0.9){
            if(x[[l]][l2] %in% y[[l]] == TRUE){
              y[[l]]<-[which(y[[l]]!=x[[l]][l2])]
              if(runif(1) < 0.5){
                x[[l]][l2] <- x[[l]][l2]+1
                y[[l]]<-append(x[[l]][l2])
              }else{
                x[[l]][l2] <- x[[l]][l2]-1
                y[[l]]<-append(x[[l]][l2])
              }
            }
            ifelse(runif(1) < 0.5, x[[l]][l2] <- x[[l]][l2]+1, x[[l]][l2] <- x[[l]][l2]-1)
          }
        }
      }
      return(list(x,y))
    }

Bit of an eyesore so I've put the altered stuff in bold.

I've basically taken what the ifelse statement does in the first function (which is still there, and runs if Y does not contain the X value being altered) and broken it down into an if and an else segment, with multiple operations in curly braces to accommodate the extra actions needed to alter Y as well as X. This was all I could think of to keep changes between the two in sync. However, it does not work when I try to load the function into the workspace:

    Error: unexpected '}' in "}"

I hope someone can point out what it is I've done that isn't working, or a better way to do this.

Best Wishes,
Ben W.
UEA (ENV) The Sainsbury Laboratory.
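As a small worked check of the approach Jean suggests in her reply to this post (mutating each distinct value once and mapping it back with `match()`), the sketch below runs that idea on toy data. The function name `mutate_pair` and the toy vectors are made up for the illustration, and the check assumes the values in `x` are distinct.

```r
# Mutate each distinct value once, then map the result back onto x and y
# with match(), so the two vectors cannot drift out of sync.
mutate_pair <- function(x, y, prob.mutate = 0.9) {
  ux <- unique(c(x, y))
  step <- sample(c(-1, 0, 1), size = length(ux), replace = TRUE,
                 prob = c(prob.mutate / 2, 1 - prob.mutate, prob.mutate / 2))
  new.ux <- ux + step
  list(xm = new.ux[match(x, ux)], ym = new.ux[match(y, ux)])
}

x <- c(10, 20, 30, 40, 50)  # toy genome (hypothetical values)
y <- c(20, 50)              # toy subset of x
res <- mutate_pair(x, y)

# Each mutated value in y equals the mutated value at its position in x.
in_sync <- identical(res$ym, res$xm[match(y, x)])
```

Note that, by construction, duplicated values all receive the same mutation, which may or may not be what the simulation needs.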
[R] Code works, but not as function.
Hi,

I have some values in a list format generated by the following:

    Path_Number <- 0010
    ID.Path <- formatC(0001:Path_Number, width=4, flag="0") # Make vector of ID's.
    No_of_Effectors <- sample(1:550, length(ID.Path), replace=TRUE) # Define Number of Effectors each individual gets.
    Effectors <- split(sample(1:1, sum(No_of_Effectors), replace=TRUE), rep(ID.Path, No_of_Effectors)) # Generate effectors and dish them out.
    Effectors

And I've written a chunk which is designed to go through each element of the list, and then through each value in the element, and, if a condition is met, change the value by 1 in either direction, higher or lower with equal probability. In this case the condition is runif(1) < 0.3; this is simple, but may be changed later to more complex probability criteria based on mutation data from experiment. Here it is (I've changed the 0.3 value I mentioned to 0.9, so that many values change and it can be easily seen):

    l<-0 # Set counter 1 to 0.
    for(i in Effectors){ # Begin loop on list of effectors.
      l2<-0 # Set counter 2 to 0.
      l <-l+1 # Increase counter number 1.
      for(i in Effectors[[l]]){ # Begin loop through all effector values.
        l2 <-l2+1 # Increase counter number 2.
        if(runif(1) < 0.9) ifelse(runif(1) < 0.5, Effectors[[l]][l2] <- Effectors[[l]][l2]+1, Effectors[[l]][l2] <- Effectors[[l]][l2]-1) # Line which increases or decreases the values in the list element (50/50 chance of increase or decrease), if the first IF statement is satisfied.
      }
    }

Now I don't know if this is the best and most R-ish way of doing this, but it works and I understand it. However I'd like to define a function with this. My attempts so far have been:

    Eff.Mutate<-function(){
      l<-0 # Set counter 1 to 0.
      for(i in Effectors){ # Begin loop on list of effectors.
        l2<-0 # Set counter 2 to 0.
        l <-l+1 # Increase counter number 1.
        for(i in Effectors[[l]]){ # Begin loop through all effector values.
          l2 <-l2+1 # Increase counter number 2.
          if(runif(1) < 0.9) ifelse(runif(1) < 0.5, Effectors[[l]][l2] <- Effectors[[l]][l2]+1, Effectors[[l]][l2] <- Effectors[[l]][l2]-1) # Line which increases or decreases the values in effvec, if the first IF statement is satisfied.
        }
      }
    }

and:

    Eff.Mutate2<-function(x){
      l<-0
      for(i in x){
        l2<-0
        l<-l+1
        for(i in x[[l]]){
          l2<-l2+1
          if(runif(1) < 0.9) ifelse(runif(1) < 0.5, x[[l]][l2] <- x[[l]][l2]+1, x[[l]][l2] <- x[[l]][l2]-1)
        }
      }
    }

However, if I do either Eff.Mutate() or Eff.Mutate2(Effectors), neither seems to work; I've seen no differences in the values in the list elements before and after. I can't figure out why it works as a code chunk, but if I try to make it a function nothing seems to happen. I'm probably going about making it a function wrong.

Thanks,
Ben W.
UEA (ENV) The Sainsbury Laboratory.
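The likely reason nothing appears to change is that an R function works on its own local copy of the list: assignments made with `<-` inside the function do not touch the `Effectors` object in the workspace, and neither attempt returns the modified copy. A minimal sketch of the usual pattern follows: return the new list and assign it on the outside. The name `eff_mutate`, the vectorised body and the toy data are assumptions for the illustration.

```r
# Returns a mutated copy of x; the caller decides where to store it.
eff_mutate <- function(x, prob.mutate = 0.9) {
  lapply(x, function(genes) {
    hit  <- runif(length(genes)) < prob.mutate               # which genes mutate
    step <- sample(c(-1, 1), length(genes), replace = TRUE)  # +1 or -1, 50/50
    genes + hit * step
  })
}

Effectors <- list("0001" = c(5, 8, 2), "0002" = c(7, 7))  # toy data
Mutated <- eff_mutate(Effectors)
```

To update the original object, the result is assigned back to it, e.g. `Effectors <- eff_mutate(Effectors)`.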
Re: [R] Extract cell of many values from dataframe cells and sample from them.
Hi, Thank you for your suggestion, this works a treat. For my understanding and future reference, this would also work for something like 2D matrices of unequal row size? As far as I understand it would not be possible to make a 3D array jagged like this because the rows would need to be of equal number for the array function, yet in a list there is not such requirement, and operations on matrices can target elements in specific matrices by [[,]][,] ? Best Wishes, Ben W. UEA (ENV) The Sainsbury Laboratory. From: Jean V Adams [jvad...@usgs.gov] Sent: 08 November 2012 19:59 To: r-help@r-project.org Cc: Benjamin Ward (ENV) Subject: Re: [R] Extract cell of many values from dataframe cells and sample from them. Ben, I think you would find lists a helpful way to arrange your data. They do not require equal lengths of data in each element. Check out the code below for a smaller version of the example you provided (with only 5 individuals rather than 500). # An alternative way to arrange your data, as a list # Each element of the list is an individual, with all its effector genes ID.unique - formatC(0001:0005, width=4, flag=0) No_of_Effectors - sample(1:550, length(ID.unique), replace=TRUE) Effectors - split(sample(1:1, sum(No_of_Effectors), replace=TRUE), rep(ID.unique, No_of_Effectors)) Effectors # Now take a random sample of effectors from each individual Expressed_Genes - lapply(Effectors, function(x) sample(x, sample(1:length(x), 1))) Expressed_Genes Jean Benjamin Ward (ENV) b.w...@uea.ac.uk wrote on 11/08/2012 10:00:57 AM: Hi, First my apologies for a non-working piece of code in a previous submission, I have corrected this error. I'm doing is individual based modelling of a pathogen and it's host. The way I've thought of doing this is with two dataframes, one of the pathogen and it's genes and effector genes, and one of the host and it's resistance genes. 
During the simulation, these things can be pulled out of the dataframes and operated on, before being stored again in the dataframes. Below is how I've created my dataframe and stored my effector genes. In this model, effector genes are numerical values between 1 and 1. Path_Number - 0500 inds - data.frame(ID=formatC (0001:Path_Number,width=4,flag=0),No_of_Effectors=,No_Expressed_Effectors=) inds$No_of_Effectors - round(as.numeric(lapply(1:nrow (inds),function(x) runif(1, min=1, max=550 Effectors - lapply(1:nrow(inds),function(x) sample(1:1,inds $No_of_Effectors,replace=TRUE)) inds - data.frame(inds,Effectors=as.character(Effectors)) Ind_Genes - strsplit(as.character(inds[1,4]),,) What I'm trying to do is: 1). For each individual (row) in my database, extract the values in the Effectors cell to an object. 2). Sample a number of those values and assign them to a new object called Expressed_Effectors 3). Storing it in the Expressed_Effectors cell, in much the same manner as I stored the Effectors object in the Effectors cell. My example attempt (for the first row/individual in my dataset) is below: (step by step, I didn't put this in a loop until I know it works for 1 row) Extract the values (effector genes) for the first individual, from the Effectors Cell in the dataframe, to Ind_Effectors object. Ind_Effectors - strsplit(as.character(inds[1,4]),,) Randomly dictate how many values (effectors) will be sampled n-round(runif(1, min=10, max=50)) Sample n values (effector genes) from Ind_Effectors, not replacing Expressed_Genes - sample(Ind_Effectors,n,replace=F) If I run this I receive the error: Error in sample(Ind_Effectors, n, replace = F) : cannot take a sample larger than the population when 'replace = FALSE' What I think this means is rather than picking out n values from the whole set of values in Ind_Effectors it's trying to sample the whole lot n times, which it cannot do because replace=F. 
This is not what I need, what I need is n values sampled from Ind_Effectors, not all values from Ind_Effectors sampled n times. I hope this clears up the confusion with what I'm trying to do. It may very well be I'm not instructing R to sample as a require properly. Sadly my previous experience with R amounts to loading in dataframes from experiment and doing stat analysis model fitting, not simulations or individual based models. Best wishes, Ben W. UEA (ENV) The Sainsbury Laboratory. P.S. As an aside I've been thinking about doing this model an alternative way to as I described in the first bit of my email (based on dataframes). Instead I would use a multi-dimentional ragged array(s): The format would be a 2D layout, Where every line is an effector gene and every column an aspect of the effector gene(value, expression state, fitness contribution etc.) This 2D layout of rows and columns is then repeated in the 3rd dimension (the z of x,y,z) of the array for each individual. It is ragged in the sense each
Re: [R] sample from list
Hi, Thanks, for the reply. I should explain more, I'll be as brief as I can, the code for generating the dataframe is below. What I'm doing is individual based modelling of a pathogen and it's host. The way I've thought of doing this is with two dataframes, one of the pathogen and it's genes and effectors, and one of the host and it's resistance genes. During the processes of the model these things can be pulled out of the dataframes and operated on, before being stored again in the dataframes. I have generated my dataset as below, it was suggested by arun in a reply to a previous email I wrote with the subject Trouble with data structures. Path_Number - 0500 # The number of pathogen individuals in the population. # Create the initial dataframe, with initial number of effectors and initial number of expressed effectors. inds -data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors=,No_Expressed_Effectors=) # Generate the number of effectors genes each individual has. inds$No_of_Effectors - round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550 # Generate the actual efector genes. Effectors - lapply(1:nrow(inds),function(x) sample(1:1,inds$No_of_Effectors,replace=TRUE)) #Add them to the dataframe inds - data.frame(inds,Effectors=as.character(Effectors)) What I'm trying to do is for each individual, extract the values in the Effector genes cell to an object. As far as I can tell, Ind_Genes-strsplit(as.character(inds2[1,4]),,) Will do this for the first individual or I can get all of them with All_Genes-strsplit(as.character(inds2[,4]),,) What I then want to do is according to a generated number for each individual... round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50 ... sample that many genes from Ind_Genes and make a new object called Expressed_Genes, which can be stored in the dataframe. 
My attempt at doing this is: Expressed_Genes-lapply(First_Ind_Genes,function(x) sample(First_Ind_Genes,round(runif(1, min=10, max=50)),replace=F)) to get Expressed genes for each individual, this might be part of a for loop, or to the whole list of every individuals genes like so: Expressed_Genes-lapply(All_Genes,function(x) sample(All_Genes,3,replace=F)) What usually happens however is I get errors: Error in sample(First_Ind_Genes, round(runif(1, min = 10, max = 50)), : cannot take a sample larger than the population when 'replace = FALSE' or it will rather than sample 3 values, sample all the values, 3 times if I allow replacement (which I don't want). So it's not sampling 3 values for me, but the whole lot of values 3 times. I do not know of another way to extract these gene values and then do things with them. For my model it is essential I can pull the genes or expressed genes out of the dataframe, work functions or operations on them and then store them back again. For example if an individual turns a gene on that was not before, then the genes would need to be pulled from the database, as would the expressed genes, and a random value from the genes object added to the expressed genes object, and then they could both be put back. A similar thing would happen when I wanted to mutate the genes. In short my aim is pull genes or expressed genes out, work functions or operations on them and then store them back again. Hopefully I've explained better, I have been thinking of changing my approach from datasets of pathogen and host from which values are pulled to objects and operated on to a multi-dimentional ragged arrays. I've been told this may be more simple for me. Where every line is an effector gene and there can be columns for the gene value, expression state (1 or 0/T or F), fitness contribution etc. This 2D layout of rows and columns is then repeated in the z dimension of the array for each individual. 
It is ragged in the sense each individual, each slice through the array in the z direction, would have different numbers of rows - different numbers of effectors. I can then simulate mutations by changing the gene values, cause duplications by adding rows of duplicated genes, or even cause deletions by removing rows. Once I have this set up for the pathogen I may make a similar array for the host plants, then perhaps with indexing or some such thing I can write functions to do the interactions and immunology and such. Best, Ben W. UEA (ENV) The Sainsbury Laboratory. From: Jean V Adams [jvad...@usgs.gov] Sent: 07 November 2012 21:12 To: Benjamin Ward (ENV) Cc: r-help@r-project.org Subject: Re: [R] sample from list Ben, Can you provide a small example data set for inds so that we can run the code you have supplied? It's difficult for me to follow what you've got and where you're trying to go. Jean Benjamin Ward (ENV) b.w...@uea.ac.uk wrote on 11/06/2012 03:29:52 PM: Hi all, I have a list of genes present in 500 individuals
[R] Extract cell of many values from dataframe cells and sample from them.
Hi, First my apologies for a non-working piece of code in a previous submission, I have corrected this error. I'm doing is individual based modelling of a pathogen and it's host. The way I've thought of doing this is with two dataframes, one of the pathogen and it's genes and effector genes, and one of the host and it's resistance genes. During the simulation, these things can be pulled out of the dataframes and operated on, before being stored again in the dataframes. Below is how I've created my dataframe and stored my effector genes. In this model, effector genes are numerical values between 1 and 1. Path_Number - 0500 inds - data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors=,No_Expressed_Effectors=) inds$No_of_Effectors - round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550 Effectors - lapply(1:nrow(inds),function(x) sample(1:1,inds$No_of_Effectors,replace=TRUE)) inds - data.frame(inds,Effectors=as.character(Effectors)) Ind_Genes - strsplit(as.character(inds[1,4]),,) What I'm trying to do is: 1). For each individual (row) in my database, extract the values in the Effectors cell to an object. 2). Sample a number of those values and assign them to a new object called Expressed_Effectors 3). Storing it in the Expressed_Effectors cell, in much the same manner as I stored the Effectors object in the Effectors cell. My example attempt (for the first row/individual in my dataset) is below: (step by step, I didn't put this in a loop until I know it works for 1 row) Extract the values (effector genes) for the first individual, from the Effectors Cell in the dataframe, to Ind_Effectors object. 
Ind_Effectors - strsplit(as.character(inds[1,4]),,) Randomly dictate how many values (effectors) will be sampled n-round(runif(1, min=10, max=50)) Sample n values (effector genes) from Ind_Effectors, not replacing Expressed_Genes - sample(Ind_Effectors,n,replace=F) If I run this I receive the error: Error in sample(Ind_Effectors, n, replace = F) : cannot take a sample larger than the population when 'replace = FALSE' What I think this means is rather than picking out n values from the whole set of values in Ind_Effectors it's trying to sample the whole lot n times, which it cannot do because replace=F. This is not what I need, what I need is n values sampled from Ind_Effectors, not all values from Ind_Effectors sampled n times. I hope this clears up the confusion with what I'm trying to do. It may very well be I'm not instructing R to sample as a require properly. Sadly my previous experience with R amounts to loading in dataframes from experiment and doing stat analysis model fitting, not simulations or individual based models. Best wishes, Ben W. UEA (ENV) The Sainsbury Laboratory. P.S. As an aside I've been thinking about doing this model an alternative way to as I described in the first bit of my email (based on dataframes). Instead I would use a multi-dimentional ragged array(s): The format would be a 2D layout, Where every line is an effector gene and every column an aspect of the effector gene(value, expression state, fitness contribution etc.) This 2D layout of rows and columns is then repeated in the 3rd dimension (the z of x,y,z) of the array for each individual. It is ragged in the sense each individual, each slice through the array in the z direction, would have different numbers of rows - different numbers of effectors. This may be easier to work on, but I've not worked with multidimensional arrays, I'm used to data in dataframes (usually from spreadsheets from experiments). 
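A likely explanation for the error above is that `strsplit()` always returns a list, here a list of length one whose single element holds all the pieces. `sample()` then treats that one element as the whole population, so any n greater than 1 fails without replacement. The sketch below illustrates this with a made-up cell; the string format in `cell` is an assumption, and the real cell contents may need extra cleaning first.

```r
cell <- "12,7,93,41,5,66"      # hypothetical contents of one Effectors cell
pieces <- strsplit(cell, ",")  # a list of length 1, not a vector of 6 values

# Sampling from the list itself fails: its population size is 1.
attempt <- try(sample(pieces, 3, replace = FALSE), silent = TRUE)

# Pull out the first (only) element and convert it before sampling.
genes <- as.numeric(pieces[[1]])
expressed <- sample(genes, 3, replace = FALSE)
```

`unlist(pieces)` would work in place of `pieces[[1]]` here, since there is only one element.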
From: Jean V Adams [jvad...@usgs.gov] Sent: 08 November 2012 13:35 To: Benjamin Ward (ENV) Cc: r-help@r-project.org Subject: RE: [R] sample from list Ben, You have still not supplied reproducible code for me (and any other r-help reader) to run, which makes it very difficult to help you. I can run your first 5 lines of code with no problem. Path_Number - 0500 inds -data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors=,No_Expressed_Effectors=) inds$No_of_Effectors - round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550 Effectors - lapply(1:nrow(inds),function(x) sample(1:1,inds$No_of_Effectors,replace=TRUE)) inds - data.frame(inds,Effectors=as.character(Effectors)) But your 6th line of code doesn't work ... there is no object inds2. Ind_Genes-strsplit(as.character(inds2[1,4]),,) If I use code that you provided in your earlier e-mail to create inds2, I get errors because inds doesn't have a variable No_of_Genes. Genes - lapply(1:nrow(inds),function(x) sample(1:1,inds$No_of_Genes,replace=TRUE)) inds2 - data.frame(inds, Genes=I(Genes)) inds2$No_Expressed_Genes - round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50 So, before you hit the send button on your next e-mail. Start a clean R session with none of your objects
[R] sample from list
Hi all,

I have a list of genes present in 500 individuals; the individuals are the elements:

    Genes <- lapply(1:nrow(inds),function(x) sample(1:1,inds$No_of_Genes,replace=TRUE))

(This was later written to a dataframe as well as kept as the list object: inds2 <- data.frame(inds,Genes=I(Genes)))

I also have a vector of how many of those genes are expressed in the individuals. This can also be kept as a vector object or written to a data frame:

    inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))

I want to create another list which consists of each individual's expressed genes, essentially a subset of the total genes the individual has in the Genes list, by sampling from the Genes list, for each individual, the number of genes (values) given in the Num_Expressed_Genes vector. I.e. if Num_Expressed_Genes = 3, then sample 3 values from the element in the Genes list. I can't quite figure it out though. So far I have the following:

    #Defines The number of expressed genes for each individual in my data frame.
    Num_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))

    #My attempts to apply the sample function to every element (individual organism) of the Genes list, to subset the genes expressed.
    Expressed_Genes <- lapply(1:nrow(inds),function(x) sample(Genes,Num_Expressed_Genes, replace=FALSE))
    Expressed_Genes <- lapply(Genes,function(x) sample(Genes,Num_Expressed_Genes, replace=FALSE))

So far though I'm getting results like this:

    [[49]]
    [[49]][[1]]
    [1] 3540 27 5344 7278 9758 8077 ... [217]

    [[49]][[2]]
    [1] 740 3362 8588 8574 4371 1447 .. [340]

When what I need is more like:

    [[49]]
    [1] 6070 1106 6275

in a case where Num_Expressed_Genes = 3 and the values are taken from the much larger set of values for element (individual) 49 in my Genes list.

I'm not sure what I'm doing wrong, but it seems that instead of picking out a few values according to the Num_Expressed_Genes vector (as an example, say 3 again), it's drawing a large number of values, if not all of them, from elements in the list, 3 times.

Any help is greatly appreciated. I've thought of using loops to achieve the same task, but I'm trying to get my individual/genes/expressed genes data.frame set up for my individual based model and get it running using vectors and as few loops as possible.

Thanks,
Ben.
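One thing that stands out in both attempts is that the anonymous function samples from `Genes` (the whole list) rather than from its own argument `x`, so whole individuals get drawn instead of single gene values. A sketch of one way to pair each individual with its own sample size uses base R's `Map()`; the toy `Genes` list and counts below are made up for the illustration.

```r
# Toy data: three individuals with different numbers of genes.
Genes <- list(c(3540, 27, 5344, 7278),
              c(740, 3362, 8588, 8574, 4371),
              c(6070, 1106, 6275))
Num_Expressed_Genes <- c(2, 3, 1)

# Map() walks the two inputs in parallel: g is one individual's genes,
# n is that individual's number of expressed genes.
Expressed_Genes <- Map(function(g, n) g[sample.int(length(g), n)],
                       Genes, Num_Expressed_Genes)
```

Sampling positions with `sample.int()` rather than calling `sample(g, n)` avoids the surprise where `sample()` treats a single number as the range `1:g`.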
[R] FW: Having some Trouble Data Structures
From: Benjamin Ward (ENV) Sent: 03 November 2012 13:29 To: Jeff Newmiller; r-help@r-project.org Subject: RE: [R] Having some Trouble Data Structures Hi, Thank you very much for your reply - how you prefer, is how my supervisor implemented the layout in Minitab, however I was unsure of how to get R to do this repeating ID behaviour and how to know that in a for loop going through individual 1 to say 10, I want it to: Randomly sample a number from a distribution for the number of effectors (I can do this but with runif), Then put one value in a cell of the Effector column and repeat the ID for each effector row. I'm also then left wondering when I do for loops then that use ID, will it go and apply operations row by row, or ID by ID - for example in the immunology part I would need a loop to check individual by individual if any of the effectors it has means death in the host, in which case all instances of - say ID 1 would need to be deleted. Would you be able to provide an example chunk of how you accomplish this with your preferred approach, if you have the time? Thanks, Ben W. From: Jeff Newmiller [jdnew...@dcn.davis.ca.us] Sent: 28 October 2012 15:27 To: Benjamin Ward (ENV); r-help@r-project.org Subject: Re: [R] Having some Trouble Data Structures Search on ragged array. My preferred approach is to use a data frame with one row per effector that repeats the per-ID information. If that occupies too much memory, you can setup another data frame with one row per ID and refer to that information as using lapply and subset the effectors data as needed. The plyr package is also useful for such processing. --- Jeff NewmillerThe . . Go Live... DCN:jdnew...@dcn.davis.ca.usBasics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/BatteriesO.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --- Sent from my phone. Please excuse my brevity. 
Benjamin Ward (ENV) b.w...@uea.ac.uk wrote: Hi All, I'm trying to run a simulation of host-pathogen evolution based around individuals. What I need to have is a dataframe or table of some description - describing all the individuals of a pathogen population (so far I've implemented this as a matrix): ID No_of_Effectors Effectors (Sequences) [1,] 0001 3 ## 3 Random Numbers ## There will be many such rows for many individuals. They have something called effectors, the number of which is randomly generated, so say you get 3 in the No_of_Effectors column. Then I make R generate 3 numbers from between 1 and 10,000, this gives me three numerical representations of genes. These numbers will be compared to a similar data structure of the host individuals who have their immune genes with similar numbers. My problem is that obviously I can't stick 3 numbers in one cell of the matrix (I've tried) : Pathogen_Individuals[1,3] - c(2,3,4) Error in Pathogen_Individuals[1, 3] - c(345, 567, 678) : number of items to replace is not a multiple of replacement length In future I'm also going to have more variables such as whether a gene is expressed. Such information may require a matrix in itself - something like: Effector ID Sequence Expressed? [1,] 0001 345,567,678 1 (or 0). Is there a way then I can put more than one value in the cell like a list of values, or a way to put objects in a cell of a data frame, matrix or table etc. Almost an inception deal - data structures nested in a data structure? If I search for things like insert list into matrix I get results like how to turn one into another, which is not what I think I need to be doing. 
I have been considering having several data structures not nested in each other, something like for every individual create a new matrix object with the name Effectors_[Individual_ID] and some how get my simulation loops operating on those objects but I find it hard to see how to tell R all of those matrices are to be included in an operation, as you can all lines of a data frame for example with for loops. This is strange for me because this model was written in a macro-code for another program which handles data in a different format and layout to R. My problem is I think, each individual in the model has many variables - in this case representations of genes. So I'm having trouble getting my head about this. Hopefully someone more experienced will be able to offer advice or a solution, it will be very appreciated. Many Thanks, Ben Ward (ENV, UEA The Sainsbury Lab, JIC). P.S. I have searched previous queries
[R] Having some Trouble Data Structures
Hi All,

I'm trying to run an individual-based simulation of host-pathogen evolution. I need a data frame or table of some description that describes all the individuals of a pathogen population (so far I've implemented this as a matrix):

         ID    No_of_Effectors  Effectors (Sequences)
    [1,] 0001  3                ## 3 random numbers ##

There will be many such rows, one per individual. Each individual has a randomly generated number of effectors, say 3 in the No_of_Effectors column. I then make R generate 3 numbers between 1 and 10,000, which gives me three numerical representations of genes. These numbers will be compared with a similar data structure for the host individuals, whose immune genes are represented by similar numbers.

My problem is that I can't put 3 numbers in one cell of the matrix (I've tried):

    Pathogen_Individuals[1, 3] <- c(345, 567, 678)
    Error in Pathogen_Individuals[1, 3] <- c(345, 567, 678) :
      number of items to replace is not a multiple of replacement length

In future I'm also going to have more variables, such as whether a gene is expressed. Such information may require a matrix in itself, something like:

         Effector ID  Sequence     Expressed?
    [1,] 0001         345,567,678  1 (or 0)

Is there a way to put more than one value in a cell, like a list of values, or a way to put objects in a cell of a data frame, matrix or table? Almost an inception deal: data structures nested in a data structure. If I search for things like "insert list into matrix", I get results about how to turn one into the other, which is not what I think I need to be doing.
I have been considering having several data structures that are not nested in each other: for every individual, create a new matrix object with the name Effectors_[Individual_ID], and somehow get my simulation loops operating on those objects. But I find it hard to see how to tell R that all of those matrices are to be included in an operation, as you can with all the rows of a data frame using for loops. This is strange for me because the model was originally written in a macro language for another program, which handles data in a different format and layout from R. My problem, I think, is that each individual in the model has many variables, in this case representations of genes, so I'm having trouble getting my head around this.

Hopefully someone more experienced will be able to offer advice or a solution; it will be very much appreciated.

Many Thanks,
Ben Ward (ENV, UEA; The Sainsbury Lab, JIC).

P.S. I have searched previous queries to the list, and I'm not sure, but this may be useful or relevant:

Have you thought of using a list?

    > a <- matrix(1:10, nrow=2)
    > b <- 1:5
    > x <- list(a=a, b=b)
    > x
    $a
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    3    5    7    9
    [2,]    2    4    6    8   10

    $b
    [1] 1 2 3 4 5

    > x$a
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    3    5    7    9
    [2,]    2    4    6    8   10
    > x$b
    [1] 1 2 3 4 5

oliveoil and yarn datasets have been mentioned.

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
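Following the list suggestion above, one possible way to hold a variable number of effectors per individual is a list column in a data frame. The sketch below is only an illustration under assumptions: the object names (pathogens, host_genes, effector_tables), the population size, the range of effector counts and the 0.5 expression probability are all made up for the example and do not come from the original model.

```r
# Minimal sketch; all names and numbers are hypothetical, for illustration only.
set.seed(1)  # only so the sketch is repeatable

n_individuals <- 3
n_effectors <- sample(1:5, n_individuals, replace = TRUE)

pathogens <- data.frame(
  ID = sprintf("%04d", seq_len(n_individuals)),
  No_of_Effectors = n_effectors,
  stringsAsFactors = FALSE
)

# A list column: each "cell" holds a whole vector of effector sequences.
pathogens$Effectors <- lapply(n_effectors, function(n) sample(1:10000, n))

# A single cell is extracted with [[ ]] and is an ordinary integer vector.
first_effectors <- pathogens$Effectors[[1]]

# Operate on all individuals at once, e.g. count the effectors that
# match a (hypothetical) set of host immune genes.
host_genes <- c(345, 567, 678)
shared <- vapply(pathogens$Effectors,
                 function(e) sum(e %in% host_genes),
                 integer(1))

# If each effector needs its own attributes (such as an expression flag),
# a named list of small data frames, one per individual, is another option.
effector_tables <- lapply(pathogens$Effectors, function(e) {
  data.frame(Sequence = e, Expressed = rbinom(length(e), 1, 0.5))
})
names(effector_tables) <- pathogens$ID
```

Because everything lives in one list (or one list column), lapply() or vapply() can visit every individual without having to create and look up separately named Effectors_[Individual_ID] objects.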
[R] RExcel
Hello,

I am a Graduate Assistant for an instructor who has written programs for statistics calculations such as binomial distributions and regressions. The programs worked with no problem in Excel 2003. Now we are trying to use them with Excel 2007, and we are having some trouble. I have downloaded RandFriends and run the binomial distribution process in Excel 2007, and I received an error that says:

    Compile error in hidden module: UFDBinomial

However, there are two demo Excel files in the RExcel folder called RDemoDens. When I open the first RDemoDens file, I can run the processes and they work fine. When I run the second RDemoDens file, or a blank Excel 2007 file, the processes do not work and I get the error message.

I am trying to figure out what is different about the first RDemoDens file that allows the calculations to process correctly. I suspect that something in the macro library of the demo is different from what is in a blank Excel document, but I cannot figure out what it is.

One thing I did notice is that there are two different RExcel files in the RExcel folder. One is labeled RExcel and the other is labeled RExcel 2007. What is the difference between these two files? I am not sure whether this has anything to do with the problem, but perhaps the demo in which our calculations work uses the correct RExcel file, while a regular Excel 2007 document does not call the correct one.

If anyone has an idea about what might be happening here, or who else I could ask about the situation, I would appreciate any input.

Thanks,
Ben Ward