[R] Query about RODBC to access MySQL from Windows
Hi

I am trying to use RODBC in R installed on Windows to access a MySQL database (on a Linux box). I set up a DSN and specified this DSN in R as follows:

    library(RODBC)
    channel <- odbcConnect("mysqldsn")

    RODBC Connection 5
    Details:
      case=nochange
      PORT=3306

Although this seems to connect properly, running any command yields no results. For example, sqlQuery(channel, "show tables") returns 0 rows when there are close to 500 tables in the database. The same happens with any other query: it does not cause an error, but it returns 0 rows.

The user DSN mysqldsn is set up as follows:

    host             : zion.xxx.xxx.xxx
    default database : default_db
    port             : 3306
    username         : uname
    password         : pwd

Running "use default_db; show tables;" from the command prompt on the db server returns 500 rows.

I see this problem with every query. Running "select * from tname limit 100" returns 0 rows, whereas tname has around a million records.

In the past, I have used MySQL clients for Windows to access the database without encountering any such problem. I even tried setting up mysqldsn as a system DSN instead of a user DSN.

I would like to know
a) whether this is a permissions issue at some level
b) whether there is any solution for this problem in R

Thanks
Lalitha

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[R] Query about finding correlations
Hi

I have a data frame with 3 columns of numeric data, A, B and C, each of which has been obtained independently of the others. We are trying to find out which of A or B causes C. That is, we are hypothesising that C is the effect and that either A or B, but not both, is the cause, i.e. A causes C and this cause-effect relationship explains B. The data for A contains more noise than that for B. We are working with around 1000 points.

I would greatly appreciate any input on the best statistical approach to tackle this problem. I am thinking that we can find correlation coefficients between A and C, and between B and C, but I am not sure this answers the question. Also, we do not know whether the correlation between them is linear or nonlinear.

Thanks
Lalitha
Re: [R] Query about finding correlations
Hi

This is not a homework assignment :) My manager and I are trying to understand the problem better. In the meantime, we thought we would post the problem on this forum to seek some input from statisticians who possibly do this kind of analysis every day and hence are possibly more proficient with R and/or any recommended methodologies.

Lalitha

On 5/2/07, Stefan Grosse [EMAIL PROTECTED] wrote:

How about doing your homework yourselves?

lalitha viswanath wrote:

Hi

I have a data frame with 3 columns of numeric data, A, B and C, each of which has been obtained independently of the others. We are trying to find out which of A or B causes C. That is, we are hypothesising that C is the effect and that either A or B, but not both, is the cause, i.e. A causes C and this cause-effect relationship explains B. The data for A contains more noise than that for B. We are working with around 1000 points.

I would greatly appreciate any input on the best statistical approach to tackle this problem. I am thinking that we can find correlation coefficients between A and C, and between B and C, but I am not sure this answers the question. Also, we do not know whether the correlation between them is linear or nonlinear.

Thanks
Lalitha
Re: [R] Query about finding correlations
Hi

Thanks for your input. I stand corrected. The causation is not linear. We wish to find out which is the cause and under what circumstances, i.e. at what points along a scale for C A is the cause, and when B becomes the cause, if at all.

As a crude analysis, we assumed that the above is not the case (i.e. either A causes C all the time or B causes C all the time) and obtained correlation coefficients using lmFit. However, as you rightly mentioned, it is not of much help to us.

We are trying to find out whether the ages of proteins (A) or their rates of evolution (B) influence parameter C. There is an obvious correlation between A and B, which also needs to be consistent with the hypothesis.

I am presently checking out wald.test(eba), HypothesisTesting(fBasics), O8.Tests and O6.LinearModels(limma), amongst others.

Thanks
Lalitha

On 5/2/07, Alberto Monteiro [EMAIL PROTECTED] wrote:

Lalitha Viswanath wrote:

We are trying to find out which of A or B causes C, i.e. we are hypothesising that C is the effect and either A or B, not both, is the cause. (...) I would greatly appreciate any input on the best statistical approach to tackle this problem. I am thinking that we can find correlation coefficients between A and C, and between B and C, but I am not sure this answers the question. Also, we do not know whether the correlation between them is linear or nonlinear.

If the causation (not the correlation) is not linear, then the correlation (which is linear, always) may not be the best indicator. Take, as an extreme case, this:

    A <- (-50:50) + 100 * rnorm(101)
    B <- abs((-50):50) + 10 * rnorm(101)
    C <- A^2 / 50 + rnorm(101)
    cor(A, C)
    cor(B, C)

A is obviously the cause of C, but B (in some cases) is better correlated to C than A to C.

Alberto Monteiro
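The point of the example above can be made concrete with a small sketch. It reuses the simulated A and C from the reply and compares the linear correlation of C with A against the correlation of C with A squared. The seed and the variable names linear_cor and quadratic_cor are assumptions added for illustration, and the quadratic form is known here only because the data are simulated.

```r
set.seed(1)  # assumed seed, only so the run is reproducible

# Simulated data as in the reply: C depends on A through A^2
A <- (-50:50) + 100 * rnorm(101)
C <- A^2 / 50 + rnorm(101)

linear_cor    <- cor(A, C)    # Pearson correlation with A itself
quadratic_cor <- cor(A^2, C)  # Pearson correlation with the transformed A

print(c(linear = linear_cor, quadratic = quadratic_cor))
```

With real data the functional form is unknown, so a rank-based measure such as cor(A, C, method = "spearman") or a scatter plot may be a reasonable first look, although neither establishes causation.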
[R] Query about using rowSums/ColSums on table results
Hi

I have data of the form

    class  age
    A      0.5
    B      0.4
    A      0.5
    C      0.785
    D      0.535
    A      0.005
    C      0.015
    D      0.205
    A      0.605
    etc etc...

I tabulated the above as

    tab <- table(data$class, cut(data$age, seq(0, 0.6, 0.02)))

I wish to view the results in individual bins as a percentage of the points in each bin. So I tried

    tab/colSums(tab)

However, that yields Inf as a return value in places where clearly the result should be a non-zero value.

Is there an alternative way to get the results in each bin as percentages of the total points in that age bin?

Thanks
Lalitha
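A likely explanation, offered as a sketch rather than a diagnosis of the poster's data: tab/colSums(tab) recycles the vector of column totals down the rows of the table, so cells end up divided by the wrong total, and a zero total in the wrong place gives Inf. prop.table() with margin = 2 divides each column by its own total. The data frame d and the wider 0.2 bins below are assumptions chosen to keep the output small.

```r
# Toy data in the shape described in the post (values taken from the post)
d <- data.frame(
  class = c("A", "B", "A", "C", "D", "A", "C", "D", "A"),
  age   = c(0.5, 0.4, 0.5, 0.785, 0.535, 0.005, 0.015, 0.205, 0.605)
)

# Wider bins than in the post, purely to keep the table readable
bins <- cut(d$age, breaks = seq(0, 0.8, by = 0.2))
tab  <- table(d$class, bins)

# Column-wise percentages: each age bin sums to 100
pct <- 100 * prop.table(tab, margin = 2)
print(round(pct, 1))
```

Bins that contain no points have a column total of zero and show up as NaN, which is easier to spot than a misplaced Inf. sweep(tab, 2, colSums(tab), "/") is an equivalent formulation.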
[R] Query about substituting characters in a df
Hi

I have a data frame with 40,000 rows and 4 columns, one of which is class. For each row, the class column can be one of 10 possible NUMERIC values. I wish to substitute these numeric values with words/characters. For example, I wish to substitute all occurrences of 5467 in the column class with alpha, 7867 with gamma, etc.

I looked up substitute, but did not find any relevant examples.

Your input is greatly appreciated.

Thanks
Lalitha
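One common way to do this kind of recoding is a named lookup vector indexed by the codes converted to character. The sketch below assumes a small made-up data frame df; the code 1234 and the label "delta" are hypothetical, added only so the example has a third value.

```r
# Made-up data: 'class' holds numeric codes as in the post
df <- data.frame(id = 1:5, class = c(5467, 7867, 5467, 1234, 7867))

# Lookup vector: names are the codes, values are the replacement words
lookup <- c("5467" = "alpha", "7867" = "gamma", "1234" = "delta")

# Index the lookup by the codes (as character) to translate every row at once
df$class_name <- unname(lookup[as.character(df$class)])
print(df)
```

Codes missing from the lookup come back as NA, which makes unmapped values easy to find. factor(df$class, levels = ..., labels = ...) is an alternative if a factor is wanted.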
[R] Query about using setdiff
Hi

I have two data frames:

    names(DF1) = c("id", "val1", "val2")
    names(DF2) = c("id2")

The ids in DF2 are a strict subset of those in DF1.

How can I extract the entries from DF1 whose id is NOT IN DF2? I tried setdiff(DF1, DF2), setdiff(DF1$id, DF2$id), etc. Although the latter eliminates the ids as required, I don't know how to extract val1 and val2 for the resulting set.

Thanks
Lalitha
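A minimal sketch of the usual idiom: build a logical vector with %in%, negate it, and use it to index the rows of DF1, which keeps val1 and val2 alongside the ids. The contents of DF1 and DF2 below are made up; only the column names come from the post.

```r
# Made-up data using the column names from the post
DF1 <- data.frame(id = 1:5,
                  val1 = c(10, 20, 30, 40, 50),
                  val2 = c("a", "b", "c", "d", "e"))
DF2 <- data.frame(id2 = c(2, 4))

# Keep the rows of DF1 whose id does not appear in DF2$id2
rest <- DF1[!(DF1$id %in% DF2$id2), ]
print(rest)
```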
[R] Query about data manipulation
Hi

Thanks much for the prompt response to my earlier enquiry on packages for regression analyses. Along the same topic(?), I have another question about which I could use some input.

I am retrieving data from a MySQL database using RODBC. The table has many BLOB columns and each BLOB column has data in the format

    id1 \t id2 \t measure \n id3 \t id4 \t measure

(i.e. multiple rows compressed as one long string).

I am retrieving them as follows:

    dataFromDB <- sqlQuery(channel, "select uncompress(columnName) from tableName")

I am looking for ways to convert this long string into a table/data frame in R, making it easier for further post-processing, without reading/writing it to a file first.

By doing write.table and reading it in again, I got the result in a data frame, with the \t and \n interpreted correctly. However, I wish to sidestep this as I need to carry out this analysis for over 4 million such entries. I tried

    write.table(dataFromDB, file = FileName)
    dataFromFile <- read.table(FileName, sep = "\t")

dataFromFile is of the form

    92_8_nmenA  993_7_mpul  1.042444
    92_8_nmenA  3_5_cpneuA  0.900939
    190_1_rpxx  34_4_ctraM  0.822532
    190_1_rpxx  781_6_pmul  0.870016

Your input on the above is greatly appreciated.

Thanks
Lalitha
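One way to avoid the round trip through a file, sketched under the assumption that each uncompressed BLOB arrives in R as a single character string: read.table() accepts the string directly through its text argument. The variable blob stands in for one value such as dataFromDB[1, 1], and the column names id1, id2 and measure are assumptions.

```r
# Stand-in for one uncompressed BLOB value (rows separated by \n, fields by \t)
blob <- paste("92_8_nmenA\t993_7_mpul\t1.042444",
              "92_8_nmenA\t3_5_cpneuA\t0.900939",
              "190_1_rpxx\t34_4_ctraM\t0.822532",
              sep = "\n")

# Parse the string in memory; no temporary file is written
parsed <- read.table(text = blob, sep = "\t",
                     col.names = c("id1", "id2", "measure"),
                     stringsAsFactors = FALSE)
print(parsed)
```

If the query returns a factor column, wrapping the value in as.character() first is needed.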
Re: [R] Query about data manipulation
Hi

Thanks much for that input. It was extremely helpful.

I am seeking some input about another stumbling block using RODBC, sqlQuery et al. with large BLOB values.

Although the following query

    dataFromDB <- sqlQuery(channel, "select uncompress(columnName) from tableName where Id=id")

returns just one row, dataFromDB[1,1] actually contains 4000+ rows of the form field1 \t field2 \t value \n described earlier (4000+ rows compressed as one long string).

On printing dataFromDB[1,1], it does not print beyond 3600 such rows or so (printing in fact field1 \t field2 \t value \n ... field3600 \t field3601), abruptly missing the rest of the result. Hence read.table (after using textConnection as suggested) throws an error that row xyz does not contain 3 values, etc. It seems to be missing 1/4th of the actual result, which should contain 4000+ such pairs.

The set of 4000+ rows occupies just 100KB if written out to a file directly from MySQL.

Is there any way to increase the capacity of the returned result in R so that it does not get cut off as above and retrieves the ENTIRE result? I tried increasing buffsize but, as I understand it, since sqlQuery itself returns just one row in this case, it is possibly not very relevant here.

Note that the above-mentioned problem does not arise when the data returned from the SQL query contains fewer than 3500 such concatenated entries.

Your input is greatly appreciated.

Thanks
Lalitha

--- Marc Schwartz [EMAIL PROTECTED] wrote:

On Thu, 2007-03-01 at 08:34 -0800, lalitha viswanath wrote:

Hi

Thanks much for the prompt response to my earlier enquiry on packages for regression analyses. Along the same topic(?), I have another question about which I could use some input.

I am retrieving data from a MySQL database using RODBC. The table has many BLOB columns and each BLOB column has data in the format

    id1 \t id2 \t measure \n id3 \t id4 \t measure

(i.e. multiple rows compressed as one long string).

I am retrieving them as follows:

    dataFromDB <- sqlQuery(channel, "select uncompress(columnName) from tableName")

I am looking for ways to convert this long string into a table/data frame in R, making it easier for further post-processing, without reading/writing it to a file first.

By doing write.table and reading it in again, I got the result in a data frame, with the \t and \n interpreted correctly. However, I wish to sidestep this as I need to carry out this analysis for over 4 million such entries. I tried

    write.table(dataFromDB, file = FileName)
    dataFromFile <- read.table(FileName, sep = "\t")

dataFromFile is of the form

    92_8_nmenA  993_7_mpul  1.042444
    92_8_nmenA  3_5_cpneuA  0.900939
    190_1_rpxx  34_4_ctraM  0.822532
    190_1_rpxx  781_6_pmul  0.870016

Your input on the above is greatly appreciated.

Thanks
Lalitha

The easiest way might be to use a textConnection(). Let's say that you have read in your data as above and you have a column called 'blob':

    dataFromDB
                                                  blob
    1 id1 \t id2 \t measure \n id3 \t id4 \t measure

    # Open textConnection. Note coercion to character
    BLOB <- textConnection(as.character(dataFromDB$blob))

    # Read in the column
    DF <- read.table(BLOB, sep = "\t")

    # Close the connection
    close(BLOB)

    DF
       V1  V2      V3
    1 id1 id2 measure
    2 id3 id4 measure

See ?textConnection

HTH,

Marc Schwartz
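Before changing driver settings, it may help to confirm where the string is cut off. The sketch below is only a diagnostic, not a fix for the truncation: it measures the length of the returned string and uses count.fields() to find rows that do not have the expected three tab-separated fields. The string blob, with its last row deliberately cut short, is a made-up stand-in for dataFromDB[1, 1].

```r
# Stand-in for a truncated BLOB value: the third row is cut off on purpose
blob <- "f1\tf2\t0.5\nf3\tf4\t0.7\nf5\tf6"

# How many characters actually arrived in R
print(nchar(blob))

# Count the tab-separated fields on each row of the string
con <- textConnection(blob)
n_fields <- count.fields(con, sep = "\t")
close(con)

# Rows that do not have the expected 3 fields
bad_rows <- which(n_fields != 3)
print(bad_rows)
```

Comparing nchar() of the value in R with the length MySQL reports for the same row would show whether the data were cut off before reaching R or only when printed.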
[R] Packages in R for least median squares regression and computing outliers (thompson tau technique etc.)
Hi

I am looking for suitable packages in R that do regression analysis using the least median of squares method (or better). Additionally, I am also looking for packages that implement algorithms/methods for detecting outliers that can be discarded before doing the regression analysis.

Although some websites refer to an lms method under package lps in R, I am unable to find such a package on CRAN.

I would greatly appreciate any pointers to suitable functions/packages for doing the above analyses.

Thanks
Lalitha
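For reference, the MASS package that ships with R provides lqs(), whose method = "lms" option fits a least median of squares regression. The sketch below runs it on simulated data with a few gross outliers; the data, the seed and the 2.5 cut-off on scaled residuals are assumptions made for illustration, and this is not the Thompson tau technique mentioned in the subject line.

```r
library(MASS)  # recommended package distributed with R

set.seed(42)   # assumed seed, for reproducibility only

# Simulated line y = 2 + 3x with small noise and five gross outliers
x <- seq(0, 10, length.out = 50)
y <- 2 + 3 * x + rnorm(50, sd = 0.2)
y[46:50] <- y[46:50] + 50

ols_fit <- lm(y ~ x)                   # ordinary least squares, for comparison
lms_fit <- lqs(y ~ x, method = "lms")  # least median of squares

print(coef(ols_fit))
print(coef(lms_fit))

# Flag points whose residuals are large relative to the robust scale estimate
resid_scaled <- residuals(lms_fit) / lms_fit$scale[1]
suspect <- which(abs(resid_scaled) > 2.5)
print(suspect)
```

With a robust fit of this kind, outliers usually do not need to be removed beforehand, since the fit is designed to resist them.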
[R] Query about merging two tables
Hi

I have table1, which has the following columns:

    id  age  rate

and table2, which has the following columns:

    id  count

I wish to get data from table1 for all the ids which are present in table2 and where the rate is not equal to 999. The ids in table2 are a subset of those in table1, and every id in table2 has an entry in table1.

I would appreciate your input regarding the above.

Thanks in advance
Lalitha
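A short sketch of one way to do this with logical indexing, assuming both tables are data frames. The values are made up; only the column names come from the post.

```r
# Made-up tables with the columns described in the post
table1 <- data.frame(id   = 1:6,
                     age  = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                     rate = c(0.5, 999, 0.7, 0.2, 999, 0.9))
table2 <- data.frame(id = c(2, 3, 4, 6), count = c(10, 20, 30, 40))

# Rows of table1 whose id is in table2 and whose rate is not 999
keep   <- table1$id %in% table2$id & table1$rate != 999
result <- table1[keep, ]
print(result)
```

If the count column from table2 is also wanted in the result, merge(table1, table2, by = "id") followed by the same rate filter is an alternative.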
Re: [R] unique/subset problem
Hi

The pruned dataset has 8 unique genomes in it, while the dataset before pruning has 65 unique genomes in it. However, calling unique on the pruned dataset seems to return 65 no matter what. Any assistance in this matter would be appreciated.

Thanks
Lalitha

--- Weiwei Shi [EMAIL PROTECTED] wrote:

Hi,

Even if you removed many genome1 entries by setting score < -5, that does not necessarily mean you changed the set of unique values. To check this, you can do something like

    p0 <- unique(dataset[dataset$score < -5, "genome1"])   # same as subset
    p1 <- unique(dataset[dataset$score >= -5, "genome1"])
    setdiff(p1, p0)

If the output above is empty, then even though you removed many genome1 entries, it did not change the uniqueness.

HTH,
weiwei

On 1/25/07, lalitha viswanath [EMAIL PROTECTED] wrote:

Hi

I am new to R programming and am using subset to extract part of a dataset as follows:

    names(dataset) = c("genome1", "genome2", "dist", "score")
    prunedrelatives <- subset(dataset, score < -5)

However, when I use unique to find the number of unique genomes now present in prunedrelatives, I get results identical to calling unique(dataset$genome1), although subset has eliminated many genomes and records.

I would greatly appreciate your input about using unique correctly in this regard.

Thanks
Lalitha

--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

Did you always know? No, I did not. But I believed... ---Matrix III
Re: [R] unique/subset problem
Hi

I read in my dataset using

    dt <- read.table(filename)

Calling unique(levels(dt$genome1)) yields the following:

     [1] aero      aful      aquae     atum_D    bbur      bhal      bmel      bsub
     [9] buch      cace      ccre      cglu      cjej      cper      cpneuA    cpneuC
    [17] cpneuJ    ctraM     ecoliO157 hbsp      hinf      hpyl      linn      llact
    [25] lmon      mgen      mjan      mlep      mlot      mpneu     mpul      mthe
    [33] mtub      mtub_cdc  nost      pabyssi   paer      paero     pmul      pyro
    [41] rcon      rpxx      saur_mu50 saur_n315 sent      smel      spneu     spyo
    [49] ssol      stok      styp      synecho   tacid     tmar      tpal      tvol
    [57] uure      vcho      xfas      ypes

It shows 60 genomes, which is correct.

I extracted a subset as follows:

    possible_relatives_subset <- subset(dt, Y < -5)

I am pasting the results below:

          genome1    genome2  parameterX            Y
    21       sent  ecoliO157     0.00590  -200.633493
    22       sent       paer     0.18603  -100.200570
    27       styp  ecoliO157     0.00484  -240.708645
    28       styp       paer     0.18497   -30.250127
    41       paer       sent     0.18603   -60.200570
    44       paer       styp     0.18497   -80.250127
    49       paer       hinf     0.18913   -90.056333
    53       paer       vcho     0.18703   -10.153929
    55       paer       pmul     0.18587  -100.208042
    67       paer       buch     0.21485   -80.898667
    70       paer       ypes     0.18460  -107.267454
    82       paer       xfas     0.26268   -61.920552
    95       hinf  ecoliO157     0.07654  -163.018417
    96       hinf       paer     0.18913   -10.056333
    103      vcho  ecoliO157     0.09518  -140.921153
    104      vcho       paer     0.18703   -10.153929
    107      pmul  ecoliO157     0.07328  -165.215225
    108      pmul       paer     0.18587   -10.208042
    131      buch  ecoliO157     0.15412   -11.746939
    132      buch       paer     0.21485    -8.898667
    137      ypes  ecoliO157     0.02705   -19.171851
    138      ypes       paer     0.18460   -10.267454
    171 ecoliO157       sent     0.00590   -20.633493
    174 ecoliO157       styp     0.00484   -20.708645
    179 ecoliO157       hinf     0.07654    -6.018417
    183 ecoliO157       vcho     0.09518   -14.921153
    185 ecoliO157       pmul     0.07328    -6.215225
    197 ecoliO157       buch     0.15412   -11.746939
    200 ecoliO157       ypes     0.02705    -9.171851
    211 ecoliO157       xfas     0.25833   -71.091552
    217      xfas  ecoliO157     0.25833   -75.091552
    218      xfas       paer     0.26268   -64.920552

I think even a cursory look will tell us that there are not as many unique genomes in the subset results (around 8 to 10).

However, when I do unique(levels(possible_relatives_subset$genome1)), I get

     [1] aero      aful      aquae     atum_D    bbur      bhal      bmel      bsub
     [9] buch      cace      ccre      cglu      cjej      cper      cpneuA    cpneuC
    [17] cpneuJ    ctraM     ecoliO157 hbsp      hinf      hpyl      linn      llact
    [25] lmon      mgen      mjan      mlep      mlot      mpneu     mpul      mthe
    [33] mtub      mtub_cdc  nost      pabyssi   paer      paero     pmul      pyro
    [41] rcon      rpxx      saur_mu50 saur_n315 sent      smel      spneu     spyo
    [49] ssol      stok      styp      synecho   tacid     tmar      tpal      tvol
    [57] uure      vcho      xfas      ypes

Where am I going wrong?

I tried calling unique without the levels too, which gives me the following response:

     [1] sent styp paer hinf vcho pmul buch ypes ecoliO157 xfas
    60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes

--- Weiwei Shi [EMAIL PROTECTED] wrote:

Then you need to provide more details about the calls you made and your dataset. For example, you can show us the output of str(prunedrelatives, 1), how you called unique on prunedrelatives, and so on. I made a test dataset and it gave me what you wanted (omitted here).

On 1/26/07, lalitha viswanath [EMAIL PROTECTED] wrote:

Hi

The pruned dataset has 8 unique genomes in it, while the dataset before pruning has 65 unique genomes in it. However, calling unique on the pruned dataset seems to return 65 no matter what. Any assistance in this matter would be appreciated.

Thanks
Lalitha

--- Weiwei Shi [EMAIL PROTECTED] wrote:

Hi,

Even if you removed many genome1 entries by setting score < -5, that does not necessarily mean you changed the set of unique values. To check this, you can do something like

    p0 <- unique(dataset[dataset$score < -5, "genome1"])   # same as subset
    p1 <- unique(dataset[dataset$score >= -5, "genome1"])
    setdiff(p1, p0)

If the output above is empty, then even though you removed many genome1 entries, it did not change the uniqueness.

HTH,
weiwei

On 1/25/07, lalitha viswanath [EMAIL PROTECTED] wrote:

Hi

I am new to R programming and am using subset to extract part of a dataset as follows:

    names(dataset) = c("genome1", "genome2", "dist", "score")
    prunedrelatives <- subset(dataset, score < -5)

However when I use unique
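The output pasted above is consistent with genome1 being a factor: subset() drops rows but keeps every level of the factor, so levels() still reports all 60 names, while unique() on the column itself returns only the values actually present. The sketch below reproduces this on a made-up five-row data frame; the data and the name sub_dt are assumptions.

```r
# Made-up data: genome1 is a factor with four levels
dt <- data.frame(
  genome1 = factor(c("aero", "paer", "sent", "paer", "xfas")),
  Y       = c(-1, -10, -200, -3, -60)
)

sub_dt <- subset(dt, Y < -5)

# levels() still lists every level of the original factor
print(levels(sub_dt$genome1))

# The values actually present after subsetting
present <- unique(as.character(sub_dt$genome1))
print(present)

# droplevels() removes the unused levels from the subset
print(nlevels(droplevels(sub_dt)$genome1))
```

So length(unique(...)) on the column, or nlevels() after droplevels(), gives the count of genomes remaining in the subset.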
[R] Package for phylogenetic tree analyses
Hi

I am looking for a package that
1. reads in a phylogenetic tree in NEXUS format
2. given two members/nodes on the tree, can return the distance between the two using the tree.

I came across the following packages on CRAN: ouch, ape, apTreeShape, phylgr. All of them seem to provide an extensive range of functions for reading in a NEXUS-format tree and performing phylogenetic analyses, tree comparisons etc., but none, to the best of my understanding, seems to provide a function to obtain distances (in terms of branch lengths) between two nodes on a single tree.

I am working with just one tree and need a function to return distances between various pairs of nodes on the tree.

Is there any other package out there that has this functionality?

Thanks for your responses to my earlier queries. As a beginning R programmer, I have found your responses to be of utmost help and guidance.

Lalitha
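As a hedged pointer: the ape package has a cophenetic() method for trees that returns branch-length distances between all pairs of tips, and dist.nodes() for distances between all nodes, internal ones included. The sketch below uses a tiny made-up tree given in Newick form through read.tree(text = ...); for a NEXUS file, read.nexus() would take its place. It assumes ape is installed and skips itself otherwise.

```r
if (requireNamespace("ape", quietly = TRUE)) {
  # Made-up three-tip tree with branch lengths
  tree <- ape::read.tree(text = "((A:1,B:2):3,C:4);")

  # Matrix of branch-length distances between every pair of tips
  tip_dist <- stats::cophenetic(tree)
  print(tip_dist["A", "C"])  # path A -> internal node -> root -> C

  # Distances between all nodes, including internal ones
  all_dist <- ape::dist.nodes(tree)
  print(dim(all_dist))
}
```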
[R] unique/subset problem
Hi

I am new to R programming and am using subset to extract part of a dataset as follows:

    names(dataset) = c("genome1", "genome2", "dist", "score")
    prunedrelatives <- subset(dataset, score < -5)

However, when I use unique to find the number of unique genomes now present in prunedrelatives, I get results identical to calling unique(dataset$genome1), although subset has eliminated many genomes and records.

I would greatly appreciate your input about using unique correctly in this regard.

Thanks
Lalitha
[R] Query about extracting subset of datafram
Hi

I have a table read from a MySQL database which is of the kind

    clusterid  clockrate

I obtained this table in R as

    clockrates_table <- sqlQuery(channel, "select")

I have a function within which I wish to extract the clockrate for a given clusterid. Although I know that there is just one row per clusterid in the data frame, I am using subset to extract the clockrate:

    clockrate = subset(clockrates_table, clusterid == 15, select = c(clockrate))

Is there any way of extracting the clockrate without using subset?

The help page for subset says to see also "[". However, I could find no entry for it when I searched with ?[, etc.

The R manuals, despite discussing complex libraries, techniques etc., don't always seem to provide such handy hints, tips and tricks for manipulating data, which is a first stumbling block for newbies like me. I would greatly appreciate it if you could point me to such resources as well, for future reference.

Thanks
Lalitha
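A minimal sketch of extracting the value with plain indexing, using a made-up stand-in for clockrates_table. The help page for the bracket operator needs quoting to be found: ?"[" or ?Extract.

```r
# Made-up stand-in for the table read from the database
clockrates_table <- data.frame(clusterid = c(12, 15, 27),
                               clockrate = c(0.31, 0.87, 0.05))

# Index the column by a logical condition on clusterid
rate <- clockrates_table$clockrate[clockrates_table$clusterid == 15]

# Equivalent form using row and column indexing on the data frame
rate2 <- clockrates_table[clockrates_table$clusterid == 15, "clockrate"]

print(rate)
```

Both forms return a plain numeric value, whereas subset() with select returns a one-column data frame.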
[R] User defined function calls
Hi

I have a script processfiles.R that contains, amongst other functions:

1) a database access function called get_clockrates, which retrieves from a database a table containing columns (clusterid, clockrate) and 45000 rows (one for each clusterid). Clusterid is an integer and clockrate is a float.

2) process_clusterid, which takes clusterid as an argument and, after doing some data processing, retrieves the clockrate corresponding to the clusterid.

I wish to call get_clockrates only once and keep the data frame returned by it as a GLOBAL which the function process_clusterid can use for each clusterid that it processes.

To ensure that clockrates is global, I retrieve it as clockrate <<- sqlQuery.. I trust that this is correct.

Without the inclusion of the get_clockrates function, I have run this script under R as follows:

    source("process_files.R")
    for (index in c(1:45000)) {
      try(process_file, silent=TRUE)
    }

How do I get this code to execute get_clockrates only once and subsequently call process_file for each of the 45000 files in turn?

I would greatly appreciate your input regarding my query.

Thanks
Lalitha
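One pattern that avoids a global altogether, sketched with stand-in functions: call get_clockrates() once before the loop, keep the result in a variable, and pass it to process_clusterid() as an extra argument. The function bodies below are hypothetical placeholders (no database is involved), and the two-argument signature of process_clusterid is an assumption.

```r
# Hypothetical stand-in for the database call described in the post
get_clockrates <- function() {
  data.frame(clusterid = 1:5, clockrate = c(0.1, 0.4, 0.2, 0.9, 0.6))
}

# Takes the table as an argument instead of relying on a global
process_clusterid <- function(clusterid, clockrates) {
  clockrates$clockrate[match(clusterid, clockrates$clusterid)]
}

clockrates <- get_clockrates()  # executed exactly once

# Process every clusterid, reusing the same table each time
results <- vapply(1:5,
                  function(id) process_clusterid(id, clockrates),
                  numeric(1))
print(results)
```

A variable assigned at the top level of a sourced script is already visible inside functions defined in that script, so <<- is only needed when assigning from inside a function.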
[R] Query about extracting subsets from a table
Hi

I am trying to process tabular data as follows. Data in the input file is of the form

    genome1  genome2  tree-dist  log10escore

Genome1 and genome2 are alphabetic. Tree-dist and log10escore are numeric.

I wish to extract only those rows from this table where the log10escore is less than -3.

    data <- read.table(filename)
    data$log10escore = data$log10escore[data$log10escore < -3]

I would like to use this pruned list of escores to get the corresponding genome names and tree-dist.

I did not find anything useful in the FAQs and Notes on R for this part of the data extraction. As I am just beginning programming in R, I would appreciate your input about this.

Thanks
L
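The usual approach is to filter whole rows rather than one column, so the genome names and distances stay attached to the scores that survive. A small sketch with made-up values follows; the data frame d and the column name treedist are assumptions, since tree-dist is not a valid R name without quoting.

```r
# Made-up data in the shape described in the post
d <- data.frame(genome1     = c("aero", "paer", "sent"),
                genome2     = c("paer", "sent", "xfas"),
                treedist    = c(0.2, 0.5, 0.1),
                log10escore = c(-1.5, -7.2, -3.4),
                stringsAsFactors = FALSE)

# Keep every column for the rows that satisfy the condition.
# Note the spaces: d$log10escore<-3 would be parsed as an assignment of 3.
pruned <- d[d$log10escore < -3, ]
print(pruned)
```

subset(d, log10escore < -3) gives the same rows.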
[R] Query about using optimizers in R without causing program to crash
Hi

I am a newbie to R and am using the lm function to fit my data. This optimization is to be performed for around 45000 files, not all of which lend themselves to optimization. Some of these will and do crash.

How do I ensure that the program simply goes on to the next file in line, without exiting the code with the error

    Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
      NA/NaN/Inf in foreign function call (arg 4)

every time it encounters troublesome data?

I would greatly appreciate your input, as it would save me from having to manually type

    for (fileId in c(4351:46000)) { ... }
    for (fileId in c(5761:46000)) { ... }

etc...

Thanks
Lalitha
Re: [R] Query about using try block
Hi

Thanks for your response. However, I seem to be doing something wrong regarding the try block, resulting in yet another error described below.

I have a function that takes in a file name and does the fit for the data in that file. Hence, based on your input, I tried

    try ( (fit = lm(y~x, data = data_fitting)), silent = T);

I left the subsequent lines of my code unchanged:

    coeffs = as.list(coef(fit));
    lambda = exp(coeffs$x)

After the change using try, I tried to resume processing under R as follows:

    source("fitting.R")
    for (filename in list) {
      process(filename);
    }

It says

    Cannot find object fit

(in the line trying to get the coefficients).

Am I closing the try block in the wrong place? This function does some post-processing on the coefficients returned by coef(fit), puts them in a list and sends it to another function (i.e. around 6 lines of code after the call to fit).

Thanks
Lalitha

--- Andreas Hary [EMAIL PROTECTED] wrote:

Look at ?try

Your code will probably need to be something like the following:

    fit <- list()
    for (fileId in 1:n) {
      try(fit[[fileId]] <- lm(formula, data = ???, ...), silent = FALSE)
      # or silent = TRUE to suppress the error messages
    }

Best wishes,
Andreas

lalitha viswanath wrote:

Hi

I am a newbie to R and am using the lm function to fit my data. This optimization is to be performed for around 45000 files, not all of which lend themselves to optimization. Some of these will and do crash.

How do I ensure that the program simply goes on to the next file in line, without exiting the code with the error

    Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
      NA/NaN/Inf in foreign function call (arg 4)

every time it encounters troublesome data?

I would greatly appreciate your input, as it would save me from having to manually type

    for (fileId in c(4351:46000)) { ... }
    for (fileId in c(5761:46000)) { ... }

etc...

Thanks
Lalitha

--
Andreas Hary
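A likely reason for the "Cannot find object fit" message: when lm() fails inside try(), the assignment never happens, so fit does not exist on the following line. One way around it, sketched below, is to assign the result of try() itself and test it with inherits() before using the coefficients. The function name fit_one and the two data frames are made up for illustration; the bad data set contains an Inf so that lm() stops with an error.

```r
# Fit one data set; return NULL instead of stopping when lm() fails
fit_one <- function(d) {
  fit <- try(lm(y ~ x, data = d), silent = TRUE)
  if (inherits(fit, "try-error")) {
    return(NULL)  # skip this data set and let the caller carry on
  }
  coeffs <- as.list(coef(fit))
  list(lambda = exp(coeffs$x))
}

good <- data.frame(x = 1:5, y = c(1.1, 1.9, 3.2, 3.9, 5.1))
bad  <- data.frame(x = 1:5, y = c(1, Inf, 3, 4, 5))  # lm() errors on Inf in y

# The loop runs to the end even though the second fit fails
results <- lapply(list(good, bad, good), fit_one)
print(vapply(results, is.null, logical(1)))
```

tryCatch(..., error = function(e) NULL) is an equivalent formulation that some find easier to read.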
[R] Query about using table
Hi

I have data of the following form:

    ID  age    member_FLAG
    1   25     Y
    2   36.75  N
    3   75.5   N
    .
    .

I want to get a histogram of this data showing the distribution of member_flag in each age bin, i.e. how many values in each age bin have a member_flag of 'Y' and how many have 'N'. I was able to do this using barplot2.

However, I also need similar information in tabular form using percentages, i.e. in each age bin, what is the PERCENTAGE of IDs with a member_flag of 'Y'?

I am trying to work with table for this, but would appreciate some guidance regarding the above.

Thanks
Lalitha
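A short sketch of one way to get both the counts and the percentage of 'Y' per age bin. The mean of a logical vector is the fraction of TRUE values, so tapply() over the bins gives the share directly. The data frame d, the extra rows and the bin edges are made up for illustration.

```r
# Made-up data in the shape described in the post
d <- data.frame(id          = 1:6,
                age         = c(25, 36.75, 75.5, 28, 31, 62),
                member_flag = c("Y", "N", "N", "Y", "Y", "Y"))

# Assumed bin edges; choose whatever suits the real age range
age_bin <- cut(d$age, breaks = c(20, 40, 60, 80))

# Counts of Y and N in each bin
counts <- table(age_bin, d$member_flag)
print(counts)

# Percentage of IDs flagged 'Y' in each bin (NA where a bin is empty)
pct_yes <- 100 * tapply(d$member_flag == "Y", age_bin, mean)
print(pct_yes)
```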
[R] Query about getting averages across a certain parameter in a table
Hi

I have a table, data, that goes:

    cluster_ac   clockrate   age     class
    7337         0.9         0.001   alpha_proteins
    7888         0.1         0.78    beta proteins
    etc.

The class column can have 7-8 different unique values, while the clockrate and age columns are floats varying from 0 to 1. I wish to get the average clockrate for each of the classes in this data. I would appreciate your help regarding the above.

Thanks
Lalitha
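A short sketch of the per-class average, using tapply() and, for a data-frame result, aggregate(). The four rows below are invented to match the column layout in the question; the name dat is also an assumption.

```r
# Made-up rows with the columns described in the question.
dat <- data.frame(
  cluster_ac = c(7337, 7888, 7901, 7950),
  clockrate  = c(0.9, 0.1, 0.5, 0.3),
  age        = c(0.001, 0.78, 0.20, 0.45),
  class      = c("alpha_proteins", "beta_proteins",
                 "alpha_proteins", "beta_proteins")
)

# Named vector: one mean clockrate per class.
mean_rate <- tapply(dat$clockrate, dat$class, mean)

# The same result as a data frame with columns 'class' and 'clockrate'.
mean_rate_df <- aggregate(clockrate ~ class, data = dat, FUN = mean)
print(mean_rate_df)
```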
[R] Query about getting a table of binned values
Hi

I am working with a dataset of age and class of proteins:

    #Age 0 0.0 0.677
    #Class Type A Type B . . . Type K

I wish to get a table that reads as follows:

              0-0.02   0.02-0.04   0.04-0.06   ...   0.78-0.8
    Type A    15       20          5           ...   8
    Type B    86
    ...
    Type K    10       7

I would appreciate your input regarding the appropriate functions to use for this purpose.

regards
Lalitha
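A table of this shape can probably be obtained by combining cut() with a two-way table(). The sketch below assumes two parallel vectors, age and class, with invented values; the 0.02 bin width and the 0 to 0.8 range are taken from the column headings in the question.

```r
# Made-up data: one age and one class per protein.
age   <- c(0.001, 0.015, 0.03, 0.05, 0.033, 0.79)
class <- c("Type A", "Type A", "Type B", "Type A", "Type B", "Type K")

# 40 bins of width 0.02 from 0 to 0.8; include.lowest keeps age == 0.
age_bin <- cut(age, breaks = seq(0, 0.8, by = 0.02), include.lowest = TRUE)

# Rows are classes, columns are age bins, cells are counts.
counts <- table(class, age_bin)
print(counts[, 1:3])
```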
[R] Query about the functions used in tapply
Hi

I am trying to plot an x-y plot of the values of a certain variable against bins, i.e. the x-axis goes from 0 to 0.7 in increments of 0.02, while the y-axis is the average of the values for all the points in that interval. I first used cut to break the data into intervals, then applied tapply with mean as the function and plotted the results. I also replaced mean with median. The functions I used were:

    SWfac <- cut(sorted_inp$age[1:290], seq(0, 0.7, 0.02))
    SLmean <- tapply(sorted_inp$clock_rate[1:290], SWfac, mean)
    plot(SLmean, type = "b", xaxt = "n")
    axis(1, seq(SLmean), levels(SWfac))

However, the value plotted on the y-axis does not seem to be correct. For example, in the interval 0.38-0.4 there is a very large number of points with a y-axis value below 20, while very few have a y-axis value above 20. Yet the median plotted is still around the 20 mark. Looking at the data, it does not seem plausible that more than 50% of the points have a clock_rate (plotted on the y-axis) above 20.

Is there something about the way these functions work with tapply that I am missing? Any obvious mistakes that I should look for?

I tried a simple x-y scatter plot of the same 290 rows in Excel (without binning them), and the concentration of points at lower values of clock rate does not seem to indicate that the medians should be as high as they are shown.

Hoping to hear further
Regards
Lalitha
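One way to check whether tapply is doing what is expected would be to tabulate, next to each bin's median, the number of points that fell into it, and then compare one bin against a median computed by hand. The sketch below uses invented values for age and clock_rate; only the breaks come from the code above.

```r
# Made-up data concentrated in two neighbouring bins.
age        <- c(0.381, 0.385, 0.39, 0.395, 0.399, 0.41, 0.415)
clock_rate <- c(5, 8, 12, 30, 45, 3, 7)

bins <- cut(age, breaks = seq(0, 0.7, by = 0.02))

# Median and number of points per bin; empty bins give NA and 0.
bin_median <- tapply(clock_rate, bins, median)
bin_n      <- table(bins)

# Keep only the bins that actually contain data.
summary_tab <- data.frame(
  n      = as.vector(bin_n[bin_n > 0]),
  median = as.vector(bin_median[!is.na(bin_median)]),
  row.names = names(bin_median)[!is.na(bin_median)]
)
print(summary_tab)

# Cross-check one bin by hand: subset on the raw ages, then take the median.
by_hand <- median(clock_rate[age > 0.38 & age <= 0.40])
```

If the hand-computed value and the tapply value disagree for real data, the usual suspects are the two vectors being indexed differently (for example, one sorted and the other not) or points sitting exactly on a bin boundary.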
Re: [R] table of means/medians across bins used for a histogram
Hi

I think I phrased my question incorrectly. I want an x-y plot of age v/s rate (the bin is irrelevant for this plot); only, instead of a simple x-y plot, I want a plot of average(rate) for each age interval. My ages vary from 0 to 0.7 and I want to divide them into groups of 0.02. So I want a plot of the following:

    Age-interval   Average rate in that interval
    0-0.02         5
    0.02-0.04      7
    0.04-0.06      1
    0.06-0.08      0
    0.08-0.1       0.15

with the age intervals along the x-axis (as for a histogram) and the rate plotted for each age interval.

--- Gabor Grothendieck [EMAIL PROTECTED] wrote:

Or perhaps a bit simpler:

    plot(age ~ ave(rate, bin), DF)

On 4/30/06, Gabor Grothendieck [EMAIL PROTECTED] wrote:

My understanding is that you want to replace each rate with its average over the associated bin and then plot age against that. In that case try this:

    DF # test data
        age rate bin
    1 0.002 10.0   A
    2 0.045  0.1   B
    3 0.130 15.0   A
    4 0.150 34.0   D

    with(DF, plot(ave(rate, bin), age))

Assuming the columns age, rate and bin are stored in vectors, we would have

    plot(ave(rate, bin), age)

On 4/30/06, lalitha viswanath [EMAIL PROTECTED] wrote:

Hi

I am trying to get a table of means of parameter 1 across BINS of parameter 2. I am working in proteomics and a sample of my data is as follows:

    cluster-age   clock-rate (evolutionary rate)   scopclass
    0.002         10                               A
    0.045         0.1                              B
    0.13          15                               A
    0.15          34                               D

Scop class has only 9 distinct categories (A-I), whereas cluster-age and clock-rate are discrete variables greater than 0. I am trying to do two things with this kind of data, of which I managed to accomplish one thanks to the documentation and pre-existing queries on the mailing lists.

1. Plot a histogram of the age distribution with scop class category superimposed on each bin. I managed to do this with barplot2.

2. Now I am trying to plot a scatter plot of the age v/s the clock-rate. However, to eliminate possible sampling errors, we are trying to get an average of the clock-rate for each of the bins used above, i.e. before plotting an x-y plot, I wish to compute the average clock-rate in each of the age bins and then plot an x-y plot of the age v/s clock rate.

Can anyone point me to appropriate functions for this? I am trying to work with prop.table, cut, break, etc., but I am not heading anywhere.

Thanks
Lalitha
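The difference between the two readings in this thread can be sketched as follows: ave() returns one averaged value per row, whereas tapply() over age intervals built with cut() returns one value per interval, which is what the clarified question asks for. The six data rows are invented (the first four follow the test data quoted above); the 0.02 width and 0 to 0.7 range come from the question.

```r
# Made-up data: age and rate only, since the scop class is not needed here.
DF <- data.frame(
  age  = c(0.002, 0.045, 0.130, 0.150, 0.010, 0.050),
  rate = c(10, 0.1, 15, 34, 20, 0.3)
)

# Age intervals of width 0.02; droplevels() discards the empty ones.
DF$interval <- droplevels(
  cut(DF$age, breaks = seq(0, 0.7, by = 0.02), include.lowest = TRUE)
)

# ave(): one value per ROW (each rate replaced by its interval mean).
per_row <- ave(DF$rate, DF$interval)

# tapply(): one value per INTERVAL.
per_bin <- tapply(DF$rate, DF$interval, mean)

# Plot the interval means with the interval labels on the x-axis.
plot(seq_along(per_bin), per_bin, type = "b", xaxt = "n",
     xlab = "Age interval", ylab = "Mean rate")
axis(1, at = seq_along(per_bin), labels = names(per_bin))
```

Dropping the empty intervals compresses the x-axis; leaving out droplevels() keeps all 35 intervals, with NA for the empty ones.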
[R] table of means/medians across bins used for a histogram
Hi

I am trying to get a table of means of parameter 1 across BINS of parameter 2. I am working in proteomics and a sample of my data is as follows:

    cluster-age   clock-rate (evolutionary rate)   scopclass
    0.002         10                               A
    0.045         0.1                              B
    0.13          15                               A
    0.15          34                               D

Scop class has only 9 distinct categories (A-I), whereas cluster-age and clock-rate are discrete variables greater than 0. I am trying to do two things with this kind of data, of which I managed to accomplish one thanks to the documentation and pre-existing queries on the mailing lists.

1. Plot a histogram of the age distribution with scop class category superimposed on each bin. I managed to do this with barplot2.

2. Now I am trying to plot a scatter plot of the age v/s the clock-rate. However, to eliminate possible sampling errors, we are trying to get an average of the clock-rate for each of the bins used above, i.e. before plotting an x-y plot, I wish to compute the average clock-rate in each of the age bins and then plot an x-y plot of the age v/s clock rate.

Can anyone point me to appropriate functions for this? I am trying to work with prop.table, cut, break, etc., but I am not heading anywhere.

Thanks
Lalitha