[R] Query about RODBC to access MySQL from Windows

2007-05-02 Thread lalitha viswanath
Hi
I am trying to use RODBC in R installed on Windows to
access a MySQL database (on a Linux box).
I set up a DSN and specified this DSN in R as follows:
library(RODBC)
channel <- odbcConnect("mysqldsn")
RODB Connection 5
Details:
  case=nochange
  PORT=3306

Although this seems to connect properly, running any
command yields NO results,
i.e. sqlQuery(channel, "show tables") yields 0 rows
when there are close to 500 tables in the database.
Ditto with any other query: it does not cause an
error, but it returns 0 rows.

The USER DSN mysqldsn is set up as follows :-
host : zion.xxx.xxx.xxx
default database : default_db
port : 3306
username : uname
password : pwd

Running "use default_db; show tables;" from
the command prompt on the db server returns 500 rows.

I see this problem when running any query.
Running "select * from tname limit 100" returns 0 rows
whereas tname has around a million records.

In the past, I have used MySQL clients for Windows to
access the database without encountering any such
problem.

I even tried setting up the mysqldsn DSN as a system
DSN instead of a user DSN.

I would like to know
a) whether this is a permissions issue at some level
b) whether there is any solution for this problem in R



Thanks
Lalitha

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Query about finding correlations

2007-05-02 Thread lalitha viswanath
Hi
I have a dataframe which has 3 columns of numeric data
A,B,C each of which has been obtained independently of
the others.

We are trying to find out which of A or B causes C,
i.e. we are hypothesising that C is the effect and
either A or B, not both, is the cause.

i.e. A causes C and this cause-effect relationship
explains B.

The data for A contains more noise than that for B.
We are working with around 1000 points.

I would greatly appreciate any input on the best
statistical approach to tackle this problem.
I am thinking that we can find correlation
coefficients between A and C, and between B and C, but
I am not sure this answers the question.
Also we do not know whether the correlation between
them is linear or non linear.

Thanks
Lalitha



Re: [R] Query about finding correlations

2007-05-02 Thread Lalitha Viswanath
Hi
This is not a homework assignment :)
My manager and I are trying to understand the problem better. In the
meantime, we thought we would post the problem on this forum to seek some
input from statisticians who possibly do this kind of analysis every day and
hence are possibly more proficient with R and/or any recommended
methodologies.

Lalitha

On 5/2/07, Stefan Grosse [EMAIL PROTECTED] wrote:

 How about doing your homework yourselves?

 
 
 





Re: [R] Query about finding correlations

2007-05-02 Thread Lalitha Viswanath
Hi
Thanks for your input. I stand corrected.
The causation is not linear.
We wish to find out which is the cause and under what circumstances, i.e. at
what points along a scale for C A is the cause and when B becomes the
cause, if at all.
As a crude analysis, we assumed that the above is not the case (i.e. either
A causes C all the time or B causes C all the time) and obtained correlation
coefficients using lmFit; however, as you rightly mentioned, it is not of
much help to us.
We are trying to find out whether the ages of proteins (A) or their rates of
evolution (B) influence parameter C.
There is an obvious correlation between A and B which needs to fulfil the
hypothesis as well.
I am checking out wald.test (eba), HypothesisTesting (fBasics), O8.Tests,
O6.LinearModels (limma) amongst others presently.
Thanks
Lalitha

On 5/2/07, Alberto Monteiro [EMAIL PROTECTED] wrote:

 Lalitha Viswanath wrote:
 
  We are trying to find out which of A or B causes C,
  i.e. we are hypothesising that C is the effect and
  either A or B, not both, is the cause.
  (...)
  I would greatly appreciate any input on the best
  statistical approach to tackle this problem.
  I am thinking that we can find correlation
  coefficients between A and C, and between B and C, but
  I am not sure this answers the question.
  Also we do not know whether the correlation between
  them is linear or non linear.
 
 If the causation (not the correlation) is not linear,
 then the correlation (which is linear, always) may not
 be the best indicator.

 Take, as an extreme case, this:

 A <- (-50:50) + 100 * rnorm(101)
 B <- abs((-50):50) + 10 * rnorm(101)
 C <- A^2 / 50 + rnorm(101)
 cor(A, C)
 cor(B, C)

 A is obviously the cause of C, but B (in some cases)
 is better correlated to C than A to C.

 Alberto Monteiro
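
The point made in the example above can be checked with a fixed seed. This is a minimal sketch in base R; the seed value and the extra comparison against A^2 are assumptions added for illustration and are not part of the original example.

```r
set.seed(1)  # arbitrary seed, chosen only so the run is reproducible
A <- (-50:50) + 100 * rnorm(101)
C <- A^2 / 50 + rnorm(101)

r_linear  <- cor(A, C)    # Pearson correlation on the raw scale
r_squared <- cor(A^2, C)  # correlation after applying the true transformation
```

Because C is a noisy linear function of A^2, r_squared comes out very close to 1, while r_linear depends on how symmetric the sampled A values happen to be around zero.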





[R] Query about using rowSums/ColSums on table results

2007-04-30 Thread lalitha viswanath
Hi
I have data of the form
class age
A  0.5
B  0.4
A  0.5
C  0.785
D  0.535
A  0.005
C  0.015
D  0.205
A  0.605

etc etc...

I tabulated the above
as
tab <- table(data$class, cut(data$age, seq(0, 0.6, 0.02)))

I wish to view the results in individual bins as a
percentage of the points in each bin.
So I tried
tab/colSums(tab)

However that is yielding Inf as a return value in
places where clearly the result should be a non-zero
value.

Is there an alternate way to get the results in each
bin as percentages of the total points in that
age-bin?

Thanks
Lalitha
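
One likely cause, offered as a hedged reading of the post: tab/colSums(tab) recycles the vector of column totals down the rows of the table, so most cells are divided by the total of a different bin, and a non-zero count divided by the zero total of an empty bin gives Inf. A minimal sketch using base R's prop.table, with the sample rows from the post as data:

```r
data <- data.frame(
  class = c("A", "B", "A", "C", "D", "A", "C", "D", "A"),
  age   = c(0.5, 0.4, 0.5, 0.785, 0.535, 0.005, 0.015, 0.205, 0.605)
)
# Ages above 0.6 fall outside the breaks, become NA and are dropped by table()
tab <- table(data$class, cut(data$age, seq(0, 0.6, 0.02)))

# Divide every cell by its own column (age-bin) total, then scale to percent.
# Empty bins give NaN (0/0) rather than Inf.
pct <- 100 * prop.table(tab, margin = 2)

# Equivalent formulation that makes the column-wise division explicit
pct_sweep <- 100 * sweep(tab, 2, colSums(tab), "/")
```

Using margin = 1 instead would give percentages within each class rather than within each age-bin.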



[R] Query about substituting characters in a df

2007-03-12 Thread lalitha viswanath
Hi
I have a data frame with 40,000 rows and 4 columns,
one of which is class.


For each row, the class column can be one of 10
possible NUMERIC values.
I wish to substitute these numeric values with
words/characters.
For example, I wish to substitute all occurrences of
5467 in the column class with "alpha", 7867 with
"gamma", etc.
I looked up substitute, but did not find any relevant
examples.

Your input is greatly appreciated
Thanks
Lalitha
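
One way to do this kind of recoding is a named lookup vector indexed by the codes converted to character. This is a minimal sketch in base R; the data frame, the new column name class_name and the third code/label pair are invented for illustration, and only 5467 -> "alpha" and 7867 -> "gamma" come from the post.

```r
# Hypothetical data frame standing in for the 40,000-row one
df <- data.frame(class = c(5467, 7867, 5467, 1234),
                 value = c(1.2, 3.4, 5.6, 7.8))

# Names are the numeric codes as strings, values are the replacement words
lookup <- c("5467" = "alpha", "7867" = "gamma", "1234" = "beta")

# Index the lookup by the codes; unname() drops the code names from the result
df$class_name <- unname(lookup[as.character(df$class)])
```

Codes missing from the lookup come back as NA, which makes unmapped values easy to spot.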


 



[R] Query about using setdiff

2007-03-07 Thread lalitha viswanath
Hi
I have two dataframes
names(DF1) = c("id", "val1", "val2");

names(DF2) = c("id2");

The ids in DF2 are a subset of those in DF1.

How can I extract the entries from DF1 whose id is NOT IN
DF2?

I tried setdiff(DF1, DF2); setdiff(DF1$id, DF2$id),
etc.
Although the latter eliminates the ids as required, I
don't know how to extract val1 and val2 for the
resultant set.


Thanks
Lalitha 
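
A common way to keep whole rows rather than just the ids is logical indexing with %in%. This is a minimal sketch with invented values; only the column names id, val1, val2 and id2 come from the post.

```r
DF1 <- data.frame(id   = 1:5,
                  val1 = c(10, 20, 30, 40, 50),
                  val2 = c("a", "b", "c", "d", "e"))
DF2 <- data.frame(id2 = c(2, 4))

# TRUE for rows of DF1 whose id does not occur in DF2$id2
keep <- !(DF1$id %in% DF2$id2)

# The empty column index after the comma keeps all columns (id, val1, val2)
not_in_DF2 <- DF1[keep, ]
```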


 



[R] Query about data manipulation

2007-03-01 Thread lalitha viswanath
Hi
Thanks much for the prompt response to my earlier
enquiry on packages for regression analyses.
Along the same topic(?), I have another question about
which I could use some input.

I am retrieving data from a MySQL database using
RODBC. 
The table has many BLOB columns and each BLOB column
has data in the format
id1 \t id2 \t measure \n id3 \t id4 \t measure
(i.e. multiple rows compressed as one long string)

I am retrieving them as follows.

dataFromDB <- sqlQuery(channel, "select uncompress(columnName) from tableName");


I am looking for ways to convert this long string
into a table/dataframe in R, making it easier for
further post processing etc without reading/writing it
to a file first.

Although by doing write.table and reading it in again,
I got the result in a data frame, with the \t and \n
interpreted correctly, I wish to sidestep this as I
need to carry out this analysis for over 4 million
such entries.
I tried 
write.table(dataFromDB, file=FileName);
dataFromFile <- read.table(FileName, sep="\t") 
dataFromFile is of the form

92_8_nmenA  993_7_mpul  1.042444
92_8_nmenA  3_5_cpneuA  0.900939
190_1_rpxx  34_4_ctraM  0.822532
190_1_rpxx  781_6_pmul  0.870016

Your input on the above is greatly appreciated.
Thanks
Lalitha



 



Re: [R] Query about data manipulation

2007-03-01 Thread lalitha viswanath
Hi
Thanks much for that input. It was extremely helpful.

I am seeking some input about another stumbling block
using RODBC, sqlQuery et al. with large BLOB values.

Although the following query 
dataFromDB <- sqlQuery(channel, "select
uncompress(columnName) from tableName where Id=id");
returns just one row, dataFromDB[1,1] actually
contains 4000+ rows of the form 
field1 \t field2 \t value \n described earlier
(4000+ rows compressed as one long string).

On printing dataFromDB[1,1], it does not print beyond
3600 such rows or so (printing in fact field1 \t
field2 \t value \n.field3600 \t field3601),
abruptly missing the rest of the result. 

Hence it throws an error when I try to use read.table
(after using textConnection as suggested) that row xyz
does not contain 3 values,etc.

It seems to be missing 1/4th of the actual result that
should contain 4000+ such pairs.

The set of 4000+ rows occupies just 100KB if written out
to a file directly from MySQL.
Is there any way to increase the capacity of the return
result in R so that it does not get thrown off as
above and retrieves the ENTIRE result?

I tried increasing buffsize, but as I understand,
since sqlQuery itself returns just one row in this
case, it is possibly not very relevant here?

Note that the above mentioned problem does not arise
when the data returned from SQL query contains less
than 3500 such concatenated entries.

Your input is greatly appreciated.
Thanks
Lalitha
--- Marc Schwartz [EMAIL PROTECTED] wrote:

 
 The easiest way might be to use a textConnection().
 
 Let's say that you have read in your data as above
 and you have a column
 called 'blob':
 
 > dataFromDB
                                              blob
 1 id1 \t id2 \t measure \n id3 \t id4 \t measure
 
 
 # Open textConnection.  Note coercion to character
 BLOB <- textConnection(as.character(dataFromDB$blob))
 
 # Read in the column
 DF <- read.table(BLOB, sep = "\t")
 
 # Close the connection
 close(BLOB)
 
 
 > DF
     V1    V2       V3
 1  id1   id2  measure
 2  id3   id4  measure
 
 
 See ?textConnection
 
 HTH,
 
 Marc Schwartz
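
In R versions whose read.table accepts a text argument, the same approach can be written without managing the connection by hand. This is a minimal sketch; the string below is built from two of the sample rows shown earlier in the thread, and the variable name blob is an assumption.

```r
# One BLOB value as a single string: rows separated by \n, fields by \t
blob <- "92_8_nmenA\t993_7_mpul\t1.042444\n92_8_nmenA\t3_5_cpneuA\t0.900939"

# `text =` makes read.table open and close the text connection internally
DF <- read.table(text = blob, sep = "\t", stringsAsFactors = FALSE)
```

The columns get the default names V1, V2 and V3, as in the output shown above.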
 
 
 



 



[R] Packages in R for least median squares regression and computing outliers (thompson tau technique etc.)

2007-02-28 Thread lalitha viswanath
Hi
I am looking for suitable packages in R that do
regression analyses using least median squares method
(or better). Additionally, I am also looking for
packages that implement algorithms/methods for
detecting outliers that can be discarded before doing
the regression analyses.

Although some websites refer to lms method under
package lps in R, I am unable to find such a package
on CRAN.

I would greatly appreciate any pointers to suitable
functions/packages for doing the above analyses.

Thanks
Lalitha
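
One candidate, on the assumption that the websites mean the lqs function: the MASS package provides lqs(), which supports method = "lms" for least median of squares. The sketch below uses invented data with three planted outliers; the 2.5 cutoff on the robust scale estimate is a common convention used here as an assumption, and it is not the Thompson tau technique mentioned in the subject line.

```r
library(MASS)  # a recommended package shipped with standard R distributions

set.seed(42)
x <- 1:30
y <- 1 + 2 * x + rnorm(30, sd = 0.1)
y[c(5, 15, 25)] <- y[c(5, 15, 25)] + 50  # three gross outliers, invented for the demo

fit_lms <- lqs(y ~ x, method = "lms")    # least median of squares fit
fit_ols <- lm(y ~ x)                     # ordinary least squares, for comparison

slope_lms <- unname(coef(fit_lms)["x"])

# Flag points whose LMS residual is large relative to the robust scale estimate
suspect <- which(abs(residuals(fit_lms)) > 2.5 * fit_lms$scale[1])
```

Comparing coef(fit_lms) with coef(fit_ols) shows how far the planted outliers pull the least-squares line.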


 



[R] Query about merging two tables

2007-02-06 Thread lalitha viswanath
Hi
I have table1 which has the following columns
id age rate

and table2 which has the following columns
id count

I wish to get data from table1 for all the ids which
are present in table2 and where the rate is not equal
to 999.
The ids in table2 are a subset of those in table1 and
every id in table2 has an entry in table1.

I would appreciate your input regarding the above.

Thanks in advance
Lalitha
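
Both conditions can be combined in a single logical index, or merge() can be used if the count column is wanted as well. This is a minimal sketch with invented values; only the column names and the 999 sentinel come from the post.

```r
table1 <- data.frame(id   = 1:6,
                     age  = c(30, 41, 52, 63, 74, 85),
                     rate = c(0.1, 999, 0.3, 0.4, 999, 0.6))
table2 <- data.frame(id = c(2L, 3L, 5L, 6L), count = c(7, 8, 9, 10))

# Rows of table1 whose id occurs in table2 and whose rate is not the 999 sentinel
wanted <- table1[table1$id %in% table2$id & table1$rate != 999, ]

# Alternative: an inner join on id, which also brings along the count column
merged <- merge(subset(table1, rate != 999), table2, by = "id")
```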





Re: [R] unique/subset problem

2007-01-26 Thread lalitha viswanath
Hi
The pruned dataset has 8 unique genomes in it while
the dataset before pruning has 65 unique genomes in
it.
However calling unique on the pruned dataset seems to
return 65 no matter what.

Any assistance in this matter would be appreciated.

Thanks
Lalitha
--- Weiwei Shi [EMAIL PROTECTED] wrote:

 Hi,
 
 Even though you removed many genome1 entries by setting score
 < -5, that does not necessarily mean you changed the uniqueness.
 
 To check this, you can do something like
 p0 <- unique(dataset[dataset$score < -5, "genome1"])
 # same as subset
 p1 <- unique(dataset[dataset$score >= -5, "genome1"])
 
 setdiff(p1, p0)
 
 If the output above is empty (NULL), then it means that even
 though you removed many genome1 entries, it did not change the
 uniqueness.
 
 HTH,
 
 weiwei
 
 
 
 
 
 
 -- 
 Weiwei Shi, Ph.D
 Research Scientist
 GeneGO, Inc.
 
 Did you always know?
 No, I did not. But I believed...
 ---Matrix III
 



 



Re: [R] unique/subset problem

2007-01-26 Thread lalitha viswanath
Hi
I read in my dataset using
dt <- read.table(filename)
Calling unique(levels(dt$genome1)) yields the
following 

 [1] "aero"      "aful"      "aquae"     "atum_D"   
 [5] "bbur"      "bhal"      "bmel"      "bsub"     
 [9] "buch"      "cace"      "ccre"      "cglu"     
[13] "cjej"      "cper"      "cpneuA"    "cpneuC"   
[17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"     
[21] "hinf"      "hpyl"      "linn"      "llact"    
[25] "lmon"      "mgen"      "mjan"      "mlep"     
[29] "mlot"      "mpneu"     "mpul"      "mthe"     
[33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"  
[37] "paer"      "paero"     "pmul"      "pyro"     
[41] "rcon"      "rpxx"      "saur_mu50" "saur_n315"
[45] "sent"      "smel"      "spneu"     "spyo"     
[49] "ssol"      "stok"      "styp"      "synecho"  
[53] "tacid"     "tmar"      "tpal"      "tvol"     
[57] "uure"      "vcho"      "xfas"      "ypes"     

It shows 60 genomes, which is correct.

I extracted a subset as follows
possible_relatives_subset <- subset(dt, Y < -5)
I am pasting the results below
 genome1   genome2 parameterX  Y
21   sent ecoliO157  0.00590 -200.633493
22   sent  paer  0.18603 -100.200570
27   styp ecoliO157  0.00484 -240.708645
28   styp  paer  0.18497 -30.250127
41   paer  sent  0.18603 -60.200570
44   paer  styp  0.18497 -80.250127
49   paer  hinf  0.18913 -90.056333
53   paer  vcho  0.18703 -10.153929
55   paer  pmul  0.18587 -100.208042
67   paer  buch  0.21485  -80.898667
70   paer  ypes  0.18460 -107.267454
82   paer  xfas  0.26268  -61.920552
95   hinf ecoliO157  0.07654 -163.018417
96   hinf  paer  0.18913 -10.056333
103  vcho ecoliO157  0.09518 -140.921153
104  vcho  paer  0.18703 -10.153929
107  pmul ecoliO157  0.07328 -165.215225
108  pmul  paer  0.18587 -10.208042
131  buch ecoliO157  0.15412 -11.746939
132  buch  paer  0.21485  -8.898667
137  ypes ecoliO157  0.02705 -19.171851
138  ypes  paer  0.18460 -10.267454
171 ecoliO157  sent  0.00590 -20.633493
174 ecoliO157  styp  0.00484 -20.708645
179 ecoliO157  hinf  0.07654 -6.018417
183 ecoliO157  vcho  0.09518 -14.921153
185 ecoliO157  pmul  0.07328 -6.215225
197 ecoliO157  buch  0.15412 -11.746939
200 ecoliO157  ypes  0.02705 -9.171851
211 ecoliO157  xfas  0.25833  -71.091552
217  xfas ecoliO157  0.25833  -75.091552
218  xfas  paer  0.26268  -64.920552

I think even a cursory look will tell us that there
are not as many unique genomes in the subset results
(around 8 to 10).
However when I do
unique(levels(possible_relatives_subset$genome1)), I
get

 [1] "aero"      "aful"      "aquae"     "atum_D"   
 [5] "bbur"      "bhal"      "bmel"      "bsub"     
 [9] "buch"      "cace"      "ccre"      "cglu"     
[13] "cjej"      "cper"      "cpneuA"    "cpneuC"   
[17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"     
[21] "hinf"      "hpyl"      "linn"      "llact"    
[25] "lmon"      "mgen"      "mjan"      "mlep"     
[29] "mlot"      "mpneu"     "mpul"      "mthe"     
[33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"  
[37] "paer"      "paero"     "pmul"      "pyro"     
[41] "rcon"      "rpxx"      "saur_mu50" "saur_n315"
[45] "sent"      "smel"      "spneu"     "spyo"     
[49] "ssol"      "stok"      "styp"      "synecho"  
[53] "tacid"     "tmar"      "tpal"      "tvol"     
[57] "uure"      "vcho"      "xfas"      "ypes"     

Where am I going wrong?
I tried calling unique without the levels too, which
gives me the following response

 [1] sent      styp      paer      hinf      vcho
 [6] pmul      buch      ypes      ecoliO157 xfas
60 Levels: aero aful aquae atum_D bbur bhal bmel bsub
buch cace ccre cglu cjej cper cpneuA ... ypes
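
A likely explanation, assuming genome1 was read in as a factor (the "60 Levels" line in the output points that way): subset() drops rows but keeps the factor's full set of levels, and levels() reports that full set regardless of which values remain. A minimal sketch with invented rows, using droplevels() to discard the unused levels:

```r
# Hypothetical five-row stand-in for the real dataset
dt <- data.frame(
  genome1 = c("sent", "styp", "paer", "aero", "aful"),
  Y       = c(-200.6, -240.7, -60.2, -1.0, -2.5),
  stringsAsFactors = TRUE
)
sub <- subset(dt, Y < -5)

n_levels_before <- length(levels(sub$genome1))  # all original levels, used or not
n_present       <- length(unique(sub$genome1))  # values actually present in the subset

sub$genome1    <- droplevels(sub$genome1)       # factor(sub$genome1) does the same
n_levels_after <- length(levels(sub$genome1))
```

Calling unique() on the column itself, as in the last output above, already lists only the genomes that are present; the trailing "Levels" line is just the factor's bookkeeping.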

--- Weiwei Shi [EMAIL PROTECTED] wrote:

 Then you need to provide more details about the
 calls you made and your dataset.
 For example, you can show us the output of
 str(prunedrelatives, 1)
 
 How did you call unique on prunedrelatives, and so on?
 I made a test dataset and it gave me what you wanted (omitted here).
 

[R] Package for phylogenetic tree analyses

2007-01-26 Thread lalitha viswanath
Hi
I am looking for a package that
1. reads in a phylogenetic tree in NEXUS format
2. given two members/nodes on the tree, can return the
distance between the two using the tree.

I came across the following packages on CRAN:
ouch, ape, apTreeShape, phylgr, all of which seem to
provide an extensive range of functions for reading in a
Nexus-format tree and performing phylogenetic
analyses, tree comparisons etc, but none, to the best
of my understanding, seems to provide a function to obtain
distances (in terms of branch lengths) between two
nodes on a single tree.
I am working with just one tree and need a function to
return distances between various pairs of nodes on the
tree.

Is there any other package out there that has this
functionality?

Thanks for your responses to my earlier queries. As a
beginning R programmer, your responses have been of
utmost help and guidance.

Lalitha


 




[R] unique/subset problem

2007-01-25 Thread lalitha viswanath
Hi
I am new to R programming and am using subset to
extract part of a dataset as follows

names(dataset) =
c("genome1","genome2","dist","score");
prunedrelatives <- subset(dataset, score < -5);

However when I use unique to find the number of unique
genomes now present in prunedrelatives I get results
identical to calling unique(dataset$genome1) although
subset has eliminated many genomes and records.

I would greatly appreciate your input about using
unique correctly  in this regard.

Thanks
Lalitha


 



[R] Query about extracting subset of datafram

2007-01-24 Thread lalitha viswanath
Hi
I have a table read from a mysql database which is of
the  kind

clusterid clockrate

I obtained this table in R as
clockrates_table <- sqlQuery(channel,select);
I have a function within which I wish to extract the
clockrate for a given clusterid.
Although I know that there is just one row per
clusterid in the data frame, I am using subset to
extract the clockrate.

clockrate = subset(clockrates_table, clusterid==15,
select=c(clockrate));

Is there any way of extracting the clockrate without
using subset?

In the help section for subset, it says to see
also: [,...
However I could find no help page for this entry when I
searched as ?[, etc.

The R manuals also, despite discussing complex
libraries, techniques etc, don't always seem to provide
such handy hints/tips and tricks for manipulating
data, which is a first stumbling block for newbies
like me.
I would greatly appreciate it if you could point me to
such resources as well, for future reference.

Thanks
Lalitha 
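
The "[" operator mentioned in the subset help page can do this directly; its help page opens with ?"[" or ?Extract, since the bare bracket has to be quoted. This is a minimal sketch with an invented three-row table standing in for the query result.

```r
# Hypothetical stand-in for the table returned by sqlQuery
clockrates_table <- data.frame(clusterid = c(7L, 15L, 42L),
                               clockrate = c(0.12, 0.34, 0.56))

# Logical index on the rows, column picked by name: returns a plain numeric value
rate <- clockrates_table[clockrates_table$clusterid == 15, "clockrate"]

# Same result, indexing the column vector directly
rate2 <- clockrates_table$clockrate[clockrates_table$clusterid == 15]
```

Unlike subset(), which returns a one-column data frame here, both forms return the number itself.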



 



[R] User defined function calls

2007-01-24 Thread lalitha viswanath
Hi
I have a script processfiles.R that contains, amongst
other functions
1) a database access function called get_clockrates
which retrieves from a database a table containing
columns (clusterid, clockrate) and 45000 rows (one for
each clusterid).
Clusterid is an integer and clockrate is a float.

2) process_clusterid which takes clusterid as an
argument and after doing some data processing,
retrieves the clockrate corresponding to the
clusterid.

I wish to call get_clockrates only once and keep the
dataframe returned by it as a GLOBAL which the
function process_clusterid can use for each clusterid
that it processes.

To ensure that clockrates is global, I retrieve it as
clockrate <<- sqlQuery..
I trust that this is correct.

Without the inclusion of the get_clockrates function, I
have run this script under R as follows
 source("process_files.R");
 for (index in c(1:45000)) { try(process_file(index),
silent=TRUE); }

How do I get this code to execute get_clockrates only
once and subsequently call process_file for each of
the 45000 files in turn.

I would greatly appreciate your input regarding my
query.

Thanks
Lalitha
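
One pattern that avoids a global altogether is to call the loader once at top level and pass the resulting data frame to the per-cluster function as an argument. This is a minimal sketch: the function bodies below are hypothetical stand-ins (the real get_clockrates would query the database), and only the function names come from the post.

```r
# Hypothetical stand-in for the database call described in the post
get_clockrates <- function() {
  data.frame(clusterid = 1:5,
             clockrate = c(0.11, 0.22, 0.33, 0.44, 0.55))
}

# Takes the table as an argument instead of relying on a global variable
process_clusterid <- function(clusterid, clockrates) {
  clockrates$clockrate[clockrates$clusterid == clusterid]
}

clockrates <- get_clockrates()  # executed exactly once

# Loop over cluster ids, reusing the same table each time
results <- vapply(1:5,
                  function(id) process_clusterid(id, clockrates),
                  numeric(1))
```

A variable assigned at the top level of a sourced script is also visible inside functions defined there, so the explicit argument is a design choice for clarity rather than a requirement.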


 



[R] Query about extracting subsets from a table

2007-01-23 Thread lalitha viswanath
Hi
I am trying to process tabular data as follows:

Data in the input file is of the form

genome1 genome2 tree-dist log10escore

Genome1 and genome2 are alphabetic.
Tree-dist and log10escore are numeric.

I wish to extract only those rows from this table
where the log10escore is less than -3.


data <- read.table(filename);
data$log10escore = data$log10escore[ data$log10escore
 < -3];

I would like to use this pruned list of escores to get
the corresponding genome names and tree-dist values.

I did not find anything useful in the FAQs and Notes
on R for this part of the data extraction.

As I am just beginning programming in R, I would
appreciate your input about this.

Thanks
L
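
Indexing the rows of the whole data frame, rather than a single column, keeps the genome names and distances together with the scores. This is a minimal sketch with invented rows; the column name tree_dist is an assumption, since a hyphen is not valid in an R name.

```r
data <- data.frame(
  genome1     = c("sent", "styp", "paer", "hinf"),
  genome2     = c("paer", "sent", "hinf", "styp"),
  tree_dist   = c(0.19, 0.01, 0.18, 0.08),
  log10escore = c(-100.2, -1.5, -60.2, -2.9)
)

# The logical vector selects rows; the empty column index keeps every column
pruned <- data[data$log10escore < -3, ]
```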


 



[R] Query about using optimizers in R without causing program to crash

2007-01-22 Thread lalitha viswanath
Hi
I am a newbie to R and am using  the lm function to
fit my data.
This optimization is to be performed for around 45000
files not all of which lend themselves to
optimization. Some of these will and do crash.
 
However, how do I ensure that the program simply goes
to the next file in line without exiting the code with
the error
Error in lm.fit(x, y, offset = offset, singular.ok =
singular.ok, ...) : 
NA/NaN/Inf in foreign function call (arg 4)
every time it encounters troublesome data?

I would greatly appreciate your input as it would
save me from having to manually type
for (fileId in c(4351:46000)) { ... }
for (fileId in c(5761:46000)) { ... }, etc...

Thanks
Lalitha


 



Re: [R] Query about using try block

2007-01-22 Thread lalitha viswanath
Hi
Thanks for your response.
However I seem to be doing something wrong regarding
the try block resulting in yet another error described
below.

I have a function that takes in a file name and
does the fit for the data in that file.
Hence based on your input, I tried

try ( (fit = lm(y~x, data = data_fitting)), silent =
T);


I left the subsequent lines of my code unchanged.
coeffs = as.list(coef(fit));
lambda = exp(coeffs$x)

After the change using try, when I tried to resume
processing under R as follows
source("fitting.R")
for (filename in list) { process(filename); }
it says "Cannot find object fit" ... (in the line
trying to get the coefficients...)

Am I closing the try block in the wrong place?
This function does some post processing on the
coefficients returned by coef(fit), puts them in a
list and sends it to another function.
(i.e. around 6 lines of code after the call to fit).
Thanks
Lalitha
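
The error message suggests that when lm() fails inside try(), fit is never assigned, so the following lines have nothing to work on. One way around this, sketched below with tryCatch(), is to return NULL on failure and skip the post-processing in that case. The function name fit_one and the two small data frames are hypothetical; the coefficient handling mirrors the lines quoted above.

```r
# Hypothetical per-file routine: returns exp(slope), or NULL if the fit fails
fit_one <- function(d) {
  fit <- tryCatch(lm(y ~ x, data = d), error = function(e) NULL)
  if (is.null(fit)) return(NULL)  # skip the post-processing for bad data
  coeffs <- as.list(coef(fit))
  exp(coeffs$x)
}

good <- data.frame(x = 1:10, y = 2 * (1:10) + 1)
bad  <- data.frame(x = 1:10, y = rep(Inf, 10))  # lm() raises an error on Inf

# The loop carries on past the failing data set
results <- lapply(list(good, bad, good), fit_one)
```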
--- Andreas Hary [EMAIL PROTECTED] wrote:

 Look at ?try
 
 Your code will probably need to be something like
 the following:
 
 fit <- list()
 for(fileId in 1:n){
   try(fit[[fileId]] <- lm(formula,data=???,...), silent=F)
   # or silent=T if you would rather not be shown the error messages
 }
 
 Best wishes,
 
 Andreas
 
 
 
 
 lalitha viswanath wrote:
  Hi
  I am a newbie to R and am using  the lm function
 to
  fit my data.
  This optimization is to be performed for around
 45000
  files not all of which lend themselves to
  optimization. Some of these will and do crash.
   
  However, How do I ensure that the program simply
 goes
  to the next file in line without exiting the code
 with
  the error
  Error in lm.fit(x, y, offset = offset,
 singular.ok =
  singular.ok, ...) : 
  NA/NaN/Inf in foreign function call (arg
 4)
  everytime it encounters troublesome data?
  
  I would greatly appreciate your input as it would
  avoid me having to manually type
  for fileId in (c(4351:46000)) { ... }
  for fileId in (c(5761:46000)) { ... }, etc...
  
  Thanks
  Lalitha
  
  
   
 


  
 
 -- 
 =
 Andreas Hary
 Flat 5, 70 Finsbury Park Road
 Lond, N4 2JX, UK
 
 Email:[EMAIL PROTECTED]
 Mobile: 07906 860 987
 



 



[R] Query about using table

2006-10-26 Thread lalitha viswanath
Hi
I have data of the following form
ID  age    member_FLAG
1   25     Y
2   36.75  N
3   75.5   N
.
.

I want to get a histogram of this data showing 
distribution of member_flag in each age-bin i.e. how
many values in each age bin have a member_flag of 'Y'
and how many have 'N'.
I was able to do the same using barplot2.

However I also need similar information in a tabular
form using percentages.
i.e in each age bin, what is the PERCENTAGE of IDs
with a member_flag of 'Y' 

I am trying to work with table for the same, but would
appreciate some guidance regarding the above.

Thanks
Lalitha
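
A possible sketch of the percentage table, assuming the data sit in a data frame shaped like the sample above. The data frame `members`, its values beyond the three sample rows, and the bin edges are made up for illustration. `table()` counts the flags per age bin and `prop.table()` with `margin = 1` converts each row to proportions.

```r
# Made-up example data in the shape described above.
members <- data.frame(
  ID = 1:8,
  age = c(25, 36.75, 75.5, 28, 31, 77, 45, 52),
  member_FLAG = c("Y", "N", "N", "Y", "Y", "N", "Y", "N")
)

# Assumed bin edges; replace with the breaks used for the bar plot.
age_bin <- cut(members$age, breaks = c(20, 40, 60, 80))

# Counts of Y and N in each age bin.
counts <- table(age_bin, members$member_FLAG)

# margin = 1 makes each row (age bin) sum to 1; scale to percent.
pct <- round(100 * prop.table(counts, margin = 1), 1)
print(pct)
```

An age bin with no rows at all would show NaN here, since its row total is zero.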



[R] Query about getting averages across a certain parameter in a table

2006-07-11 Thread lalitha viswanath
Hi
I have a table that goes 
data

cluster_ac  clockrate age class
7337 0.9   0.001  alpha_proteins
7888 0.1   0.78   beta proteins

etc

The class column can have 7-8 different unique values
While the clockrate and age columns are floats varying
from 0 to 1.

I wish to get the average clockrate across each of the
classes for this data.

I would appreciate your help regarding the above.

Thanks
Lalitha
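
One way this could be done, sketched on made-up rows: the data frame `proteins` and all values other than the two sample rows are assumptions. `tapply()` applies `mean` to the clockrate values within each class, and `aggregate()` returns the same averages as a data frame.

```r
# Made-up rows in the shape of the table described above.
proteins <- data.frame(
  cluster_ac = c(7337, 7888, 7901, 7950),
  clockrate  = c(0.9, 0.1, 0.5, 0.3),
  age        = c(0.001, 0.78, 0.2, 0.4),
  class      = c("alpha_proteins", "beta_proteins",
                 "alpha_proteins", "beta_proteins")
)

# Average clockrate within each class, as a named vector.
mean_by_class <- tapply(proteins$clockrate, proteins$class, mean)
print(mean_by_class)

# The same averages as a data frame with one row per class.
mean_df <- aggregate(clockrate ~ class, data = proteins, FUN = mean)
print(mean_df)
```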



[R] Query about getting a table of binned values

2006-06-20 Thread lalitha viswanath
Hi
I am working with a dataset of age and class of
proteins 
#Age
0
0.0
0.677

#Class
Type A
Type B
.
.
.
Type K

I wish to get a table that reads as follows
         0-0.02   0.02-0.04   0.04-0.06   ...   0.78-0.8
Type A   15       20          5                 8
Type B   8        6
.
.
.
Type K   10       7

I would appreciate your input regarding the
appropriate functions to use for this purpose

regards
Lalitha
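
A sketch of one approach, assuming the ages and classes are available as two parallel vectors. The vectors `age` and `class` below are made up. `cut()` assigns each age to a 0.02-wide interval and `table()` cross-tabulates class against interval.

```r
# Made-up data: one age and one class per protein.
age   <- c(0, 0.01, 0.03, 0.05, 0.677, 0.79)
class <- c("Type A", "Type A", "Type B", "Type A", "Type K", "Type B")

# include.lowest = TRUE keeps ages of exactly 0 in the first interval.
age_bin <- cut(age, breaks = seq(0, 0.8, by = 0.02), include.lowest = TRUE)

# Rows are classes, columns are age intervals, cells are counts.
binned <- table(class, age_bin)
print(binned[, 1:3])  # first three intervals only, to keep the output short
```

Intervals with no observations still appear as columns of zeros, because `cut()` returns a factor carrying every interval as a level.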



[R] Query about the functions used in tapply

2006-05-04 Thread lalitha viswanath
Hi
I am trying to plot an x-y plot of the values of a
certain variable against bins,
i.e. the x-axis goes from 0 to 0.7 in increments of
0.02 while the y-axis is the average of values for all
the points in that interval.

Hence I first used cut to break the data into
intervals, then I applied tapply using mean as the
function and plotted the results.

I also replaced mean with median.

the 3 sets of functions that I used were

However I am finding that the actual value plotted in
the y-axis somehow does not seem to be correct?

i.e. for example in the interval 0.38-0.4 there are a
humungous number of points with y-axis value below 20
while there are very few with y-axis value above 20.
However the median plotted is still around the 20
mark.
It does not seem intuitive looking at the data that
more than 50% of the points have a clock_rate (plotted
on the y-axis) above 20.

Is there something about the way these functions work
with tapply, that I am missing?
Any obvious mistakes that I should look for?

SWfac <- cut(sorted_inp$age[1:290], seq(0, 0.7, 0.02))
SLmean <- tapply(sorted_inp$clock_rate[1:290], SWfac, mean)
plot(SLmean, type = "b", xaxt = "n")
axis(1, seq_along(SLmean), levels(SWfac))

I tried a simple x-y scatter plot of the same 290 rows
in excel (without binning them) and the concentration
of points at lower values of clock rates does not seem
to indicate that the medians should be as high as they
are shown.

Hoping to hear further
Regards
Lalitha
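
A possible way to check the per-bin values, sketched on simulated data: the data frame `inp` and its distributions are made up and only stand in for `sorted_inp`. Listing the number of points, the median, and the share of points above 20 side by side for each bin shows whether a plotted value really disagrees with the data. Two things may also be worth checking in the posted code: it passes `mean` rather than `median`, and it uses only the first 290 rows of a sorted data frame.

```r
set.seed(42)  # simulated data, for illustration only
inp <- data.frame(
  age        = runif(290, 0, 0.7),
  clock_rate = rexp(290, rate = 1 / 15)
)

bins <- cut(inp$age, seq(0, 0.7, 0.02))

# One row per bin: count, median, and fraction of points above 20.
summary_by_bin <- data.frame(
  n              = as.vector(table(bins)),
  median         = as.vector(tapply(inp$clock_rate, bins, median)),
  share_above_20 = as.vector(tapply(inp$clock_rate > 20, bins, mean)),
  row.names      = levels(bins)
)
print(head(summary_by_bin))
```

A bin whose median is above 20 must have at least half of its points above 20, so any row that appears to break that rule points to a problem in how the data were subset or binned rather than in `tapply()`.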



Re: [R] table of means/medians across bins used for a histogram

2006-05-01 Thread lalitha viswanath
Hi
I think I seem to have phrased my doubt incorrectly.
I want an x-y plot of age v/s rate (the bin is
irrelevant for this plot); only that instead of a
simple x-y plot, I want a plot of average(rate) for
each age interval.

My ages vary from 0 to 0.7 and I want to divide them
in groups of 0.02.

So I want a plot of the following
Age-intervals   Average rate in that interval
0-0.02          5
0.02-0.04       7
0.04-0.06       1
0.06-0.08       0
0.08-0.1        0.15

Age-intervals mentioned along the x-axis (like for a
histogram) and rates plotted for each age-interval
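
A rough sketch of such a plot, assuming a data frame with `age` and `rate` columns. The first four rows of `DF` come from the sample data in this thread and the last two are made up. `cut()` forms the 0.02-wide age intervals, `tapply()` averages the rate within each, and the averages are plotted against the interval midpoints.

```r
# First four rows from the sample in this thread; last two are made up.
DF <- data.frame(
  age  = c(0.002, 0.045, 0.13, 0.15, 0.01, 0.05),
  rate = c(10, 0.1, 15, 34, 0, 2)
)

breaks   <- seq(0, 0.7, by = 0.02)
interval <- cut(DF$age, breaks, include.lowest = TRUE)

# Average rate per age interval; intervals with no data come out as NA.
avg_rate <- tapply(DF$rate, interval, mean)

# Plot each average at the midpoint of its interval.
mids <- head(breaks, -1) + 0.01
plot(mids, avg_rate, type = "b",
     xlab = "Age interval midpoint", ylab = "Average rate")
```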
   
--- Gabor Grothendieck [EMAIL PROTECTED]
wrote:

 Or perhaps a bit simpler:
 
 plot(age ~ ave(clock, bin), DF)
 
 
 On 4/30/06, Gabor Grothendieck
 [EMAIL PROTECTED] wrote:
  My understanding is that you want to replace each
 rate with its average
  over the associated bin and then plot age against
 that.  In that
  case try this:
 
   DF  # test data
 age rate bin
  1 0.002 10.0   A
  2 0.045  0.1   B
  3 0.130 15.0   A
  4 0.150 34.0   D
   with(DF, plot(ave(rate, bin), age))
 
  Assuming they
  are stored in vectors
  the columns are age, rate, bin we would have
 
  plot(ave(clock, bin), age)
 
  On 4/30/06, lalitha viswanath
 [EMAIL PROTECTED] wrote:
   Hi
   I am trying to get a table of means of parameter
 1
   across BINS of parameter 2.
  
   I am working in proteomics and a sample of my
 data is
   as follows
  
   cluster-age  clock-rate (evolutionary rate)  scopclass
   0.002        10                              A
   0.045        0.1                             B
   0.13         15                              A
   0.15         34                              D
   
   
   
   
  
   Scop class has only 9 distinct categories (A-I)
   Whereas cluster-age and clock-rate are discrete
   variables greater than 0.
  
   I am trying to do two things with this kind of
 data,
   out of which I managed to accomplish one thanks
 to the
   documentation and pre-existing queries on the
 mailing
   lists.
   1. Plot a histogram of the age distribution with
 scop
   class category superimposed on each bin. I
 managed to
   do this with barplot2.
   2. Now I am trying to plot a scatter plot of the
 age
   v/s the clock-rate. However to eliminate
 possible
   sampling errors, we are trying to get an average
 of
   the clock-rate for each of the bins used above.
   i.e. before plotting a x-y plot, i wish to
 compute
   average clock-rate in each of the bins for the
 age and
   then plot a x-y plot of the age v/s clock rate.
  
   Can anyone point me to appropriate functions for
 the
   same?
   I am trying to work with prop.table, cut, break,
 etc.
   But I am not heading anywhere.
  
   Thanks
   Lalitha
  
  
 




[R] table of means/medians across bins used for a histogram

2006-04-30 Thread lalitha viswanath
Hi
I am trying to get a table of means of parameter 1
across BINS of parameter 2.

I am working in proteomics and a sample of my data is
as follows

cluster-age  clock-rate (evolutionary rate)  scopclass
0.002        10                              A
0.045        0.1                             B
0.13         15                              A
0.15         34                              D


 


Scop class has only 9 distinct categories (A-I)
Whereas cluster-age and clock-rate are discrete
variables greater than 0.

I am trying to do two things with this kind of data,
out of which I managed to accomplish one thanks to the
documentation and pre-existing queries on the mailing
lists.
1. Plot a histogram of the age distribution with scop
class category superimposed on each bin. I managed to
do this with barplot2. 
2. Now I am trying to plot a scatter plot of the age
v/s the clock-rate. However to eliminate possible
sampling errors, we are trying to get an average of
the clock-rate for each of the bins used above. 
i.e. before plotting a x-y plot, i wish to compute
average clock-rate in each of the bins for the age and
then plot a x-y plot of the age v/s clock rate.

Can anyone point me to appropriate functions for the
same?
I am trying to work with prop.table, cut, break, etc.
But I am not heading anywhere.

Thanks
Lalitha
