[R] Minor documentation issue
(Sorry about the last email, which was incomplete. I hit 'send' accidentally.)

I looked at ?seq. One of the forms given under Usage is seq(from). This is the form used when seq is called with only one argument. However, in that case the single argument actually behaves as the end of the sequence, so this should arguably say seq(to). For example:

seq(1)
[1] 1
seq(3)
[1] 1 2 3

Cheers,
-- Vivek Satsangi, Rochester, NY, USA

_______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] (newbie) Weighted qqplot?
Folks,

Normally, in a data frame, one observation counts as one observation of the distribution. Thus one can easily produce a CDF and (in S-Plus at least) use cdf.compare to compare CDFs. (BTW: what is the R equivalent of the S-Plus cdf.compare() function, if any?) However, if each point should not count equally, how can I weight the points before comparing the distributions? I was thinking of somehow creating multiple observations for each actual observation based on the weights and building a new data frame, etc. -- but that seems excessive. Surely there is a simpler way?

x <- rnorm(100)
y <- rnorm(10)
xw <- rnorm(100) * 1.73 # The weights. These won't add up to 1 or N or anything, because of missing values.
yw <- rnorm(10) * 6.23  # The weights. These won't add up to 1 or to the same number as xw.
# The question: how can I create a qq plot or CDF comparison of x vs. y,
# weighted by xw and yw (to eventually figure out whether y comes from the
# population x, similar to a Kolmogorov-Smirnov goodness-of-fit test)?
qqplot(x, y) # What now?

Thanks for any help,
-- Vivek Satsangi, Student, Rochester, NY, USA
Life is short, the art long, opportunity fleeting, experiment treacherous, judgement difficult.
Re: [R] (newbie) Weighted qqplot?
Folks,

I am documenting what I finally did, for the next person who comes along. Following Dr. Murdoch's suggestion, I looked at qqplot. The following approach might be helpful to get to the same information as given by qqplot. To summarize the question: given x, y, xw and yw, show (visually is okay) whether x and y are from the same distribution, where xw is the weight of each x observation and yw is the weight of each y observation.

Put x and xw into a data frame. Sort by x. Calculate cumulative x weights, normalized to total 1. Put y and yw into a data frame. Sort by y. Calculate cumulative y weights, normalized to total 1. Plot x and y against their cumulative normalized weights. The shapes of the two lines should be similar (to the eye) -- otherwise the distributions are different.

Vivek

On 3/15/06, Duncan Murdoch [EMAIL PROTECTED] wrote:
On 3/15/2006 8:31 AM, Vivek Satsangi wrote: [original question quoted in full above]
qqplot doesn't support weights, but it's a simple enough function that you could write a version that does. Look at the cases where length(x) is not equal to length(y): e.g. if length(y) < length(x), qqplot constructs a linear approximation to a function mapping 1:nx onto the sorted x values, then takes length(y) evenly spaced values from that function. You want to do the same sort of thing, except that instead of even spacing, you want to look at the cumulative sums of the weights. You might want to use some kind of graphical indicator of whether points are heavily weighted or not, but I don't know what to recommend for that. By the way, your example above will give negative weights in xw and yw; you probably won't like the results if you do that.

Duncan Murdoch

-- Vivek Satsangi, Student, Rochester, NY, USA
Life is short, the art long, opportunity fleeting, experiment treacherous, judgement difficult.
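Putting the two posts above together, the weighted-ECDF overlay can be sketched as follows. This is a minimal illustration, not code from the thread: the helper wecdf() is my own name, and the weights are drawn from runif() so they stay positive (per Duncan's caveat about negative weights).

```r
## Weighted empirical CDF: sort the values, accumulate the weights,
## normalize the cumulative weights to total 1.
wecdf <- function(v, w) {
  o  <- order(v)                 # sort order of the values
  cw <- cumsum(w[o]) / sum(w)    # cumulative weights, normalized to 1
  list(x = v[o], p = cw)
}

set.seed(1)
x  <- rnorm(100); xw <- runif(100)  # positive weights
y  <- rnorm(10);  yw <- runif(10)

ex <- wecdf(x, xw)
ey <- wecdf(y, yw)

## Overlay the two weighted CDFs; similar shapes (to the eye) suggest
## the samples come from similar distributions.
plot(ex$x, ex$p, type = "s", xlab = "value", ylab = "cumulative weight")
lines(ey$x, ey$p, type = "s", col = "red")
```

The same wecdf() output could also feed a weighted qq plot, by interpolating one sample's sorted values at the other sample's cumulative weight positions, as Duncan describes.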
[R] (newbie) Accessing the pieces of a 'by' object
Folks,

I know that I can do the following using a loop. That's been a lot easier for me to write and understand, but I am trying to force myself to use more vectorized / matrixed code so that eventually I will become a better R programmer. I have a data frame that has some values by Year, Quarter and Ranking. The variable of interest is the return (F3MRet), to be weighted-averaged within each year, quarter and ranking. At the end, we want a table like this:

year quarter ranking1 ranking2 ... ranking10
1987 1 1.33 1.45 ... 1.99
1987 2 6.45 3.22 ... 8.33
...
2005 4 2.22 3.33 ... 1.22

The dataset is too large to post and I can't come up with a small working example very easily. I tried the Reshape() package and also the aggregate and reshape functions. Those don't work too well because of the need to pass weighted.mean a weights vector. I tried the by() function, but now I don't know how to coerce the returned object into a matrix so that I can reshape it.

fvs_weighted.mean <- function(y) weighted.mean(y$F3MRet, y$IndexWeight, na.rm = TRUE)
tmp_byRet <- by(dfReturns, list(dfReturns$Quarter, dfReturns$Year, dfReturns$Ranking), fvs_weighted.mean)

Various other ways to get the tmp_byRet object into a matrix were tried, e.g. unlist(), and a loop like this:

dfRet <- data.frame(tmp_byRet)
for (i in 1:dim(dfRet)[2]) {
  dfRet[, i] <- as.vector(dfRet[, i])
}

In each case, I got some error or other. So, please help me get unstuck. How can I get the tmp_byRet object into a matrix or a data frame?

-- Vivek Satsangi, Rochester, NY, USA
No amount of sophistication is going to allay the fact that all your knowledge is about the past and all your decisions are about the future. -- Ian Wilson
Re: [R] (newbie) Accessing the pieces of a 'by' object
I am writing to document the answer for the next poor sod who comes along.

To get tmp_byRet into a multi-dimensional array, copy the object using as.vector(), then copy the dim and dimnames from tmp_byRet into the new object. However, this may not be what you want, since you probably want the values of the factors within the object (i.e. it should be a data frame, not a matrix). To get tmp_byRet into a data frame: use unique() to create a data frame with just the unique values of your factors; add a new column to the data frame, where you will store the summary stats; use a loop to populate this column; then use reshape() on the data frame to get it into the shape you want. It is difficult at best to vectorize this and avoid the loop -- and trying to do so will probably lead to less transparent code.

Vivek

On 3/7/06, Vivek Satsangi [EMAIL PROTECTED] wrote: [original question quoted in full above]
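The recipe above can be sketched end-to-end on a toy data set. Everything here is illustrative: dfReturns is a small made-up stand-in for the real data (which was too large to post), with only two years, two quarters and two rankings.

```r
## Toy stand-in for the real dfReturns.
set.seed(42)
dfReturns <- data.frame(
  Year        = rep(1987:1988, each = 4),
  Quarter     = rep(rep(1:2, each = 2), times = 2),
  Ranking     = rep(1:2, times = 4),
  F3MRet      = rnorm(8),
  IndexWeight = runif(8)
)

fvs_weighted.mean <- function(y)
  weighted.mean(y$F3MRet, y$IndexWeight, na.rm = TRUE)

## Step 1: one row per unique (Year, Quarter, Ranking) combination.
keys <- unique(dfReturns[, c("Year", "Quarter", "Ranking")])

## Step 2: loop to fill in the summary statistic for each combination.
keys$WRet <- NA_real_
for (i in seq_len(nrow(keys))) {
  sel <- dfReturns$Year    == keys$Year[i] &
         dfReturns$Quarter == keys$Quarter[i] &
         dfReturns$Ranking == keys$Ranking[i]
  keys$WRet[i] <- fvs_weighted.mean(dfReturns[sel, ])
}

## Step 3: reshape to wide form -- one column per ranking.
wide <- reshape(keys, idvar = c("Year", "Quarter"),
                timevar = "Ranking", direction = "wide")
```

The result has one row per (Year, Quarter) and columns WRet.1, WRet.2, ..., matching the table layout asked for in the question.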
[R] Minor documentation improvement
Gentlemen,

In the documentation for reshape, the argument direction is not listed in the function signature; however, it is explained in the parameter descriptions below. I am using R 2.2.1.

Out of curiosity: is the R core team still an all-male affair? I don't think I have seen a single lady's name.

-- Vivek Satsangi, Student, Rochester, NY, USA
[R] (Newbie) Aggregate for NA values
Folks,

Sorry if this question has been answered before or is obvious (or worse, statistically bad). I don't understand what was said in one of the search results that seems somewhat related. I use aggregate to get a quick summary of the data. Part of what I am looking for in the summary is: how much influence might the NA's have had if they were included, and is excluding them from the means causing some sort of bias? So I want the summary stat for the NA's also. Here is a simple example session (edited to remove the typos I made, comments added later):

tmp_a <- 1:10
tmp_b <- rep(1:5, 2)
tmp_c <- rep(1:2, 5)
tmp_d <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
tmp_df <- data.frame(tmp_a, tmp_b, tmp_c, tmp_d)
tmp_df$tmp_c[9:10] <- NA
tmp_df
   tmp_a tmp_b tmp_c tmp_d
1      1     1     1     1
2      2     2     2     1
3      3     3     1     1
4      4     4     2     2
5      5     5     1     2
6      6     1     2     2
7      7     2     1     3
8      8     3     2     3
9      9     4    NA     3
10    10     5    NA     4

aggregate(tmp_df$tmp_d, by = list(tmp_df$tmp_b, tmp_df$tmp_c), mean)
  Group.1 Group.2 x
1       1       1 1
2       2       1 3
3       3       1 1
4       5       1 2
5       1       2 2
6       2       2 1
7       3       2 3
8       4       2 2
# Only one row for each (tmp_b, tmp_c) combination; NA's get dropped.

aggregate(tmp_df$tmp_d, by = list(tmp_df$tmp_c), mean)
  Group.1    x
1       1 1.75
2       2 2.00

What I want in this last aggregate is a mean for the values in tmp_d that correspond to the tmp_c values of NA. Similarly, perhaps there is a way to make the second-to-last call to aggregate return the values of tmp_d for the NA values of tmp_c also. How can I achieve this?

-- Vivek Satsangi, Student, Rochester, NY, USA
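One way to get the NA group included, sketched on the same toy data: wrap the grouping variable in factor(..., exclude = NULL), which makes NA an explicit factor level instead of a missing value, so the grouping no longer drops those rows. (This is my suggestion, not from the thread.)

```r
tmp_df <- data.frame(tmp_a = 1:10,
                     tmp_b = rep(1:5, 2),
                     tmp_c = rep(1:2, 5),
                     tmp_d = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4))
tmp_df$tmp_c[9:10] <- NA

## With NA as an explicit level, aggregate returns a third group
## covering rows 9 and 10, whose mean is (3 + 4) / 2 = 3.5.
res <- aggregate(tmp_df$tmp_d,
                 by = list(factor(tmp_df$tmp_c, exclude = NULL)),
                 mean)
res
```

The same trick applies to the two-variable grouping: wrap tmp_b and tmp_c each in factor(..., exclude = NULL) inside the by list.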
Re: [R] Minor documentation improvement
Please ignore this message. I was not reading carefully enough; the parameter is in there.

Vivek

On 2/24/06, Vivek Satsangi [EMAIL PROTECTED] wrote: [original message quoted in full above]

-- Vivek Satsangi, Student, Rochester, NY, USA
Re: [R] R-help, specifying the places to decimal
In addition to round() mentioned earlier, if you are merely looking to *display* your results differently, you may want to check out the digits option, e.g. in summary(). (This is the method signature for data frames):

summary(object, maxsum = 7, digits = max(3, getOption("digits") - 3), ...)

(Begin quoted message)
Date: Mon, 13 Feb 2006 14:03:55 +0530
From: Subhabrata [EMAIL PROTECTED]
Subject: [R] R-help, specifying the places to decimal
To: r-help r-help@stat.math.ethz.ch

Hello R-experts, Is there any way by which we can specify the number of places after the decimal point to keep? For instance, I have a situation where the values come out as 0.160325923 but I only want 4 decimal places, i.e. 0.1603. Is there any way to do that? I am no expert in R and this may sound simple to many -- sorry. Thanks for any help. With Regards, Subhabrata

-- Vivek Satsangi, Student, Rochester, NY, USA
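To round out the answer, here is a short sketch of the main options, distinguishing functions that change the value from those that only change the display:

```r
x <- 0.160325923

round(x, 4)            # rounds the value itself to 4 decimal places
signif(x, 4)           # 4 significant digits instead of 4 decimal places
sprintf("%.4f", x)     # formats for display only; x itself is unchanged
format(x, digits = 4)  # display with about 4 significant digits
```

round() and signif() are the right tools when later arithmetic should use the truncated value; sprintf()/format() (and the digits arguments to summary() and options()) are the right tools when only the printout should change.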
Re: [R] Bloomberg Data Import to R
Hi Sumanta,

1. This message is much more appropriate for the r-sig-finance list instead. Consider signing up (I read up on Amba, so I am sure you have good contributions to make in that forum).
2. To my knowledge, there isn't a direct package. However, if you use Bloomberg's Excel plugin, just get the data into Excel, save, and then bring it in as usual. I suspect that that's what you are doing already.
3. You may have better luck with the S-Plus plugins. I am just getting started (and don't have any support/maintenance contract), so I don't know what all Insightful has up its sleeve, but I talked to Carol Wedekind about just this thing yesterday. Dr. Yollin, who also listens in on the sig-finance list, may be able to advise you better about what exists.

With warm regards, Vivek

Date: Wed, 8 Feb 2006 15:51:13 +0530
From: Sumanta Basak [EMAIL PROTECTED]
Subject: [R] Bloomberg Data Import to R
To: r-help@stat.math.ethz.ch

Hi R-Experts, Can anyone tell me how Bloomberg data can be directly downloaded into R? Is there any package? Sumanta Basak.
[R] (Newbie) Merging two data frames
This one is an easy question; I am looking for the idiomatic way to do it. I have two large data frames that I want to merge. What is the idiomatic way to say: match the rows from data frame 1 to the rows in data frame 2 that have the same values in the fields Identifier, Year and Quarter (these three fields form something like a composite primary key in SQL), and then tell me which rows could not be matched?

-- Vivek Satsangi, Student, Rochester, NY, USA
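The idiomatic answer is merge() with a composite by key. A minimal sketch on hypothetical toy frames (the column names follow the question; Ret and Weight are made up):

```r
df1 <- data.frame(Identifier = c("A", "A", "B"),
                  Year = c(1987, 1987, 1988),
                  Quarter = c(1, 2, 1),
                  Ret = c(1.1, 2.2, 3.3))
df2 <- data.frame(Identifier = c("A", "C"),
                  Year = c(1987, 1988),
                  Quarter = c(1, 1),
                  Weight = c(0.5, 0.7))

## Inner join on the composite key -- only rows present in both frames:
both <- merge(df1, df2, by = c("Identifier", "Year", "Quarter"))

## Left join: keep all rows of df1; unmatched rows get NA in df2's columns,
## which makes them easy to pick out afterwards.
left <- merge(df1, df2, by = c("Identifier", "Year", "Quarter"), all.x = TRUE)
unmatched <- left[is.na(left$Weight), ]
```

all.y = TRUE (or all = TRUE) gives the other join directions; picking out the NA-filled rows after an outer join is the usual way to answer "which rows found no match".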
[R] Possible improvement in lm
Folks,

I do a series of regressions (one for each quarter in the dataset) and then go and extract the residuals from each stored lm object as follows:

vResiduals <- as.vector(unlist(resid(lQuarterlyRegressions[[i]])))

Here lQuarterlyRegressions is a list of objects returned by lm(). Next, I may go find outliers using identify() on a plot, or do some other analysis that tells me which row of the quarterly data I need to take a closer look at. However, if I try to match some point in one of the quarters with its residual, I first have to drop the points from my current data that have NA's in either the explanatory variables or the explained variable, so that the vector of residuals and the data have the same indexes. This led to some serious confusion/bugs for me, and I am wondering if it might not be better for lm to put an NA into those rows where the point was dropped because of NA's in the explanatory or explained variables (currently it just returns nothing at that index). Of course, there might be some arguments against this idea, and I would be interested to hear them.

Thank you for your time and attention,
-- Vivek Satsangi, Student, Rochester, NY, USA
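For what it's worth, lm() can already be asked to behave this way through its na.action argument: with na.action = na.exclude, resid() and fitted() are padded back to the original number of rows, with NA in the dropped positions. A sketch on toy data:

```r
## Toy data with an NA in the explanatory variable (row 3).
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(1.1, 2.0, 2.9, 4.2, 5.1))

## Default na.action = na.omit would give only 4 residuals;
## na.exclude keeps the alignment with the original rows.
fit <- lm(y ~ x, data = d, na.action = na.exclude)
r <- resid(fit)

length(r)   # same as nrow(d); r[3] is NA
```

With this, residuals line up index-for-index with the data frame, so identify() row numbers can be used directly.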
[R] / Operator not meaningful for factors
Folks,

I have a very basic question. The solution eludes me, perhaps because of my own lack of creativity. I am not attaching a fully reproducible session because the issue may well be caused by the way the data file is, and the data file is large (and I don't know whether I can legally distribute it). If people can suggest things that might be wrong in my data or in the way that I am reading it, I would be most grateful. I get the following error message in the session quoted at the end of this email:

/ not meaningful for factors in: Ops.factor(BookValuePS, Price)

As you can see in that same session, I check that the two vectors being divided are numeric. I also check that the divisor is not 0 at any index. I also believe that this is not because of the NA's in the data. My question is: what other problems can cause the / operator to not be meaningful? I did try some simple examples to get the same error. However, I am not sure how to put the same NA's that one gets from read.table() into a vector:

a <- c(1, 2, 3, NA)
a
[1]  1  2  3 NA
b <- c(1, 2, 3, 4)
c <- b / a
b
[1] 1 2 3 4
a <- c(1, 2, 3)
c <- b / a
Warning message: longer object length is not a multiple of shorter object length in: b/a

Quoted session below:

explainPriceSimplified <- read.table("combinedClean.csv", sep = ",", header = TRUE)
attach(explainPriceSimplified)
summary(explainPriceSimplified)
[summary output condensed: Symbol tabulates ticker counts (XL: 98, ZION: 97, ...); Date shows Min. 19870630, Max. 20041231; Price is summarized as level counts ("22": 61, "26.5": 61, "27.5": 58, ..., (Other): 29561, NA's: 249) rather than numeric quantiles; EPS and BookValuePS show numeric quantiles (e.g. BookValuePS Median 7.882, Max. 3.366e+06); FiscalQuarterRep tabulates counts; F12MRet shows Min. -100.00, Median 10.57, Mean 13.36, Max. 4700.00, NA's 463.]
mode(Price)
[1] numeric
mode(EPS)
[1] numeric
mode(BookValuePS)
[1] numeric
BP <- BookValuePS / Price
Warning message: / not meaningful for factors in: Ops.factor(BookValuePS, Price)
which(Price == 0)
numeric(0)

-- Vivek Satsangi, Student, Rochester, NY, USA
Re: [R] / Operator not meaningful for factors
Sir,

I made the (incorrect, probably unjustified) deduction that mode() was the right test based on section 3.1 of "An Introduction to R". Since the write-up talks about the mode of an object, and attr() did not work (it gives an error saying that the mode of 'name' must be character), I tried mode() and reached this incorrect conclusion. I have had this confusion for a while now, about the fact that something can be numeric AND a factor, since if it were just a vector and not a factor, it would still be numeric, as in:

a <- c(1, 2, 3)
class(a)
[1] "numeric"

I'll try to think of a way to improve the explanation in "An Introduction to R" so that the next person coming along does not fall into the same pit.

Thank you for getting me unstuck,
Vivek

On 1/15/06, Prof Brian Ripley [EMAIL PROTECTED] wrote:

The mode of a factor is numeric, so your test does not do what you think it does. is.numeric() is the recommended test of a vector being numeric. I have no idea where you got the idea that mode() was a useful test (perhaps you could give us the reference you used), but it rather rarely is (typeof is usually more informative). From the summary quoted, Price is clearly a factor. Test it with is.factor.

On Sun, 15 Jan 2006, Vivek Satsangi wrote: [original question and session quoted in full above, with Prof. Ripley's interleaved replies:]

(see the request above for your reference here)

Why not test for factor, since that is what the very helpful error message told you the problem was?

-- Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595

-- Vivek Satsangi, Student, Rochester, NY, USA
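To make Prof. Ripley's point concrete, here is a short sketch. The vector f is made up: it stands in for a numeric column that read.table() turned into a factor because of one non-numeric token in the file.

```r
f <- factor(c("1.5", "2.5", "oops", "4.0"))  # numeric column polluted by one bad token

mode(f)        # "numeric" -- factors store integer codes, so mode() misleads
is.numeric(f)  # FALSE -- the recommended test
is.factor(f)   # TRUE  -- what the error message was pointing at

## Recover the numbers: convert to character first, then to numeric.
## The bad token becomes NA (with a coercion warning).
v <- suppressWarnings(as.numeric(as.character(f)))
```

Converting with as.numeric(f) directly would silently return the internal level codes, not the original numbers, which is why the as.character() step matters.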
[R] Suggested add to the documentation for the identify() function
Folks,

1. Is there a more appropriate list (r-devel?) for posting such suggestions? I am a newbie to R, and doubtless will have some suggestions for the documentation -- some good, others not quite so good. I would actually like to give back to the community (I was motivated by Prof. Ripley's 2001 talk in which he commented that open source software users rarely give back anything) -- but I know very little right now, so I might make things worse in some cases.

2. I would like to suggest adding the following to the examples section of the help for the identify function:

Suppose you want to be able to remove some points from your analysis. In its simplest form, identify() will give you the row number of the points that you mark. Try running the following three commands:

plot.new()
plot(1:10, 1:10)
identify(x = 1:10, y = 1:10, n = 10)

What you will observe is that when you click on the points of the plot, it will show the row number of those points. If you are using some other function to produce your plot, identify can work with that as well; just use the same vectors in the arguments to plot and identify. Next, you can remove those outlier points from your data using:

x1 <- x[-c(3, 5, 7), ]

In this case x is your original matrix and 3, 5, 7 are the row numbers shown by identify() for your outlier data points. See also: negative subscripts.

3. My most sincere apologies for sending HTML in my email to the distribution list last time.

-- Vivek Satsangi
[R] Cacheing in read.table/ attached data?
Disclaimer/Apology: I am an R newbie.

I am seeing some behaviour that seems to me to be the result of some cacheing going on at some level, and perhaps this is expected behaviour; I would just like to understand the basic rules. I have a file with some data. I read it in and then do a summary on the resulting data frame. I find that some values are completely outside the expected range; these values need to be dropped from further analysis as erroneous observations (yes, I apologize to the purists in advance :-) ). If I do this, read the file again, and then circlesPlot (from fBasics) two of the columns in the data, the plot is not updated: the outlier point is still there. However, when I detach and reattach the data frame, it works okay. For example:

# Plot has the outlier point in it.
# Edit the file, commenting out the outlier line, save, then...
SG <- read.table("c:/Vivek/MFC/Data/SG/combinedSG.tdf", header = TRUE, sep = "\t")
SGm2 <- lm(A3Yr ~ A10Holdings, data = SG)
circlesPlot(A10Holdings, A3Yr, size = NetAssets)
abline(coef(SGm2)) # Put the regression line on the plot
SG <- read.table("c:/Vivek/MFC/Data/SG/combinedSG.tdf", header = TRUE, sep = "\t")
summary(SG) # Outlier does not show in the summary
circlesPlot(A10Holdings, A3Yr, size = NetAssets) # ...but the plot still has the outlier
detach(SG)
attach(SG)
circlesPlot(A10Holdings, A3Yr, size = NetAssets) # Outlier is gone from the plot

So, here are my questions:
1. Is there a simpler / more idiomatic way in R to exclude some outliers from the data (i.e. to do data trimming) than commenting out the lines in the data file? In EViews this is done by setting the sample.
2. Is the flushing of the cache happening as a result of the detach/attach, or for some other reason?

Thanks for any help,
Vivek Satsangi
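On question 1, the usual idiom is to trim inside R with subset() or logical indexing rather than editing the file. A minimal sketch on a toy frame standing in for SG (the column names follow the post; the outlier threshold is made up):

```r
## Toy stand-in for SG; the last row is an obvious outlier.
SG <- data.frame(A10Holdings = c(10, 20, 30, 999),
                 A3Yr        = c(1.2, 1.4, 1.6, 50))

## Drop the outlier without touching the data file.
SGt <- subset(SG, A3Yr < 10)

## Using data= / with() instead of attach() avoids the stale-copy
## problem: columns are looked up in the frame at call time.
fit <- lm(A3Yr ~ A10Holdings, data = SGt)
with(SGt, plot(A10Holdings, A3Yr))
```

On question 2: attach() places a *copy* of the data frame's columns on the search path, so plots made with bare column names keep seeing that snapshot; re-reading the file rebinds the name SG but not the attached copy, which is why only detach()/attach() refreshed the plot.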