[R] combining dataframes with different numbers of columns
Dear list members,

I have to combine dataframes. However, they contain different numbers of variables. It is possible that all the variables in the dataframe with fewer variables are contained in the dataframe with more variables, though it is not always the case. There are key variables identifying observations. These could be used in a merge statement, although this won't quite work for me (see below). I was hoping to find a way to combine dataframes where I needed only to ensure the key variables were present. The total number of variables in the final dataframe would be the total number of different variables in both initial dataframes. Variables that were absent in one dataframe would automatically get missing values in the joint dataframe.

Here is a simple example. The initial dataframes are a and b. All variables in b are also in a.

    a <- data.frame(X=seq(1,10), matrix(runif(100, 0, 15), ncol=10))
    b <- data.frame(X=seq(16,20), X4=runif(5, 0, 15))

A merge does not work because the common variable X4 becomes 2 variables, X4.x and X4.y:

    c <- merge(a, b, by="X", all=TRUE)

This can be fixed, but it requires several steps (although my solution is probably not optimal):

    names(c)[5] <- "X4"
    c$X4[is.na(c$X4)] <- c$X4.y[is.na(c$X4)]
    c <- c[, 1:11]

One quickly becomes tired of this solution with my real-life dataframes, where different columns would require repair from one case to the next. I think I still prefer making the narrower dataframe like the wider one:

    library(Hmisc)  # for upData()
    b2 <- upData(b, X1=NA, X2=NA, X3=NA, X5=NA, X6=NA, X7=NA, X8=NA, X9=NA, X10=NA)
    b2 <- b2[, c(1, 3:5, 2, 6:11)]
    d <- rbind(a, b2)

But again this requires quite a bit of fine-tuning from one case to the next in my real-life dataframes. I suspect R has a neat way to do this and I just cannot come up with the proper search term to find help on my own. Or this can be automated: can one compare variable lists from 2 dataframes and add missing variables to the narrower dataframe? Ideally, the solution would be able to handle the situation where the narrower dataframe contains one or more variables that are absent from the wider one. If this was the case, I'd like the new variable to be present in the combined dataframe, with missing values given to the observations from the wider dataframe.

Thanks in advance,

Denis Chabot
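A minimal sketch of two base-R routes, using the a and b defined above (my reconstruction; not necessarily the solutions the repliers posted, and add_missing/rbind_fill2 are hypothetical helper names). Merging on every shared column, which is merge's default 'by', already gives the union of rows and columns with NA filled in automatically:

    ab <- merge(a, b, all = TRUE)

    ## General helper: pad each data frame with the columns it lacks, then stack.
    add_missing <- function(x, cols) {
      for (v in setdiff(cols, names(x))) x[[v]] <- NA   # create the absent columns as NA
      x
    }
    rbind_fill2 <- function(x, y) {
      x <- add_missing(x, names(y))
      y <- add_missing(y, names(x))
      rbind(x, y[names(x)])                             # align column order, then stack
    }
    d <- rbind_fill2(a, b)

The helper also handles the case where the narrower dataframe has variables absent from the wider one, since the padding is done in both directions.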
Re: [R] combining dataframes with different numbers of columns
Thank you very much Hadley, Stephen, and Sundar; your suggestions all solve my problem, even when the narrower dataframe contains variables that are new to the wider dataframe. I'm sure glad I took the time to write instead of persisting with my ugly and time-consuming solution.

Denis
[R] distance between legend title and legend box
Hi,

I've looked at the parameters available for the legend function and cannot find a way to change the distance between the top of the box surrounding a legend and the legend's title. I have a math expression that raises the height of my title. If you don't mind the nonsensical title I give to the legend for this plot (Figure 3.20 in R Graphics):

    with(iris, plot(Sepal.Length, Sepal.Width, pch=as.numeric(Species), cex=1.2))
    legend(6.5, 4.2, c("setosa", "versicolor", "virginica"), cex=1, pch=1:3,
           title=expression(kg/km^2))

The result depends on the device, but I think any device will show the box needs to be raised a bit (in quartz, the top of the box passes through the middle of the 2; in pdf it is acceptable, but just barely: the top of the box lightly touches the top of the 2).

Sincerely,

Denis Chabot
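One workaround (my sketch, not from the thread): draw the legend without a title, then place the expression yourself just above the box, at whatever distance looks right. legend() invisibly returns the box geometry, which makes the positioning easy:

    with(iris, plot(Sepal.Length, Sepal.Width, pch=as.numeric(Species), cex=1.2))
    lg <- legend(6.5, 4.2, c("setosa", "versicolor", "virginica"), cex=1, pch=1:3)
    text(lg$rect$left + lg$rect$w/2,                       # centre of the legend box
         lg$rect$top + 0.5*strheight(expression(kg/km^2)), # half a string-height above it
         expression(kg/km^2), xpd=NA)                      # xpd=NA: do not clip the title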
[R] out.format for chron
Dear R users,

Do you know of a way to specify an out.format for a chron object that would use numbers for months and yet 4 digits for the year? I have tried

    out.format=c("d-m-year")

(note the m instead of either mon or month) but got 27-Feb-1992.

Also, the help for chron tells us how to define an out.format when we create a chron object, but how can you change the out.format of an existing chron object?

Thanks in advance,

Denis
Re: [R] out.format for chron
Got it Gabor, thank you very much.

Denis

On 2006-10-19, at 10:38, Gabor Grothendieck wrote:

For your second question:

    x <- chron(1)
    x <- chron(x, out.format = ddmm)

using the ddmm from below.

On 10/19/06, Gabor Grothendieck [EMAIL PROTECTED] wrote:

As discussed in the Help Desk article in R News 4/1, the 2- vs 4-digit year length is controlled by the chron.year.abb option, e.g.

    options(chron.year.abb = FALSE)
    chron(20)

However, as also discussed there, it's not really recommended that you use this option, so try this instead:

    ddmm <- function(x)
      with(month.day.year(x), sprintf("%02.f-%02.f-%04.f", day, month, year))
    chron(20, out.format = ddmm)
[R] functionality of update in SAS
Dear list,

I've tried to search the archives but found nothing, although I may use the wrong wording in my searches. I've also double-checked the upData function in Hmisc, but it does something else. I'm wondering if one can update a dataframe by forcing into it a shorter dataframe containing the corrections, like the update provided in SAS data steps. In this simple example:

    a <- data.frame(id=c(1:5), x=rnorm(5))
    b <- data.frame(id=4, x=rnorm(1))
    > a
      id          x
    1  1  0.6557921
    2  2  0.1897523
    3  3  0.7976721
    4  4  0.2107103
    5  5 -0.8855786
    > b
      id         x
    1  4  0.8369147

I would like the updated dataframe to look like this (row names are not important to me):

      id          x
    1  1  0.6557921
    2  2  0.1897523
    3  3  0.7976721
    4  4  0.8369147
    5  5 -0.8855786

I thought this could be done with merge, but that never removes the old version of a row; it just gives me two rows with id==4. I thought of this solution:

    reject <- a$id %in% b$id
    a2 <- a[!reject, ]
    a3 <- rbind(a2, b)
    > a3
       id          x
    1   1  0.6557921
    2   2  0.1897523
    3   3  0.7976721
    5   5 -0.8855786
    11  4  0.8369147

This works, though it is obviously not the best way to make the correction in a simple case like this. But providing a few lines of corrected data can be an effective method with large dataframes, especially if many identifier (grouping) variables are needed to identify each line that needs updating, and in this context my solution above rapidly becomes ugly. Furthermore (but I can live with this constraint), this method removes entire rows, so I need to make sure the dataframe used to make corrections contains all the Y variables in the original dataframe, even those that do not need correcting. If a method exists to just change one variable in 5 lines of a dataframe of 5000 lines and 30 variables, I'd appreciate learning about it. But I'll already be thrilled if I can update whole lines at a time.

Sincerely,

Denis Chabot
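A hedged sketch of a SAS-style update in base R (mine, not from the list; update_df is a hypothetical name). It assumes the key uniquely identifies rows, and it only overwrites the variables actually present in the correction dataframe, so partial-row corrections like the 5-lines-one-variable case work too:

    update_df <- function(old, new, key = "id") {
      i <- match(new[[key]], old[[key]])  # row in 'old' targeted by each correction
      stopifnot(!any(is.na(i)))           # every correction must match an existing row
      vars <- setdiff(names(new), key)    # only overwrite variables supplied in 'new'
      old[i, vars] <- new[, vars]
      old
    }
    a <- update_df(a, b)                  # rows keep their place; no re-sorting needed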
Re: [R] merge gives me too many rows
Hi Phil and Don,

On 2006-09-18, at 05:02, Phil Spector wrote:

Denis - As long as there is one data frame that has exactly 0 or 1 observations for each level of the by variables, merge in R will operate identically to SAS without the need for any special options. The problem arises when there are multiple observations for some of the levels of the by variables in both of the data frames. In relational databases and R, new observations are created to represent every combination of observations in the original data set that share the same level of the by variables. In SAS, the last observation in the data set with fewer observations having a particular level of the by variable is repeated enough times to provide a match for each observation in the other data set. You can see this with a tiny example -- look at the observations in the output data set corresponding to id==3:

    a = data.frame(id=c(1,2,3,3,4), x=rnorm(5))
    b = data.frame(id=c(1,2,3,3,3,4), y=rnorm(6))
    c = merge(a, b)
    > dim(a)
    [1] 5 2
    > dim(b)
    [1] 6 2
    > dim(c)
    [1] 9 3
    > c
      id           x            y
    1  1 -2.15987259 -1.555878070
    2  2  0.96856842 -0.666941824
    3  3  1.68080409 -0.009239307
    4  3  0.06044914 -0.009239307
    5  3  1.68080409 -0.242538473
    6  3  0.06044914 -0.242538473
    7  3  1.68080409 -1.298344557
    8  3  0.06044914 -1.298344557
    9  4  1.76424928 -0.420144744

The important thing to remember is that there is no unambiguous way to merge two data sets when there are differing numbers of observations for a given level of the by variable in the two data sets -- SAS and R simply choose different strategies in this case.

- Phil Spector, Statistical Computing Facility, Department of Statistics, UC Berkeley

I knew this, but it is good to be reminded and forced to verify one's assumptions. Originally both pieces of information, about stomach contents and fish size, came from the same, larger, database. I made the 2 smaller databases to ease the memory footprint in R when not all variables were needed. The method I used, actually suggested to me on this very helpful list a long time back, ensured that all rows were unique:

    fish <- subset(esto, select=c(predateu, origin, navire, nbpc, no_rel, trait,
                                  tagno, longueur, masse))
    fish <- unique(fish)  # no need to sort first

Or at least it would if there were NO MISTAKES in my database to start with. It is this false feeling that all rows had to be unique that made me pull my hair last night and assume the problem was with merge. I just verified and found that some fish have 2 lengths or masses: either mistakes when entering them into the databases, or mistakes when numbering fishes (2 fish with the same number). I'll go back to the raw data to fix this! So thanks for making me verify this assumption.

On 2006-09-18, at 01:38, Don MacQueen wrote:

I think you may misunderstand the meaning of all.x = FALSE. Setting all.x to FALSE ensures that only rows of x that have matches in y will be included. Equivalently, if a row of x is not matched in y, it will not be in the output. However, if a row in x is matched by more than one row in y, then that row will be repeated as many times as there are matching rows in y. That is, you have a 1-to-many match (1 in x to many in y). SAS behaves the same way. Are you sure this is not what is happening? Also, all.x = FALSE is the default; it is not necessary to specify it. In fact, the default is to output only rows that are found in both x and y (matching on the specified variables, of course).

-Don

Thank you Don,

Looking at my old programs using merge, I always used all.x=TRUE.
Yesterday, because I was getting more rows than expected, I assumed that switch caused it and started setting it to FALSE (i.e. early on, when I got extra lines, I assumed that fish measured in the second dataframe but for which I did not have stomach contents were merged anyway). I never went back to all.x=TRUE after verifying it was not the cause of the problem; instead, some lines in the first database were duplicated (multiple matches, it turns out). But you are right. In fact I made some tests, and the behavior I want in this and most cases is all.x=TRUE but all.y=FALSE. That is, if I ever found a case where I had a line in the first dataframe with no match in the second, I'd want to keep that line in the final dataframe.

Again, many thanks,

Denis
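For readers hitting the same wall, a hedged sketch (mine, not from the thread) of how to list the offending duplicate keys before merging, using the column names from this thread:

    keys <- c("predateu", "origin", "navire", "nbpc", "no_rel", "trait", "tagno")
    k <- do.call(paste, fish[keys])           # one string per key combination
    dup <- fish[k %in% k[duplicated(k)], ]    # every row whose key occurs more than once
    dup[order(do.call(paste, dup[keys])), ]   # view the conflicting measurements together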
[R] merge gives me too many rows
Hi,

I am using merge to add some variables to an existing dataframe. I use the option all.x=FALSE so that my final dataframe will only have as many rows as the first file I name in the call to merge. With a large dataframe using a lot of by variables, the number of rows of the merged dataframe increases from 177325 to 179690:

    > dim(test)
    [1] 177325      9
    test2 <- merge(test, fish,
                   by=c("predateu", "origin", "navire", "nbpc", "no_rel", "trait", "tagno"),
                   all.x=FALSE)
    > dim(test2)
    [1] 179690     11

I tried to make a smaller dataset with R commands that I could post here so that other people could reproduce, but merge behaved as expected: the final number of rows was the same as the number of rows in the first file named in the call to merge. I took a subset of my large dataframe and could mail this to anyone interested in verifying the problem.

    test3 <- test[11:16, ]
    > dim(test3)
    [1] 6 9
    test4 <- merge(test3, fish,
                   by=c("predateu", "origin", "navire", "nbpc", "no_rel", "trait", "tagno"),
                   all.x=FALSE)
    > dim(test4)
    [1] 60043    11

I compared test3 and test4 line by line. The first 11419 lines were the same (except for added variables, obviously) in both dataframes, but then lines 11420 to 11423 were repeated in test4. Then no problem for a lot of rows, until rows 45756-45760 in test3. These are offset by 4 in test4 because of the first group of extraneous lines just reported, and are found on lines 45760 to 45765. But they are also repeated on lines 45765 to 45769. And so on a few more times. Thus merge added lines (repeated a small number of lines) to the final dataframe despite my use of all.x=FALSE. Am I doing something wrong? If not, is there a solution? Not being able to merge is a setback! I was attempting to move the last few things I was doing with SAS to R... Please let me know if you want the file test3 (2.3 MB as a csv file, but only 352 KB in R (.rda) format).

Sincerely,

Denis Chabot

    > R.Version()
    $platform
    [1] "powerpc-apple-darwin8.6.0"
    $arch
    [1] "powerpc"
    $os
    [1] "darwin8.6.0"
    $system
    [1] "powerpc, darwin8.6.0"
    $status
    [1] ""
    $major
    [1] "2"
    $minor
    [1] "3.1"
    $year
    [1] "2006"
    $month
    [1] "06"
    $day
    [1] "01"
    $`svn rev`
    [1] "38247"
    $language
    [1] "R"
    $version.string
    [1] "Version 2.3.1 (2006-06-01)"
Re: [R] reshaping a dataset
Thank you Gabor,

I'll need to explore the reshape package a bit to see what benefits I get compared with the basic reshape function, but I'm glad you made me aware of it. And your solution for fixing NAs just for the columns I want is just what I wanted. Many thanks,

Denis

On 2006-09-13, at 00:55, Gabor Grothendieck wrote:

I missed your second question, which was how to set the NAs to zero for some of the columns. Suppose we want to replace the NAs in columns ic, and for the sake of example suppose ic specifies columns 1 to 8:

    library(reshape)
    testm <- melt(test, id = 1:6)
    out <- cast(testm, nbpc + trip + set + tagno + depth ~ prey, sum)
    # fix up NAs
    ic <- 1:8
    out2 <- out[, ic]
    out2[is.na(out2)] <- 0
    out[, ic] <- out2

On 9/13/06, Gabor Grothendieck [EMAIL PROTECTED] wrote:

If I understand this correctly, we want to sum the mass over each combination of the first 6 variables and display the result with the 6th, prey, along the top and the others along the side.

    library(reshape)
    testm <- melt(test, id = 1:6)
    cast(testm, nbpc + trip + set + tagno + depth ~ prey)

Now fix up the NAs.
[R] reshaping a dataset
Hi,

I'm trying to move to R the last few data handling routines I was performing in SAS. I'm working on stomach content data. In the simplified example I provide below, there are variables describing the origin of each prey item (nbpc is a ship number, each ship may have been used on different trips, each trip has stations, and individual fish (tagno) can be caught at each station). For each stomach, the number of lines corresponds to the number of prey items. Thus a variable identifies prey type, and others (here only one, mass) provide information on prey abundance or size or digestion level. Finally, there can be accompanying variables that are not used but that I need to keep for later analyses (e.g. depth in the example below).

At some point I need to transform such a dataset into another format where each stomach occupies a single line, and there are columns for each prey item. The reshape function works really well (my program is in fact simpler than the SAS equivalent; not shown, don't want to bore you, but available on request), except that I need zeros when prey types are absent from a stomach instead of NAs, a problem for which I only have a shaky solution at the moment.

1) creation of a dummy dataset:

    nbpc <- rep(c(20,34), c(110,90))
    trip <- c(rep(1:3, c(40, 40, 30)), rep(1:2, c(60,30)))
    set <- c(rep(1:4, c(10, 8, 7, 15)), rep(c(10,12), c(25,15)), rep(1:3, rep(10,3)),
             rep(10:12, c(20, 10, 30)), rep(7:8, rep(15,2)))
    depth <- c(rep(c(100, 150, 200, 250), c(10, 8, 7, 15)), rep(c(100,120), c(25,15)),
               rep(c(75, 50, 200), rep(10,3)), rep(c(200, 150, 50), c(20, 10, 30)),
               rep(c(100, 250), rep(15,2)))
    tagno <- rep(round(runif(42,1,200)), c(7,3, 4,4, 2,2,3, 5,5,5, 4,6,4,3,5,3, 7,8,
                 4,6, 5,5, 7,3, 6,6,4,4, 4,6, 3,3,4,5,5,6,4, 5,5,5, 8,7))
    prey.codes <- c(187, 438, 792, 811)
    prey <- sample(prey.codes, 200, replace=TRUE)
    mass <- runif(200, 0, 10)
    test <- data.frame(nbpc, trip, set, depth, tagno, prey, mass)

Because there are often multiple occurrences of the same prey in a single stomach, I need to sum them for each stomach before using reshape. Here I use summaryBy because my understanding of the many variants of apply is not very good:

    library(doBy)  # for summaryBy
    test2 <- summaryBy(mass~nbpc+trip+set+tagno+prey, data=test, FUN=sum,
                       keep.names=TRUE, id=~depth)
    # this messes up the sorting order, I fix it
    k <- order(test2$nbpc, test2$trip, test2$set, test2$tagno)
    test3 <- test2[k, ]
    result <- reshape(test3, v.names="mass", idvar=c("nbpc", "trip", "set", "tagno"),
                      timevar="prey", direction="wide")

I'm quite happy with this, although you may know of better ways of doing it. But my problem is with preys that are absent from a stomach. In later analyses, I need them to have zero abundance instead of NA. My shaky solution is:

    empties <- is.na(result)
    result[empties] <- 0

which did the job in this example, but it won't always. For instance, there could have been NAs for depth, which I do not want to become zero. Is there a way to transform NAs into zeros for multiple columns of a dataframe in one step, while ignoring some columns? Or maybe there is another way to achieve this that would have put zeros where I need them (i.e. something else than reshape)?

Thanking you in advance,

Denis Chabot
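A hedged base-R answer to the NA-to-zero question (my sketch; it assumes the column names reshape() produces here, i.e. mass.187, mass.438, ...): restrict the replacement to the prey columns by name, leaving depth and the identifiers untouched.

    prey.cols <- grep("^mass\\.", names(result))            # only the per-prey columns
    result[prey.cols][is.na(result[prey.cols])] <- 0        # zero-fill just those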
Re: [R] transparent background for PDF
As far as I know, PowerPoint does not import PDFs. Without telling you, it converts the PDF to a PNG, and it is not very smart about making a high-resolution PNG at that. Maybe it is also not very clever about taking the transparent background into account? When I must use PowerPoint, I either convert the PDF to PNG myself, or prepare the presentation in Keynote and export to PowerPoint format. This also translates the PDF figures into PNG figures, but at much better resolution than what PowerPoint does.

I tried making both PDF and PNG from my R program. I had hoped to use the same settings for both devices (just copy and paste the plot instructions into a call for each device); unfortunately it did not work, because to do better than PowerPoint's translation I had to save the PNG version of my plots at high resolution, and that meant that all of the plot adjustments I made so that the PDF version looked nice (position of text, size of margins, etc.) had to be fine-tuned differently for the PNG version, which was too time consuming...

Denis

On 2006-03-25, at 06:00, [EMAIL PROTECTED] wrote:

From: Dennis Fisher [EMAIL PROTECTED]
Date: 24 March 2006 13:30:24 EST
To: r-help@stat.math.ethz.ch
Subject: [R] transparent background for PDF

Colleagues,

Running R 2.2.1 on either a Linux (RedHat 9) or Mac (10.4) platform. I created a PDF document using

    pdf("FILENAME.pdf", bg="transparent", version="1.4")

I then imported the graphic into PowerPoint; the background was set to a non-transparent color. I was hoping that the inserted graphic would be transparent; instead, it had a white background. The help pages indicate:

The 'version' argument modifies the sort of PDF code that gets produced. At the moment this only concerns the production of transparent output. The version must be greater than 1.4 for transparent output to be produced. Specifying a lower version number may be useful if you want to produce PDF output that can be viewed on older PDF viewers.

The archives reveal the following:

From: Benno Pütz
Date: Thu Nov 4 14:20:28 2004

Hello, I have run across the following problem: creating PDF files manually by using pdf(version="1.4") I can make graphs using the new transparency feature of R 2.0. Note that I used version 1.4. Attempts to use larger numbers (1.5, 2.0, etc.) resulted in error messages:

    Error in pdf("FILENAME.pdf", bg = "transparent", : invalid PDF version

I also attempted to address this with par(bg="transparent"), without success. Does anybody have any ideas?

Dennis

Dennis Fisher MD, P< (The P Less Than Company), Phone: 1-866-PLessThan (1-866-753-7784), Fax: 1-415-564-2220, www.PLessThan.com
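On the pain point of re-tuning every plot for the PNG device, a hedged sketch (mine, assuming a reasonably recent R where png() accepts units and res): with units="in" and a resolution, the PNG device uses the same inch-based geometry as the PDF device, so the same par()/cex settings can often be reused as-is. make_plot is a hypothetical stand-in for the real figure code.

    make_plot <- function() plot(1:10, main = "example")   # stand-in for the real figure

    pdf("fig.pdf", width = 5, height = 4, bg = "transparent")
    make_plot()
    dev.off()

    png("fig.png", width = 5, height = 4, units = "in", res = 300, bg = "transparent")
    make_plot()  # same inch geometry, so margins and text sizes match the PDF
    dev.off()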
[R] legend in bubble plots made with symbols()
Hi,

I have read about the use of symbols() to draw circles of different sizes, but I have not been able to find out how to add a legend to such a graph, a legend that would display some specific sizes and their meaning. Before finding the symbols function in Paul Murrell's book, I had rolled my own function where the variable I want to use to control circle size was actually used to control cex, and I was able to draw a legend afterward. symbols() seems a bit simpler, and I wanted to see if it would be better than my own function. But without a legend it is less useful. However, I'm sure there is a way, which I'm not aware of, to draw a legend for a plot drawn with symbols()...

Thanks in advance,

Denis Chabot
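A hedged sketch (mine, not from the thread): symbols() has no legend argument, but if the circles are drawn in user units (inches = FALSE), reference circles with known values can be placed anywhere on the plot as a hand-made legend, using exactly the same size scaling as the data.

    x <- runif(30); y <- runif(30); v <- runif(30, 1, 100)   # toy data
    r <- sqrt(v / pi) / 50                   # radius in user units, area proportional to v
    plot(x, y, type = "n", xlim = c(0, 1.3)) # leave room at the right for the legend
    symbols(x, y, circles = r, inches = FALSE, add = TRUE)
    ref <- c(10, 50, 100)                    # values displayed in the legend
    ref.y <- c(0.95, 0.85, 0.75)
    symbols(rep(1.15, 3), ref.y, circles = sqrt(ref / pi) / 50,
            inches = FALSE, add = TRUE)      # reference circles, same scaling as the data
    text(rep(1.25, 3), ref.y, ref)           # label each reference circle with its value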
[R] predicted values in mgcv gam
Hi,

In fitting GAMs to assess environmental preferences, I use the part of the fit where the lower confidence interval is above zero as my criterion for a positive association between the environmental variable and species abundance. However, I like to plot this on the original scale of species abundance. To do so I extract the fit and SE using predict.gam. Lately I compared more carefully the plots I obtain in this way and those obtained with plot.gam, and noticed differences which I do not understand. To avoid sending a large dataset, I took an example from the gam help to illustrate this. Was I wrong to believe that the fit and its confidence band should behave the same way on both scales?

Thanks in advance,

Denis Chabot

    library(mgcv)
    set.seed(0)
    n <- 400
    sig <- 2
    x0 <- runif(n, 0, 1)
    x1 <- runif(n, 0, 1)
    x2 <- runif(n, 0, 1)
    x3 <- runif(n, 0, 1)
    f0 <- function(x) 2 * sin(pi * x)
    f1 <- function(x) exp(2 * x)
    f2 <- function(x) 0.2*x^11*(10*(1-x))^6 + 10*(10*x)^3*(1-x)^10
    f <- f0(x0) + f1(x1) + f2(x2)
    g <- exp(f/4)
    y <- rpois(rep(1,n), g)
    mean.y <- mean(y)

    gam2 <- gam(y ~ s(x2), poisson)

    # to plot on the response scale
    val.for.pred <- data.frame(x2=seq(min(x2), max(x2), length.out=100))
    pred.2.resp <- predict.gam(gam2, val.for.pred, type="response", se.fit=TRUE)
    lower.band <- pred.2.resp$fit - 2*pred.2.resp$se.fit
    upper.band <- pred.2.resp$fit + 2*pred.2.resp$se.fit
    pred.2.resp <- data.frame(val.for.pred, pred.2.resp, lower.band, upper.band)

    # same thing on the term scale
    pred.2.term <- predict.gam(gam2, val.for.pred, type="terms", se.fit=TRUE)
    lower.band <- pred.2.term$fit - 2*pred.2.term$se.fit
    upper.band <- pred.2.term$fit + 2*pred.2.term$se.fit
    pred.2.term <- data.frame(val.for.pred, pred.2.term, lower.band, upper.band)

    # it is easier to compare two plots instead of looking at these two data frames
    plot(gam2, residuals=TRUE, pch=1, cex=0.7)
    abline(h=0)

    plot(y ~ x2, col=grey(0.5))
    lines(fit ~ x2, col="blue", data=pred.2.resp)
    lines(lower.band ~ x2, col="red", lty=2, data=pred.2.resp)
    lines(upper.band ~ x2, col="red", lty=2, data=pred.2.resp)
    abline(h=mean.y, lty=5, col=grey(0.35))
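For what it's worth, the usual explanation (my note, hedged; not part of the original post): plot.gam builds its band as fit +/- 2*SE on the link scale, while type="response" returns SEs already mapped to the response scale, so fit +/- 2*SE there is not the back-transformed link-scale band. Computing the band on the link scale and then applying the inverse link reproduces the plot.gam shape on the response scale:

    pred.link <- predict(gam2, val.for.pred, type = "link", se.fit = TRUE)
    inv <- gam2$family$linkinv   # exp() for the poisson log link
    plot(y ~ x2, col = grey(0.5))
    lines(val.for.pred$x2, inv(pred.link$fit), col = "blue")
    lines(val.for.pred$x2, inv(pred.link$fit - 2*pred.link$se.fit), col = "red", lty = 2)
    lines(val.for.pred$x2, inv(pred.link$fit + 2*pred.link$se.fit), col = "red", lty = 2)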
[R] transforming data frame for use with persp
Hi,

This is probably documented, but I cannot find the right words or expression for a search; my attempts failed. I have a data frame of 3 vectors (x, y and z) and would like to transform this so that I could use persp. Presently I have y-level copies of each x level, and a z value for each x-y pair. I need 2 columns giving the possible levels of x and y, and then a transformation of z from a long vector into a matrix of x-level rows and y-level columns. How do I accomplish this?

In this example, I made a set of x and y values to get predictions from a GAM, then combined them with the predictions into a data frame. This is the one I'd like to transform as described above:

    My.data <- expand.grid(Depth=seq(40, 220, 20), Temp=seq(-1, 6, 0.5))
    predgam <- predict.gam(dxt.gam, My.data, type="response")
    pred.data <- data.frame(My.data, predgam)

pred.data has 150 lines and 3 columns.

Thanks for your help,

Denis Chabot
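A minimal sketch of the reshaping (mine, not from a reply): expand.grid varies its first factor fastest, so the long z vector folds column by column into a Depth-by-Temp matrix that persp() accepts directly.

    x <- seq(40, 220, 20)    # the unique Depth values (rows): 10 of them
    y <- seq(-1, 6, 0.5)     # the unique Temp values (columns): 15 of them
    z <- matrix(pred.data$predgam, nrow = length(x), ncol = length(y))  # 10 x 15 = 150
    persp(x, y, z, xlab = "Depth", ylab = "Temp", zlab = "predicted")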
[R] plotting lines that break if data break
Hi,

Sometimes data series (not necessarily time series) suffer breaks where data were expected but not collected. Often the regular lines command to add such data to a plot is what I want, but other times I'd like the line to break where the data series is interrupted, instead of the line jumping to the next point in the series uninterrupted. Usually my data files contain one value of x but none of y; alternatively, a break could also appear as an NA value for both x and y. I have found I could use the segments command instead of lines, but this seems to require that I manipulate my data (which may or may not contain breaks, and the number of breaks can vary if there are breaks at all). Is there another command that works like lines but will break the line if the data series suffers an interruption?

Sincerely,

Denis Chabot
Re: [R] plotting lines that break if data break
I'm sorry, too long without coffee. Lately I have done numerous plots with superimposed non-linear quantile regression fits, and those do not break (which is fine), and I wrongly transposed this to lines in my head.

Denis

On 2006-02-07, at 22:06, Gabor Grothendieck wrote:

It does break already. Try this:

    plot(1:10, c(1:5, NA, 7:10), type = "l")
    plot(c(1:5, NA, 7:10), 1:10, type = "l")
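That covers rows where y is already NA. For the other case raised in the question, where the break rows are missing from the file entirely, a hedged sketch (mine; broken_lines is a hypothetical helper) that inserts NAs wherever consecutive x values jump by more than the expected step, then draws on an existing plot:

    broken_lines <- function(x, y, step = min(diff(sort(unique(x))))) {
      o <- order(x); x <- x[o]; y <- y[o]
      gaps <- which(diff(x) > 1.5 * step)  # positions after which a gap occurs
      for (g in rev(gaps)) {               # insert from the end so indices stay valid
        x <- append(x, NA, after = g)
        y <- append(y, NA, after = g)
      }
      lines(x, y)                          # lines() breaks at each inserted NA
    }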
[R] how to extract predicted values from a quantreg fit?
Hi,

I have used package quantreg to estimate a non-linear fit to the lowest part of my data points. It works great, by the way. But I'd like to extract the predicted values. The help for predict.qss1 indicates this:

    predict.qss1(object, newdata, ...)

and states that newdata is a data frame describing the observations at which prediction is to be made. I used the same technique I use to extract predicted values of GAM fits with the mgcv package: if I fit a GAM using x as the x variable on which I smooth, I then create such a dataframe like this:

    MyX <- data.frame(x=seq(-1, 60))

This works fine with GAM (mgcv) but not with quantreg:

    y <- rnorm(500, 10, 5)
    x <- rep(seq(1, 50, 1), 10)
    My.data <- data.frame(x, y)
    My.x <- data.frame(x=seq(5, 45))
    fit <- rqss(y ~ qss(x, lambda=5), tau=0.05)
    pred <- predict.qss1(fit, My.x)

Could someone please help me create a dataframe newdata that would satisfy predict.qss1?

Thanks in advance,

Denis Chabot
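A hedged sketch (mine, assuming a quantreg version that provides a predict method for rqss objects): predict from the fitted rqss object rather than the qss term, and keep newdata strictly inside the range of the observed x, since the fitted spline cannot extrapolate.

    library(quantreg)
    y <- rnorm(500, 10, 5)
    x <- rep(seq(1, 50, 1), 10)
    fit <- rqss(y ~ qss(x, lambda = 5), tau = 0.05)
    My.x <- data.frame(x = seq(5, 45, by = 0.5))  # strictly inside [1, 50]
    pred <- predict(fit, newdata = My.x)          # dispatches to the rqss method
    plot(x, y, col = grey(0.5))
    lines(My.x$x, pred, col = "red")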
[R] how to control ticks in plots with yasp or xasp
Hi,

A few times I tried to control the number and position of tick marks in plots with the yasp or xasp parameters. For example, a y axis was drawn by default with tick marks at 0, 20, 40, 80 and 100. I tried to get tick marks every 10 by adding yasp=(0, 100, 10), but this had no effect at all. I know I can draw the axis and tick marks manually, but often this simple option would suffice, if I could understand how to make it work.

Thanks in advance,

Denis Chabot
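(As the replies below establish, the parameters are actually xaxp/yaxp, and the reliable route is an explicit axis call. A minimal sketch:)

    plot(1:100, yaxt = "n")             # suppress the default y axis
    axis(2, at = seq(0, 100, by = 10))  # ticks and labels every 10 units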
Re: [R] how to control ticks in plots with yasp or xasp
Hi,

Sorry about the bad syntax, though the right syntax would not have worked either, according to your tests (Marc, Brian, Peter). Anyway, it is too finicky; I will draw them myself. For instance,

    plot(1:100, xaxt="n")
    par(xaxp=c(0, 100, 10))  # the value is reset at each plot
    axis(1)

placed tick marks at 0, 10, ..., 100, as expected, but did not place a label under 100...

Thanks for the answers, useful as always,

Denis

On 2005-10-08, at 11:58, Marc Schwartz wrote:

On Sat, 2005-10-08 at 16:37 +0100, Prof Brian Ripley wrote:

On Sat, 8 Oct 2005, Marc Schwartz wrote:

I suspect that one problem you are having is that there is no par(xasp) or par(yasp), unless these are typos and you are trying to use par(xaxp) and par(yaxp)? In any case, (0, 100, 10) is invalid syntax, and c(0, 100, 10) is needed.

Indeed. There is an 'asp' argument to some of the plot functions (i.e. plot.default), but this has a different intention.

par(xaxp) and par(yaxp) are not listed as read-only pars in ?par; however, I cannot recall an instance where R does not overwrite the user settings during the calculation of the axes, whether passed as arguments to a plot function or set a priori via a par() call.

Really? Try

    plot(1:100, xaxt="n")
    par(xaxp=c(0, 50, 5))  # the value is reset at each plot
    axis(1)

for how it works (but not inline, which is probably a bug).

Ah, I had not thought about that 'par'ticular combination... ;-) Hence, not R.O. So it must be used _after_ a plot call, which makes sense. I had reached for my copy of Paul's book, and on page 96 (last paragraph in section 3.4.5 on Axes) he suggests using the approach I elucidate below with axTicks(). I thought he might have some other ideas and that I was missing something. This is also referenced on page 70, third paragraph in section 3.2.5 on Axes.

If you want explicit control over the tick marks, you will need to use axis(), perhaps in combination with axTicks(), after using 'xaxt = "n"' and/or 'yaxt = "n"' in the plot call, depending upon the circumstances.

That is usually as easy.

Agreed.

Thanks,

Marc
Re: [R] how to control ticks in plots with yasp or xasp
Hi Brian,

On 2005-10-08, at 13:21, Prof Brian Ripley wrote:

On Sat, 8 Oct 2005, Denis Chabot wrote:

Hi, sorry about the bad syntax, though the right syntax would not have worked either, according to your tests (Marc, Brian, Peter).

It DOES work according to my tests! (Do give us the credit for testing our advice: we would appreciate your showing equal care.)

I am very sorry, I did not express myself properly. I do appreciate your help, and I did take the time to test your solution. What I meant was that even if I had used the proper syntax, i.e. yaxp=c(0, 100, 10) instead of the yasp=(0, 100, 10) I posted [actually in my program I had used it, but I was careless when I composed the message], it would not have worked, because you demonstrated that it has to be used within a par statement, which I had not done. Therefore my attempt was doomed. I did not mean there was no way to make yaxp work, so I apologize if you think I take your advice lightly, and I cannot state strongly enough that it is not the case at all.

Anyway it is too finicky, I will draw them myself. For instance,

    plot(1:100, xaxt="n")
    par(xaxp=c(0, 100, 10))  # the value is reset at each plot
    axis(1)

placed tick marks at 0, 10, ..., 100, as expected, but did not place a label under 100...

Does for me (2.2.0, Windows and X11 on Linux). Might the text be clipped on your unnamed device, so you need to set xpd? (If so, it will be clipped however you try to do this, and is an unrelated local problem.)

This is strange. I work on Mac OS X, so the default device is quartz. Label 100 is shown without clipping if I plot without changing xaxp. But just in case, I made more space at the right of the graph with

    par(mai=c(0.7, 0.7, 0.5, 0.5))
    plot(1:100, xaxt="n")
    par(xaxp=c(0, 100, 10))  # the value is reset at each plot
    axis(1)

and the 100 does not appear. If I run this again:

    plot(1:100, xaxt="n")
    axis(1)

I'm back with R's default axis and the 100 is present. Anyone on a Mac wants to try?

Denis

--
Brian D. Ripley
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA)
1 South Parks Road, Oxford OX1 3TG, UK, Fax: +44 1865 272595
[R] testing non-linear component in mgcv:gam
Hi,

I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv.

I wonder if this procedure is correct, because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0, so I could have stopped there. However, I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm that the non-parametric part of a smoother is significant?

Here is my example case, where the relationship does appear to be linear:

    > library(mgcv)
    This is mgcv 1.3-7
    Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88,
              1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62,
              3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38,
              6.62, 6.88, 7.12, 8.38, 13.62)
    N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26,
                24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1,
                1, 1, 1)
    wm.sed <- c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949,
                0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628,
                0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837,
                0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450,
                0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478,
                0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200,
                0.16700, 0.0, 0.0, 0.0)

    sed.gam <- gam(wm.sed ~ s(Temp), weights=N.sets)
    > summary.gam(sed.gam)

    Family: gaussian
    Link function: identity

    Formula: wm.sed ~ s(Temp)

    Parametric coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.08403    0.01347   6.241 3.73e-07 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Approximate significance of smooth terms:
            edf Est.rank     F  p-value
    s(Temp)   1        1 13.95 0.000666 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    R-sq.(adj) = 0.554   Deviance explained = 28.5%
    GCV score = 0.09904   Scale est. = 0.093686   n = 37

    # testing the non-linear contribution
    sed.lin <- gam(wm.sed ~ Temp, weights=N.sets)
    > summary.gam(sed.lin)

    Family: gaussian
    Link function: identity

    Formula: wm.sed ~ Temp

    Parametric coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.162879   0.019847   8.207 1.14e-09 ***
    Temp        -0.023792   0.006369  -3.736 0.000666 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    R-sq.(adj) = 0.554   Deviance explained = 28.5%
    GCV score = 0.09904   Scale est. = 0.093686   n = 37

    > anova.gam(sed.lin, sed.gam, test="F")
    Analysis of Deviance Table

    Model 1: wm.sed ~ Temp
    Model 2: wm.sed ~ s(Temp)
      Resid. Df Resid. Dev         Df  Deviance      F   Pr(>F)
    1 3.5000e+01      3.279
    2 3.5000e+01      3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 ***

Thanks in advance,

Denis Chabot
Re: [R] testing non-linear component in mgcv:gam
Fair enough, Andy. I thought I was getting both predictive ability and confirmation that the phenomenon I was studying was not linear. I have two projects: in one, prediction is the goal and I don't really need to test linearity. In the second, I needed to confirm a cycle was taking place, and I thought my procedure did this. There was no theoretical reason to anticipate the shape of the cycle, so GAM was an appealing methodology.

Denis

On 2005-10-05, at 09:38, Liaw, Andy wrote:

I think you probably should state more clearly the goal of your analysis. In such a situation, estimation and hypothesis testing are quite different. The procedure that gives you the `best' estimate is not necessarily the `best' for testing linearity of components. If your goal is estimation/prediction, why test linearity of components? If the goal is testing linearity, then you probably need to find the procedure that gives you a good test, rather than relying on what gam() gives you.

Just my $0.02...

Andy
Re: [R] testing non-linear component in mgcv:gam
Hi John,

On 2005-10-05, at 09:45, John Fox wrote:

Dear Denis,

Take a closer look at the anova table: the models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively.

I hope this helps,

John

This is one of my difficulties. In some examples I found on the web, the difference in deviance is compared directly against the chi-squared distribution. But my y variable has a very small range (between 0 and 0.5, most of the time), so the difference in deviance is always very small, and if I compared it against the chi-squared distribution as I have seen done in examples, the non-linear component would never be significant. Yet it is (with one exception), tested with both mgcv:gam and gam:gam. I think the examples I have read were wrong in this regard; the scale factor seen in the mgcv output seems to intervene. But exactly how is still mysterious to me, and I hesitate to judge the size of the deviance difference myself. I agree it is near zero in my example. I guess I need more experience with these models to better interpret the output...

Denis
Re: [R] testing non-linear component in mgcv:gam
Thank you everyone for your help, but my introduction to GAM is turning my brain to mush. I thought the one part of the output I understood best was r-sq (adj), but now even this is becoming foggy. In my original message I mentioned a gam fit that turns out to be a linear fit. Out of curiosity, I analysed it with a linear predictor only with the mgcv package, and then as a linear model. The output was identical in both, but the r-sq (adj) was 0.55 in mgcv and 0.26 in lm. In lm, I hope that my interpretation that 26% of the variance in y is explained by the linear relationship with x is valid. Then what does r2 mean in mgcv?

Denis

    > summary.gam(lin)

    Family: gaussian
    Link function: identity

    Formula: wm.sed ~ Temp

    Parametric coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.162879   0.019847   8.207 1.14e-09 ***
    Temp        -0.023792   0.006369  -3.736 0.000666 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    R-sq.(adj) = 0.554   Deviance explained = 28.5%
    GCV score = 0.09904   Scale est. = 0.093686   n = 37

    > summary(sed.true.lin)

    Call:
    lm(formula = wm.sed ~ Temp, weights = N.sets)

    Residuals:
        Min      1Q  Median      3Q     Max
    -0.6138 -0.1312 -0.0325  0.1089  1.1449

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.162879   0.019847   8.207 1.14e-09 ***
    Temp        -0.023792   0.006369  -3.736 0.000666 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3061 on 35 degrees of freedom
    Multiple R-Squared: 0.285,   Adjusted R-squared: 0.2646
    F-statistic: 13.95 on 1 and 35 DF, p-value: 0.000666

On 2005-10-05, at 09:45, John Fox wrote:

Dear Denis,

Take a closer look at the anova table: the models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively.

I hope this helps,

John

John Fox, Department of Sociology, McMaster University, Hamilton, Ontario, Canada L8S 4M4, 905-525-9140 x23604, http://socserv.mcmaster.ca/jfox
Here is my example case where the relationship does appear to be linear:

library(mgcv)
This is mgcv 1.3-7

Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62,
          0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12,
          3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62,
          5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62)
N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22,
            26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5,
            3, 1, 1, 1, 1, 1)
wm.sed <- c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949,
            0.054757925, 0.142810440, 0.168005168, 0.180804428,
            0.111439628, 0.128799505, 0.193707937, 0.105921610,
            0.103497845, 0.028591837, 0.217894389, 0.020535469,
            0.080389068, 0.105234450, 0.070213450, 0.050771363,
            0.042074434, 0.102348837, 0.049748344, 0.019100478,
            0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0,
            0.22200, 0.16700, 0.0, 0.0, 0.0)

sed.gam <- gam(wm.sed ~ s(Temp), weight = N.sets)
summary.gam(sed.gam)

Family: gaussian
Link function: identity

Formula: wm.sed ~ s(Temp)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.08403    0.01347   6.241 3.73e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Est.rank F p-value
Re: [R] p-level in packages mgcv and gam
Thank you very much Henric, now I see! Denis

On 05-10-02 at 11:47, Henric Nilsson wrote: This test concerns only the non-linear part of the term s(Number, 3). In order to simultaneously test both the linear and non-linear parts, as mgcv::summary.gam does, you'd compare against the intercept-only model:

kyp1.1 <- gam(Kyphosis ~ 1, family = binomial, data = kyphosis)
anova(kyp1.1, kyp1, test = "Chisq")

Analysis of Deviance Table

Model 1: Kyphosis ~ 1
Model 2: Kyphosis ~ s(Number, 3)
  Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
1    80.000     83.234
2    76.9999    71.997 3.0001   11.237     0.011

HTH, Henric

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] p-level in packages mgcv and gam
(Intercept)   7.9150     0.1049   75.44   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Est.rank      F  p-value
s(x0) 5.173    9.000  3.785 0.000137 ***
s(x1) 2.357    9.000 34.631  < 2e-16 ***
s(x2) 8.517    9.000 84.694  < 2e-16 ***
s(x3) 1.000    1.000  0.444 0.505797
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.726   Deviance explained = 73.7%
GCV score = 4.611   Scale est. = 4.4029   n = 400

If we assume that the scale is known and fixed at 4.4029, we get

summary(b, dispersion = 4.4029)

Family: gaussian
Link function: identity

Formula: y ~ s(x0) + s(x1) + s(x2) + s(x3)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   7.9150     0.1049   75.44   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Est.rank  Chi.sq p-value
s(x0) 5.173    9.000  34.067 8.7e-05 ***
s(x1) 2.357    9.000 311.679 < 2e-16 ***
s(x2) 8.517    9.000 762.255 < 2e-16 ***
s(x3) 1.000    1.000   0.444   0.505
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.726   Deviance explained = 73.7%
GCV score = 4.611   Scale est. = 4.4029   n = 400

Note that t/F changed into z/Chi.sq.

So as far as I'm concerned, I use the GCV method and fix a 5% threshold on the null hypothesis (p-value) to select significant predictors. Afterwards I look at my smooths, and if the parametrization looks fine to me, I validate. Generally, for p-values smaller than 0.001 you can be confident; over 0.001 you have to check.

2) For the difference between packages gam and mgcv, I sent a mail about this. The underlying algorithms are very different. HTH, Henric

From: Liaw, Andy [EMAIL PROTECTED] Date: September 28, 2005 14:01:25 EDT To: 'Denis Chabot' [EMAIL PROTECTED], Peter Dalgaard [EMAIL PROTECTED] Cc: Thomas Lumley [EMAIL PROTECTED], R list r-[EMAIL PROTECTED] Subject: RE: [R] p-level in packages mgcv and gam

Just change the df in what Thomas described to the degree of the polynomial, and everything he said still applies. Any good book on regression that covers polynomial regression ought to point this out. Andy

From: Denis Chabot But what about another analogy, that of polynomials? You may not be sure what degree of polynomial to use, and you have not decided before analysing your data. You fit different polynomials to your data, checking whether added degrees increase r2 sufficiently by doing F-tests. I thought it was the same thing with GAMs. You can fit a model with 4 df, and in some cases it is of interest to see if this is a better fit than a linear fit. But why can't you also check whether 7 df is better than 4 df? And if you used mgcv first and it tells you that 7 df is better than 4 df, why bother repeating the comparison of 7 df against 4 df, why not just take the p-value for the model with 7 df (fixed)? Denis

From: Peter Dalgaard [EMAIL PROTECTED] Date: September 28, 2005 12:04:58 EDT To: Thomas Lumley [EMAIL PROTECTED] Cc: Denis Chabot [EMAIL PROTECTED], R list r-[EMAIL PROTECTED] Subject: Re: [R] p-level in packages mgcv and gam

Thomas Lumley [EMAIL PROTECTED] writes: Bob, on the other hand, chooses the amount of smoothing depending on the data. When a 4 df smooth fits best he ends up with the same model as Alice and the same p-value. When some other df fits best he ends up with a different model and a *smaller* p-value than Alice.

This doesn't actually follow, unless the p-value (directly or indirectly) found its way into the definition of best fit. It does show the danger, though.

-- Peter Dalgaard, Dept. of Biostatistics, University of Copenhagen, Øster Farimagsgade 5, Entr. B, PO Box 2099, 1014 Cph. K, Denmark. Ph: (+45) 35327918, FAX: (+45) 35327907
From: Thomas Lumley [EMAIL PROTECTED] Date: September 26, 2005 12:54:43 EDT To: Denis Chabot [EMAIL PROTECTED] Cc: r-help@stat.math.ethz.ch Subject: Re: [R] p-level in packages mgcv and gam

On Mon, 26 Sep 2005, Denis Chabot wrote: But the mgcv manual warns that the p-level for the smooth can be underestimated when the df are estimated by the model. Most of the time my p-levels are so small that even doubling them would not bring them close to the P = 0.05 threshold, but I have one case with P = 0.033. I thought, probably naively, that running a second model with fixed df, using the value of df found in the first model, would solve this. I could not achieve it with mgcv: its gam function does not seem to accept fractional values of df (in my case 8.377).

No, this won't work. The problem is the usual one with model selection: the p-value is calculated as if the df had been fixed, when really it was estimated. It is likely to be quite hard to get an honest p-value out of something that does adaptive smoothing.
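To reproduce the kind of output Henric shows, a minimal sketch using mgcv's built-in simulator (the seed and the object name b are my own choices, not from the thread):

  library(mgcv)
  set.seed(0)
  dat <- gamSim(1, n = 400, dist = "normal", scale = 2)  # mgcv's standard test data
  b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
  summary(b)                       # scale estimated from the data: t/F statistics
  summary(b, dispersion = b$sig2)  # scale treated as known: z/Chi.sq statistics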
Re: [R] p-level in packages mgcv and gam
I only got one reply to my message: No, this won't work. The problem is the usual one with model selection: the p-value is calculated as if the df had been fixed, when really it was estimated. It is likely to be quite hard to get an honest p-value out of something that does adaptive smoothing. -thomas

I do not understand this: it seems that a lot of people choose df = 4 for no particular reason, yet their p-levels are correct, but if I instead choose df = 8 because a previous model estimated this to be the optimal df, the p-levels are no good because the df were estimated?

Furthermore, shouldn't packages gam and mgcv give similar results when the same data and df are used? I tried this:

library(gam)
data(kyphosis)
kyp1 <- gam(Kyphosis ~ s(Age, 4), family = binomial, data = kyphosis)
kyp2 <- gam(Kyphosis ~ s(Number, 4), family = binomial, data = kyphosis)
kyp3 <- gam(Kyphosis ~ s(Start, 4), family = binomial, data = kyphosis)
anova.gam(kyp1)
anova.gam(kyp2)
anova.gam(kyp3)
detach(package:gam)

library(mgcv)
kyp4 <- gam(Kyphosis ~ s(Age, k = 4, fx = TRUE), family = binomial, data = kyphosis)
kyp5 <- gam(Kyphosis ~ s(Number, k = 4, fx = TRUE), family = binomial, data = kyphosis)
kyp6 <- gam(Kyphosis ~ s(Start, k = 4, fx = TRUE), family = binomial, data = kyphosis)
anova.gam(kyp4)
anova.gam(kyp5)
anova.gam(kyp6)

P-levels for these models, by pair:
kyp1 vs kyp4: p = 0.083 and 0.068 respectively (not too bad)
kyp2 vs kyp5: p = 0.445 and 0.03 (wow!)
kyp3 vs kyp6: p = 0.053 and 0.008 (wow again)

Also, if you plot all these you find that the mgcv plots are smoother than the gam plots, even though the same df are used throughout. I am really confused now! Denis

Begin forwarded message: From: Denis Chabot [EMAIL PROTECTED] Date: September 26, 2005 12:25:04 EDT To: r-help@stat.math.ethz.ch Subject: p-level in packages mgcv and gam

Hi, I am fairly new to GAM and started using package mgcv. I like the fact that optimal smoothing is used automatically (i.e. the df are not determined a priori but calculated by the gam procedure). But the mgcv manual warns that the p-level for the smooth can be underestimated when the df are estimated by the model. Most of the time my p-levels are so small that even doubling them would not bring them close to the P = 0.05 threshold, but I have one case with P = 0.033. I thought, probably naively, that running a second model with fixed df, using the value of df found in the first model, would solve this. I could not achieve it with mgcv: its gam function does not seem to accept fractional values of df (in my case 8.377). So I used the gam package and fixed the df to 8.377. The P-value I obtained was slightly larger than with mgcv (0.03655 instead of 0.03328), but it is still < 0.05. Was this a correct way to get around the underestimated P-level?

Furthermore, although the gam.check function of the mgcv package suggests to me that the gaussian family (and identity link) are adequate for my data, I must say the instructions in the R help for family and in Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models are too technical for me. If someone knows a reference that explains how to choose the model and link, i.e. what tests to run on your data before choosing, I would really appreciate it. Thanks in advance, Denis Chabot

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] p-level in packages mgcv and gam
Hi Yves,

On 05-09-28 at 11:05, Yves Magliulo wrote: Hi, I'll try to help you; I sent a mail about this subject last week... and I did not get any response...

Sorry, I did not see your message last week.

I'm using gam from package mgcv. 1) How to interpret the significance of smooth terms is hard for me to understand perfectly: using UBRE, you fix the df, and p-values are estimated from a chi-squared distribution; using GCV, the best df are estimated by gam (that's what I want) and p-values are estimated from an F distribution. But in that case they say "use at your own risk" in ?summary.gam. You can also look at the Chi.sq, but I don't know how to choose a criterion for it the way one does for p-values... for me, Chi.sq shows the best predictor in a model, but it's hard to reject one with it. So as far as I'm concerned, I use the GCV method and fix a 5% threshold on the null hypothesis (p-value) to select significant predictors. Afterwards I look at my smooths, and if the parametrization looks fine to me, I validate. Generally, for p-values smaller than 0.001 you can be confident; over 0.001 you have to check.

I think I follow you, but how do you validate? My fit goes very nicely through the middle of the data points and appears fine. In most cases p is way smaller than 0.001. I have one case that is bimodal in shape and noisier, and p is only 0.03. How do I validate it, how do I check?

2) For the difference between packages gam and mgcv, I sent a mail about this one year ago; here's the response: - Package gam is based very closely on the GAM approach presented in Hastie and Tibshirani's Generalized Additive Models book. Estimation is by back-fitting, and model selection is based on step-wise regression methods using approximate distributional results. A particular strength of this approach is that local regression smoothers (`lo()' terms) can be included in GAM models. - gam in package mgcv represents GAMs using penalized regression splines. Estimation is by direct penalized likelihood maximization, with integrated smoothness estimation via GCV or related criteria (there is also an alternative `gamm' function based on a mixed-model approach). Strengths of this approach are that s() terms can be functions of more than one variable, and that tensor product smooths are available via te() terms - these are useful when different degrees of smoothness are appropriate relative to different arguments of a smooth. (...) Basically, if you want integrated smoothness selection, an underlying parametric representation, or smooth interactions in your models, then mgcv is probably worth a try (but I would say that). If you want to use local regression smoothers and/or prefer the stepwise selection approach, then package gam is for you.

It is hard for me to evaluate explanations based on the algorithm used to fit the data, but it seems to me that the answer, in terms of significance of the smooth, should be at least very similar. Otherwise, what do you do when an author cites one package? Do you wonder whether the fit would have been significant using the other package?

I think the difference in p-values between gam and mgcv is because you don't have the same number of step iterations. mgcv's gam chooses the number of steps, and with gam's gam you have to choose it... hope it helps and someone gives us more details... Yves

Again, I can see that the p-values could differ a bit considering the differences between the two packages. But when the differences are huge and result in contradictory conclusions, I have a problem. Like you, I hope more help is forthcoming.
Denis __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] p-level in packages mgcv and gam
But what about another analogy, that of polynomials? You may not be sure what degree of polynomial to use, and you have not decided before analysing your data. You fit different polynomials to your data, checking whether added degrees increase r2 sufficiently by doing F-tests. I thought it was the same thing with GAMs. You can fit a model with 4 df, and in some cases it is of interest to see if this is a better fit than a linear fit. But why can't you also check whether 7 df is better than 4 df? And if you used mgcv first and it tells you that 7 df is better than 4 df, why bother repeating the comparison of 7 df against 4 df, why not just take the p-value for the model with 7 df (fixed)? Denis

On 05-09-28 at 12:04, Peter Dalgaard wrote:

Thomas Lumley [EMAIL PROTECTED] writes: Bob, on the other hand, chooses the amount of smoothing depending on the data. When a 4 df smooth fits best he ends up with the same model as Alice and the same p-value. When some other df fits best he ends up with a different model and a *smaller* p-value than Alice.

This doesn't actually follow, unless the p-value (directly or indirectly) found its way into the definition of best fit. It does show the danger, though.

-- Peter Dalgaard, Dept. of Biostatistics, University of Copenhagen, Øster Farimagsgade 5, Entr. B, PO Box 2099, 1014 Cph. K, Denmark. Ph: (+45) 35327918, FAX: (+45) 35327907

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
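The polynomial-selection procedure Denis describes, as a minimal sketch on simulated data (every name here is invented for illustration):

  set.seed(42)
  x <- runif(100, 0, 10)
  y <- sin(x) + rnorm(100, sd = 0.5)
  fit1 <- lm(y ~ x)           # linear
  fit2 <- lm(y ~ poly(x, 2))  # quadratic
  fit3 <- lm(y ~ poly(x, 3))  # cubic
  anova(fit1, fit2, fit3)     # sequential F-tests for the added degrees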
[R] p-level in packages mgcv and gam
Hi, I am fairly new to GAM and started using package mgcv. I like the fact that optimal smoothing is used automatically (i.e. the df are not determined a priori but calculated by the gam procedure). But the mgcv manual warns that the p-level for the smooth can be underestimated when the df are estimated by the model. Most of the time my p-levels are so small that even doubling them would not bring them close to the P = 0.05 threshold, but I have one case with P = 0.033. I thought, probably naively, that running a second model with fixed df, using the value of df found in the first model, would solve this. I could not achieve it with mgcv: its gam function does not seem to accept fractional values of df (in my case 8.377). So I used the gam package and fixed the df to 8.377. The P-value I obtained was slightly larger than with mgcv (0.03655 instead of 0.03328), but it is still < 0.05. Was this a correct way to get around the underestimated P-level? Furthermore, although the gam.check function of the mgcv package suggests to me that the gaussian family (and identity link) are adequate for my data, I must say the instructions in the R help for family and in Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models are too technical for me. If someone knows a reference that explains how to choose the model and link, i.e. what tests to run on your data before choosing, I would really appreciate it. Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] inconsistent decimal marker with write.table
Hi, My problem does not happen all the time, nor with all files I save to csv format. I can send my test file (format rda or csv) to whoever would like to replicate (and hopefully explain) my problem. In short, I have a dataset with mostly numerical variables. One of my variables is called pfi2 and is definitely numerical, as shown by this:

summary(test$pfi2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.00000 0.00000 0.01208 0.01500 0.08900

Yet, when I export to a text file asking for a comma as the decimal separator, I get a period as the separator for this variable, whereas all the other variables have the desired comma:

write.table(test, "out1.csv", sep = ";", dec = ",", row.names = FALSE, col.names = TRUE)
write.csv2(test, "out2.csv")

First 5 lines of out1.csv:

trait;tfi;sto_wt;pfi2;pfi_st;focc;sed;mob;date;latitude;longitud;hre_deb;temp_fon;strate;lat.km;long.km;temp_.25C
10;0,764;5,1;0.007;0,213;0,143;0,048;0,095;05/08/99;47,49;-58,9267;20:11;4,9;302;-167,7912;230,761439786593;4,88
106;0,566;3,3;0.089;4,762;0,25;0;0,25;14/08/99;49,84;-59,655;23:25;4,4;814;93,34084;168,052052432345;4,38
110;1,517;8,3;0.003;0,172;0,25;0;0;15/08/99;49,88667;-60,2167;04:57;0,3;833;98,526770403;127,674990442535;0,38
111;1,232;7,5;0.016;1,309;0,25;0;0;15/08/99;49,84;-60,1617;06:14;0,1;833;93,34084;131,739909589075;0,12

I'm sorry: I wanted to make this lighter by removing more variables from my original dataset, but the problem then disappeared, so I had to keep these variables to show you that the problem SOMETIMES happens. You'll notice that the 4th variable has periods for decimal markers. The write.csv2 command produced the same problem. With other files I sometimes get commas for all variables, and sometimes more than one variable with periods. It is frustrating. So let me know if you'd like the data file. Sincerely, Denis Chabot

platform powerpc-apple-darwin7.9.0
arch     powerpc
os       darwin7.9.0
system   powerpc, darwin7.9.0
status
major    2
minor    1.1
year     2005
month    06
day      20
language R

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
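As a diagnostic and brute-force workaround (my own suggestion, not a fix confirmed on the list), one can verify which columns are truly numeric -- dec= applies only to numeric columns -- and do the decimal-mark conversion by hand before writing:

  # check the storage class of every column
  sapply(test, class)

  # convert numeric columns to character with "," as the decimal mark ourselves
  out <- as.data.frame(lapply(test, function(col) {
    if (is.numeric(col)) sub("\\.", ",", format(col, trim = TRUE)) else col
  }))
  write.table(out, "out1.csv", sep = ";", quote = FALSE,
              row.names = FALSE, col.names = TRUE)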
Re: [R] grep help needed
Thanks for your help; the proposed solutions were much more elegant than what I was attempting. I adopted a slight modification of Tom Mulholland's solution with a piece from John Fox's solution, but many of you had very similar solutions.

require(maptools)
nc <- read.shape(system.file("shapes/sids.shp", package = "maptools")[1])
mappolys <- Map2poly(nc, as.character(nc$att.data$FIPSNO))
selected.shapes <- which(nc$att.data$SID74 > 20)  # just to make it a smaller example
submap <- subset(mappolys, nc$att.data$SID74 > 20)
final.data <- NULL
for (j in 1:length(selected.shapes)) {
  temp.verts <- matrix(as.vector(submap[[j]]), ncol = 2)
  n <- length(temp.verts[, 1])
  temp.order <- 1:n
  temp.data <- cbind(rep(j, n), temp.order, temp.verts)
  final.data <- rbind(final.data, temp.data)
}
colnames(final.data) <- c("PID", "POS", "X", "Y")
final.data
my.data <- as.data.frame(final.data)
class(my.data) <- c("PolySet", "data.frame")
attr(my.data, "projection") <- "LL"
meta <- nc[2]$att.data[selected.shapes, ]
PID <- seq(1, length(submap))
meta.data <- cbind(PID, meta)
class(meta.data) <- c("PolyData", "data.frame")
attr(meta.data, "projection") <- "LL"

It would be nice if a variant of this were incorporated into PBSmapping to make it easier to import data from shapefiles! Thanks again for your help, Denis Chabot

On 05-07-26 at 00:48, Mulholland, Tom wrote:

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Tuesday, 26 July 2005 10:46 AM To: R list Subject: [R] grep help needed

Hi, In another thread (PBSmapping and shapefiles) I asked for an easy way to read shapefiles and transform them into data that PBSmapping could use. One person is exploring some ways of doing this, but it is possible I'll have to do it manually. With package maptools I am able to extract the information I need from a shapefile, but it is formatted like this:

[[1]]
           [,1]     [,2]
 [1,] -55.99805 51.68817
 [2,] -56.00222 51.68911
 [3,] -56.01694 51.68911
 [4,] -56.03781 51.68606
 [5,] -56.04639 51.68759
 [6,] -56.04637 51.69445
 [7,] -56.03777 51.70207
 [8,] -56.02301 51.70892
 [9,] -56.01317 51.71578
[10,] -56.00330 51.73481
[11,] -55.99805 51.73840
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 11
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA

[[2]]
          [,1]     [,2]
[1,] -57.76294 50.88770
[2,] -57.76292 50.88693
[3,] -57.76033 50.88163
[4,] -57.75668 50.88091
[5,] -57.75551 50.88169
[6,] -57.75562 50.88550
[7,] -57.75932 50.88775
[8,] -57.76294 50.88770
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 8
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA

I do not quite understand the structure of this data object (a list of lists, I think), but at this point I resorted to printing it on the console and importing that text into Excel for further cleaning, which is easy enough. I'd like to complete the process within R to save time and to circumvent Excel's limit of around 64000 lines. But I have a hard time figuring out how to clean up this text in R. What I need to produce for PBSmapping is a file where each block of coordinates shares one ID number, called PID, and a variable POS indicates the position of each coordinate within a shape. All other lines must disappear.
So the above would become:

PID POS         X        Y
  1   1 -55.99805 51.68817
  1   2 -56.00222 51.68911
  1   3 -56.01694 51.68911
  1   4 -56.03781 51.68606
  1   5 -56.04639 51.68759
  1   6 -56.04637 51.69445
  1   7 -56.03777 51.70207
  1   8 -56.02301 51.70892
  1   9 -56.01317 51.71578
  1  10 -56.00330 51.73481
  1  11 -55.99805 51.73840
  2   1 -57.76294 50.88770
  2   2 -57.76292 50.88693
  2   3 -57.76033 50.88163
  2   4 -57.75668 50.88091
  2   5 -57.75551 50.88169
  2   6 -57.75562 50.88550
  2   7 -57.75932 50.88775
  2   8 -57.76294 50.88770

First I imported this text file into R:

test <- read.csv2("test file.txt", header = FALSE, sep = ";", colClasses = "character")

I used sep = ";" to ensure there would be only one variable in this file, as it contains no ";". To remove lines that do not contain coordinates, I used the fact that longitudes are expressed as negative numbers, so with my very limited knowledge of grep searches I thought of this, which is probably not the best way to go:

a <- rep("-", length(test$V1))
b <- grep(a, test$V1)

This gives a warning (Warning message: the condition has length > 1 and only the first element will be used in: if (is.na(pattern)) {) but seems to do what I need anyway.

c <- seq(1, length(test$V1))
d <- c %in% b
e <- test$V1[d]

Partial victory: now I only have lines that look like

[1,] -57.76294 50.88770

But I don't know how to go further: the number in square brackets can be used for variable POS, after removing the square brackets and the comma, but this requires a better knowledge of grep than I have. Furthermore, I don't know how to add a PID (polygon ID) variable, i.e. all lines of a polygon must have the same ID, as in the example above (i.e. each time POS == 1 a new polygon starts and PID needs to be incremented by 1, and PID is kept constant for lines where POS != 1).
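The same PID/POS table can also be built without a loop; this sketch assumes polys is a list of 2-column coordinate matrices such as Map2poly returns:

  build.polyset <- function(polys) {
    n  <- sapply(polys, nrow)    # vertices per polygon
    xy <- do.call(rbind, polys)  # stack all coordinates
    data.frame(PID = rep(seq_along(polys), n),    # polygon id
               POS = unlist(lapply(n, seq_len)),  # 1..n within each polygon
               X = xy[, 1], Y = xy[, 2])
  }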
[R] grep help needed
Hi, In another thread (PBSmapping and shapefiles) I asked for an easy way to read shapefiles and transform them into data that PBSmapping could use. One person is exploring some ways of doing this, but it is possible I'll have to do it manually. With package maptools I am able to extract the information I need from a shapefile, but it is formatted like this:

[[1]]
           [,1]     [,2]
 [1,] -55.99805 51.68817
 [2,] -56.00222 51.68911
 [3,] -56.01694 51.68911
 [4,] -56.03781 51.68606
 [5,] -56.04639 51.68759
 [6,] -56.04637 51.69445
 [7,] -56.03777 51.70207
 [8,] -56.02301 51.70892
 [9,] -56.01317 51.71578
[10,] -56.00330 51.73481
[11,] -55.99805 51.73840
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 11
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA

[[2]]
          [,1]     [,2]
[1,] -57.76294 50.88770
[2,] -57.76292 50.88693
[3,] -57.76033 50.88163
[4,] -57.75668 50.88091
[5,] -57.75551 50.88169
[6,] -57.75562 50.88550
[7,] -57.75932 50.88775
[8,] -57.76294 50.88770
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 8
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA

I do not quite understand the structure of this data object (a list of lists, I think), but at this point I resorted to printing it on the console and importing that text into Excel for further cleaning, which is easy enough. I'd like to complete the process within R to save time and to circumvent Excel's limit of around 64000 lines. But I have a hard time figuring out how to clean up this text in R. What I need to produce for PBSmapping is a file where each block of coordinates shares one ID number, called PID, and a variable POS indicates the position of each coordinate within a shape. All other lines must disappear. So the above would become:

PID POS         X        Y
  1   1 -55.99805 51.68817
  1   2 -56.00222 51.68911
  1   3 -56.01694 51.68911
  1   4 -56.03781 51.68606
  1   5 -56.04639 51.68759
  1   6 -56.04637 51.69445
  1   7 -56.03777 51.70207
  1   8 -56.02301 51.70892
  1   9 -56.01317 51.71578
  1  10 -56.00330 51.73481
  1  11 -55.99805 51.73840
  2   1 -57.76294 50.88770
  2   2 -57.76292 50.88693
  2   3 -57.76033 50.88163
  2   4 -57.75668 50.88091
  2   5 -57.75551 50.88169
  2   6 -57.75562 50.88550
  2   7 -57.75932 50.88775
  2   8 -57.76294 50.88770

First I imported this text file into R:

test <- read.csv2("test file.txt", header = FALSE, sep = ";", colClasses = "character")

I used sep = ";" to ensure there would be only one variable in this file, as it contains no ";". To remove lines that do not contain coordinates, I used the fact that longitudes are expressed as negative numbers, so with my very limited knowledge of grep searches I thought of this, which is probably not the best way to go:

a <- rep("-", length(test$V1))
b <- grep(a, test$V1)

This gives a warning (Warning message: the condition has length > 1 and only the first element will be used in: if (is.na(pattern)) {) but seems to do what I need anyway.

c <- seq(1, length(test$V1))
d <- c %in% b
e <- test$V1[d]

Partial victory: now I only have lines that look like

[1,] -57.76294 50.88770

But I don't know how to go further: the number in square brackets can be used for variable POS, after removing the square brackets and the comma, but this requires a better knowledge of grep than I have. Furthermore, I don't know how to add a PID (polygon ID) variable, i.e. all lines of a polygon must have the same ID, as in the example above (i.e. each time POS == 1, a new polygon starts and PID needs to be incremented by 1, and PID is kept constant for lines where POS != 1). Any help will be much appreciated.
Sincerely, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] PBSmapping and shapefiles
Hi, I got no reply to this:

On 16-Jul-05 at 2:42 PM, Denis Chabot wrote: Hi, Is there a way, preferably with R, to read shapefiles and transform them into a format that I could then use with package PBSmapping? I have been able to read such files into R with maptools' read.shape and plot them with plot.Map, but I'd like to bring the data into PBSmapping and plot from there. I also looked at the package shapefiles, but it does not seem to do what I want either. Sincerely, Denis Chabot

but I managed to progress somewhat on my own. Although it does not allow one to use shapefiles in PBSmapping, maptools at least makes it possible to read such files. In some cases I can extract the information I want from not-too-complex shapefiles. For instance, to extract all the lines corresponding to the 60-m isobath in a shapefile, I was able to do:

library(maptools)
test <- read.shape("bathy.shp")
test2 <- Map2lines(test)
bathy60 <- subset(test2, test$att.data$Z == 60)

I do not quite understand the structure of bathy60 (a list of lists, I think), but at this point I resorted to printing bathy60 on the console and importing that text into Excel for further cleaning, which is easy enough. I'd like to complete the process within R to save time and to circumvent Excel's limit of around 64000 lines. But I have a hard time figuring out loops in R, coming from a background of observation-based programs such as SAS. The output of bathy60 looks like this:

[[1]]
           [,1]     [,2]
 [1,] -55.99805 51.68817
 [2,] -56.00222 51.68911
 [3,] -56.01694 51.68911
 [4,] -56.03781 51.68606
 [5,] -56.04639 51.68759
 [6,] -56.04637 51.69445
 [7,] -56.03777 51.70207
 [8,] -56.02301 51.70892
 [9,] -56.01317 51.71578
[10,] -56.00330 51.73481
[11,] -55.99805 51.73840
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 11
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA

[[2]]
          [,1]     [,2]
[1,] -57.76294 50.88770
[2,] -57.76292 50.88693
[3,] -57.76033 50.88163
[4,] -57.75668 50.88091
[5,] -57.75551 50.88169
[6,] -57.75562 50.88550
[7,] -57.75932 50.88775
[8,] -57.76294 50.88770
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 8
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA

What I need to produce for PBSmapping is a file where each block of coordinates shares one ID number, called PID, and a variable POS indicating the number of each coordinate within a shape. All other lines must disappear. So the above would become:

PID POS         X        Y
  1   1 -55.99805 51.68817
  1   2 -56.00222 51.68911
  1   3 -56.01694 51.68911
  1   4 -56.03781 51.68606
  1   5 -56.04639 51.68759
  1   6 -56.04637 51.69445
  1   7 -56.03777 51.70207
  1   8 -56.02301 51.70892
  1   9 -56.01317 51.71578
  1  10 -56.00330 51.73481
  1  11 -55.99805 51.73840
  2   1 -57.76294 50.88770
  2   2 -57.76292 50.88693
  2   3 -57.76033 50.88163
  2   4 -57.75668 50.88091
  2   5 -57.75551 50.88169
  2   6 -57.75562 50.88550
  2   7 -57.75932 50.88775
  2   8 -57.76294 50.88770

I don't know how to do this in R. My algorithm would involve looking at the structure of a line, discarding it if it does not include coordinates, and then creating PID and POS for lines with coordinates, depending on the content of lines i and i-1. In R? Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] PBSmapping and shapefiles
Hi, Is there a way, preferably with R, to read shapefiles and transform them into a format that I could then use with package PBSmapping? I have been able to read such files into R with maptools' read.shape and plot them with plot.Map, but I'd like to bring the data into PBSmapping and plot from there. I also looked at the package shapefiles, but it does not seem to do what I want either. Sincerely, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] how to create a variable to rank within subgroups
Thank you very much Andy, this is exactly what I was looking for. I did not know this function. Sincerely, Denis Chabot

On June 6, 2005 at 21:21, Liaw, Andy wrote: Try something like:

g <- gl(4, 5)
x <- sample(20)
d <- data.frame(g, x)
d
   g  x
1  1 10
2  1  3
3  1 11
4  1 12
5  1 20
6  2 13
7  2  6
8  2  2
9  2  1
10 2 14
11 3 17
12 3 15
13 3 16
14 3 19
15 3  9
16 4  4
17 4  7
18 4  5
19 4  8
20 4 18

d$xrank <- ave(d$x, d$g, FUN = rank)
d
   g  x xrank
1  1 10     2
2  1  3     1
3  1 11     3
4  1 12     4
5  1 20     5
6  2 13     4
7  2  6     3
8  2  2     2
9  2  1     1
10 2 14     5
11 3 17     4
12 3 15     2
13 3 16     3
14 3 19     5
15 3  9     1
16 4  4     1
17 4  7     3
18 4  5     2
19 4  8     4
20 4 18     5

Andy

From: Denis Chabot Hi, I would like to create a new variable that would be the rank of a continuous variable within each level of a grouping variable. Say the first level of the grouping variable has 5 observations: I'd like them ranked from one to five, and a new variable would hold the rank value (one to five), and so forth for each level of the grouping variable. I'm quite new to R and cannot figure this one out by myself. Thanks in advance, Denis Chabot

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] how to create a variable to rank within subgroups
I sure agree the name is not very helpful in guessing what it can do. May I suggest propagate? Denis Chabot

On June 7, 2005 at 06:18, Peter Dalgaard wrote:

Denis Chabot [EMAIL PROTECTED] writes: Thank you very much Andy, this is exactly what I was looking for. I did not know this function.

It's a horrible misnomer though (ave() is originally for replacing values with averages, but obviously has other uses). Any suggestions for a better name (or alias)?

-- Peter Dalgaard, Dept. of Biostatistics, University of Copenhagen, Blegdamsvej 3, 2200 Cph. N, Denmark. Ph: (+45) 35327918, FAX: (+45) 35327907

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] how to create a variable to rank within subgroups
Hi, I would like to create a new variable that would be the rank of a continuous variable within each level of a grouping variable. Say the first level of the grouping variable has 5 observations: I'd like them ranked from one to five, and a new variable would hold the rank value (one to five), and so forth for each level of the grouping variable. I'm quite new to R and cannot figure this one out by myself. Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] New user of R on Mac OS X - Please help
From: Depiereux Constant [EMAIL PROTECTED] Date: March 12, 2005 09:44:08 GMT-05:00 To: r-help@stat.math.ethz.ch Subject: [R] New user of R on Mac OS X - Please help

A brand new Mac OS X user, I am transferring my R work from my Windows machine. When porting some of my functions, I got messages such as:

2005-03-12 15:37:52.456 R[673] *** NSTimer discarding exception 'NSRangeException' (reason '*** NSRunStorage, _NSBlockNumberForIndex(): index (3607) beyond array bounds (2000)') that raised during firing of timer with target 3ba850 and selector 'runRELP:'

Does anybody have an idea where I should start looking for a solution? Thanks in advance for your help. Constant Depièreux

If this is too specific to the Mac platform to receive help here, you might also try posting your message on [EMAIL PROTECTED] Denis

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Problem installing Hmisc
Hi, I do have it installed on two Macs as well (OS X 10.2.8 and 10.3.7) and what I need does work; however, if you run the package-check routine, some problems will likely be revealed. At least there were problems for me. Denis

On February 8, 2005 at 12:23, [EMAIL PROTECTED] wrote:

From: Don MacQueen [EMAIL PROTECTED] Date: February 7, 2005 16:05:14 GMT+01:00 To: Thomas Lumley [EMAIL PROTECTED], Michael Kubovy [EMAIL PROTECTED] Cc: r-help@stat.math.ethz.ch Subject: Re: [R] Problem installing Hmisc

At 10:28 AM -0800 2/5/05, Thomas Lumley wrote: On Sat, 5 Feb 2005, Michael Kubovy wrote: What am I doing wrong? / You can look at the console log (e.g. start Console.app in /Applications/Utilities) to see what the problem is, but the fact that Hmisc is not available as a binary package probably means that it does not compile for the Mac (at least without some modification). -thomas

I have Hmisc installed on a Mac without any modification, i.e., using install.packages(). R itself was installed from source code.

version
platform powerpc-apple-darwin6.8.5
arch     powerpc
os       darwin6.8.5
system   powerpc, darwin6.8.5
status
major    2
minor    0.1
year     2004
month    11
day      15
language R

library(Hmisc)
Hmisc library by Frank E Harrell Jr
Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview to see overall documentation.
NOTE: Hmisc no longer redefines [.factor to drop unused levels when subsetting. To get the old behavior of Hmisc type dropUnusedLevels().
Attaching package 'Hmisc':
The following object(s) are masked from package:stats : ecdf

--- snip ---

Thomas Lumley, Assoc. Professor, Biostatistics, [EMAIL PROTECTED], University of Washington, Seattle

-Don
--
Don MacQueen, Environmental Protection Department, Lawrence Livermore National Laboratory, Livermore, CA, USA

[[alternative text/enriched version deleted]]

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] 2 small problems: integer division and the nature of NA
Thanks to the many R users who convinced me that the sum of NAs should be zero, and who gave me a solution for when I do not want it to be zero. Thank you also for the explanations of rounding errors in floating-point arithmetic; I did not expect them. This small error was a real problem for me, as I was trying to find a way to recode numeric values into intervals. Because I wanted to retain numeric values as a result, I tried not to use cut or cut2. Hence, to convert a range of temperatures into 0.2-degree intervals, I had written (let's first make a fake temperature variable k for testing):

k <- seq(-5, 5, 0.1)
k1 <- ifelse(k < 0, -0.2 * (abs(k) %/% 0.2) - 0.1, 0.2 * (k %/% 0.2) + 0.1)

Note that this works well to quickly recode a numeric variable that only takes integer values, but it produces the problem that prompted my call for help when there are decimals: some values end up in a different class than you'd expect. Considering your answers, I found three solutions:

k2 <- ifelse(k < 0, -0.2 * (abs(round(10 * k)) %/% 2) - 0.1,
             0.2 * (round(10 * k) %/% 2) + 0.1)
k3 <- (-0.1 + min(k)) + 0.2 * as.numeric(cut(k, seq(min(k), max(k) + 0.2, 0.2),
             right = FALSE, labels = FALSE))
k4 <- cut2(k, seq(min(k), max(k) + 0.2, 0.2), levels.mean = TRUE)
k5 <- as.numeric(levels(k4))[k4]

I could round to 1 decimal to be even more exact, but this is good enough. If it can be made more elegant, please let me know!

Denis

Subject: [R] 2 small problems: integer division and the nature of NA

Hi, I'm wondering why 48 %/% 2 gives 24 but 4.8 %/% 0.2 gives 23... I'm not trying to round up here, but to find out how many times something fits into something else, and the answer should have been the same for both examples, no?

On a different topic, I like the behaviour of NAs better in R than in SAS (at least they are not considered the smallest value of a variable), but at the same time I am surprised that the sum of NAs is 0 instead of NA. The sum of a vector having at least one NA but also valid data gives NA if we do not specify na.rm=T. But with na.rm=T, we are telling sum to give the sum of the valid data, ignoring NAs that do not tell us anything about the value of a variable. While getting the sum of small subsets of my data (such as when subsetting by several variables), I found that sometimes a cell contained only NAs for my response variable. I would have expected the sum to be NA in such cases, as I do not have a single data point telling me the value of my response there. But R tells me the sum was zero in that cell! Was this behaviour considered desirable when sum was built? If not, is there any hope it will be fixed? Sincerely, Denis Chabot

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
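A consolidated sketch of the interval-midpoint recode (my own phrasing of the solutions above, reusing the fake vector k):

  breaks <- seq(min(k), max(k) + 0.2, by = 0.2)  # interval boundaries
  mids   <- head(breaks, -1) + 0.1               # midpoint of each interval
  k.mid  <- mids[cut(k, breaks, right = FALSE, labels = FALSE)]
  head(cbind(k, k.mid))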
[R] 2 small problems: integer division and the nature of NA
Hi, I'm wondering why 48 %/% 2 gives 24 but 4.8 %/% 0.2 gives 23... I'm not trying to round up here, but to find out how many times something fits into something else, and the answer should have been the same for both examples, no?

On a different topic, I like the behaviour of NAs better in R than in SAS (at least they are not considered the smallest value of a variable), but at the same time I am surprised that the sum of NAs is 0 instead of NA. The sum of a vector having at least one NA but also valid data gives NA if we do not specify na.rm=T. But with na.rm=T, we are telling sum to give the sum of the valid data, ignoring NAs that do not tell us anything about the value of a variable. While getting the sum of small subsets of my data (such as when subsetting by several variables), I found that sometimes a cell contained only NAs for my response variable. I would have expected the sum to be NA in such cases, as I do not have a single data point telling me the value of my response there. But R tells me the sum was zero in that cell! Was this behaviour considered desirable when sum was built? If not, is there any hope it will be fixed? Sincerely, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
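The arithmetic behind the surprise, sketched in a few lines (the exact printed digits may vary by platform):

  # neither 4.8 nor 0.2 has an exact binary representation, so the
  # quotient falls just short of 24 before %/% floors it
  sprintf("%.17f", 4.8 / 0.2)  # "23.99999999999999645" or similar
  4.8 %/% 0.2                  # 23
  round(10 * 4.8) %/% 2        # 24: work in integers, as in the follow-up post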
[R] recoding large number of categories (select in SAS)
Hi, I have data on stomach contents. Possible prey species number in the hundreds, so a list of prey codes has been in use in many labs doing this kind of work. When the time comes to analyse these data, one often wants to regroup prey into broader categories, especially for rare prey. In SAS you can nest a large number of if-else statements, or do this more cleanly with select, like this:

select;
  when (149 <= prey <= 150)                      preyGr = 150;
  when (186 <= prey <= 187)                      preyGr = 187;
  when (prey = 438)                              preyGr = 438;
  when (prey = 430)                              preyGr = 430;
  when (prey = 436)                              preyGr = 436;
  when (prey = 431)                              preyGr = 431;
  when (prey = 451)                              preyGr = 451;
  when (prey = 461)                              preyGr = 461;
  when (prey = 478)                              preyGr = 478;
  when (prey = 572)                              preyGr = 572;
  when (692 <= prey <= 695)                      preyGr = 692;
  when (808 <= prey <= 826, 830 <= prey <= 832)  preyGr = 808;
  when (997 <= prey <= 998, 792 <= prey <= 796)  preyGr = 792;
  when (882 <= prey <= 909)                      preyGr = 882;
  when (prey in (999, 125, 994))                 preyGr = 9994;
  otherwise                                      preyGr = 1;
end; *select;

The number of transformations is usually much larger than in this short example. What is the best way of doing this in R? Sincerely, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
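One possible R translation (my sketch, not a posted answer): build the default first, then overwrite by group, assuming prey is a numeric vector of prey codes:

  preyGr <- rep(1, length(prey))                     # SAS 'otherwise'
  self <- c(438, 430, 436, 431, 451, 461, 478, 572)  # codes kept as-is
  preyGr[prey %in% self] <- prey[prey %in% self]
  preyGr[prey >= 149 & prey <= 150] <- 150
  preyGr[prey >= 186 & prey <= 187] <- 187
  preyGr[prey >= 692 & prey <= 695] <- 692
  preyGr[(prey >= 808 & prey <= 826) | (prey >= 830 & prey <= 832)] <- 808
  preyGr[(prey >= 997 & prey <= 998) | (prey >= 792 & prey <= 796)] <- 792
  preyGr[prey >= 882 & prey <= 909] <- 882
  preyGr[prey %in% c(999, 125, 994)] <- 9994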
RE: [R] subsetting like in SAS
I want to thank Petr Pikal, Robert Balshaw and Na Li for suggesting the use of unique or !duplicated on a subset of my data where unwanted variables have been removed. This worked perfectly. Denis Chabot

On 13 Jan 2005 at 11:52, Denis Chabot wrote:

Hi, In the process of translating some of my SAS programs to R, I encountered one difficulty. I have a solution, but it is not elegant (and not pleasant to implement). I have a large dataset with many variables needed to identify the origin of a sample, many to describe sample characteristics, and others to describe site characteristics. I want only a (shorter) list of sites and their characteristics. If origin, ship_cat, ship_nb, trip and set are needed to identify a site, in SAS you'd sort on those variables and then read the data with:

data sites;
  set alldata;
  by origin ship_cat ship_nb trip set;
  if first.set;
  keep list-of-variables-detailing-sites;
run;

In R I did this with the Lag function of Hmisc; the original dataset also needs to be sorted first:

oL  <- Lag(origin)
scL <- Lag(ship_cat)
snL <- Lag(ship_nb)
tL  <- Lag(trip)
sL  <- Lag(set)
same <- origin == oL & ship_cat == scL & ship_nb == snL & trip == tL & set == sL
sites <- subset(alldata, !same, select = c(list-of-variables-detailing-sites))

Could I do better than this?

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
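The suggested approach in one minimal sketch; site.vars is a hypothetical vector naming the site-describing columns:

  key.vars  <- c("origin", "ship_cat", "ship_nb", "trip", "set")
  site.vars <- c(key.vars, "depth", "temp")  # made-up site columns
  # keep the first row of each key combination, site columns only
  sites <- alldata[!duplicated(alldata[, key.vars]), site.vars]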
[R] Re: labels attached to variable names
Thank you to the many helpful list members who answered. I am grateful to have been told about attr and about Hmisc. Actually, Hmisc also brings me solutions to many other things on my wish list! Cheers, Denis

On Friday, January 7, 2005, at 12:11 Europe/Paris, [EMAIL PROTECTED] wrote:

From: Denis Chabot [EMAIL PROTECTED] Date: January 6, 2005 15:40:07 Europe/Paris To: r-help@stat.math.ethz.ch Subject: [R] labels attached to variable names

Hi, Can we attach a more descriptive label (I may use the wrong terminology, which would explain why I found nothing in the FAQ) to variable names, and later have an easy way to switch to these labels in plots? I fear this is not possible and one must enter this by hand as ylab and xlab when making plots. Thanks in advance, Denis Chabot

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
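For the record, the Hmisc route looks roughly like this (the data frame and label text are invented):

  library(Hmisc)
  df <- data.frame(temp = runif(10, 0, 15))
  label(df$temp) <- "Bottom temperature (C)"  # attach a descriptive label
  plot(df$temp, ylab = label(df$temp))        # reuse the label in a plot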
[R] how to do by-processing using weighted.mean
Hi, I'd like to compute the weighted mean of some variables, all using the same weight variable, for each combination of 3 factor variables. I found out how to use summarize (from Hmisc) to compute ordinary means for combinations of 3 factors, but I cannot find a way of doing weighted means. Is this possible in R? Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
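One way to do this, sketched on invented names (a data frame df with response y, weight w and factors f1, f2, f3):

  df <- data.frame(y = rnorm(40), w = runif(40),
                   f1 = gl(2, 20), f2 = gl(4, 10), f3 = gl(5, 8))
  # apply weighted.mean within each factor combination
  wm <- by(df, df[c("f1", "f2", "f3")],
           function(d) weighted.mean(d$y, d$w, na.rm = TRUE))
  wm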
[R] labels attached to variable names
Hi, Can we attach a more descriptive label (I may use the wrong terminology, which would explain why I found nothing in the FAQ) to variable names, and later have an easy way to switch to these labels in plots? I fear this is not possible and one must enter this by hand as ylab and xlab when making plots. Thanks in advance, Denis Chabot

---
Denis Chabot, Ph.D.
Chercheur en bioénergétique / Bioenergetics researcher
Institut Maurice-Lamontagne
Pêches et Océans Canada / Fisheries & Oceans Canada
CP/PB 1000, Mont-Joli, QC, G5H 3Z4, Canada
(418) 775-0624
http://www.qc.dfo-mpo.gc.ca/iml

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] position of labels in vis.gam or persp
Hi, Is there a way to control the position of labels in 3D plots such as those produced by vis.gam (in mgcv) or persp? Some of my plots have axis labels (titles) that collide with the tick labels. I have increased the margin around the plot so that there is plenty of space to move the axis labels further away from their axes, but I found no option for this in the documentation of either vis.gam or persp. The latter states that some par options can work, but the one that seemed promising (mgp) does not have any effect in vis.gam. Thanks for any help, Denis

---
Denis Chabot, Ph.D.
Chercheur en bioénergétique / Bioenergetics researcher
Institut Maurice-Lamontagne
Pêches et Océans Canada / Fisheries & Oceans Canada
CP/PB 1000, Mont-Joli, QC, G5H 3Z4, Canada
(418) 775-0624
http://www.qc.dfo-mpo.gc.ca/iml

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
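One workaround (my sketch, not a reply from the archive): suppress persp's own axis titles and place them manually with trans3d() from grDevices; the surface and label strings below are invented:

  x <- y <- seq(-2, 2, length = 30)
  z <- outer(x, y, function(a, b) exp(-(a^2 + b^2)))
  pm <- persp(x, y, z, theta = 30, phi = 25, ticktype = "detailed",
              xlab = "", ylab = "", zlab = "")
  # project 3D positions onto the 2D device and draw the titles ourselves
  text(trans3d(mean(x), min(y) - 1.8, min(z), pm), "Temperature", xpd = NA)
  text(trans3d(min(x) - 1.8, mean(y), min(z), pm), "Depth", xpd = NA)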