[R] combining dataframes with different numbers of columns

2006-11-07 Thread Denis Chabot
Dear list members,

I have to combine dataframes. However, they contain different numbers of
variables. Sometimes all the variables in the dataframe with fewer variables
are contained in the dataframe with more variables, but this is not always
the case.

There are key variables identifying observations. These could be used  
in a merge statement, although this won't quite work for me (see below).

I was hoping to find a way to combine dataframes where I needed only  
to ensure the key variables were present. The total number of  
variables in the final dataframe would be the total number of  
different variables in both initial dataframes. Variables that were  
absent in one dataframe would automatically get missing values in the  
joint dataframe.

Here is a simple example. The initial dataframes are a and b. All  
variables in b are also in a.

a <- data.frame(X=seq(1,10), matrix(runif(100, 0,15), ncol=10))
b <- data.frame(X=seq(16,20), X4=runif(5,0,15))

A merge does not work because the common variable X4 becomes 2  
variables, X4.x and X4.y.

c <- merge(a, b, by="X", all=T)

This can be fixed but it requires several steps (although my solution  
is probably not optimal):

names(c)[5] <- "X4"
c$X4[is.na(c$X4)] <- c$X4.y[is.na(c$X4)]
c <- c[,1:11]

One quickly tires of this solution with my real-life dataframes, where
different columns would require this kind of repair from one case to the next.

For now I still prefer making the narrower dataframe match the wider one, using upData() from the Hmisc package:

b2 <- upData(b, X1=NA, X2=NA, X3=NA, X5=NA, X6=NA, X7=NA, X8=NA,
             X9=NA, X10=NA)
b2 <- b2[,c(1, 3:5, 2, 6:11)]

d <- rbind(a, b2)

But again this requires quite a bit of fine-tuning from one case to  
the next in my real-life dataframes.

I suspect R has a neat way to do this and I just cannot come up with  
the proper search term to find help on my own.

Or can this be automated: is there a way to compare the variable lists of two
dataframes and add the missing variables to the narrower one?

Ideally, the solution would be able to handle the situation where the  
narrower dataframe contains one or more variables that are absent  
from the wider one. If this was the case, I'd like the new variable  
to be present in the combined dataframe, with missing values given to  
the observations from the wider dataframe.
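
Something along these lines is what I am hoping already exists (a rough sketch
of my own; rbind.all.columns is just a made-up name):

rbind.all.columns <- function(x, y) {
    x.missing <- setdiff(names(y), names(x))  # columns y has but x lacks
    y.missing <- setdiff(names(x), names(y))  # columns x has but y lacks
    x[x.missing] <- NA                        # pad each with NA columns
    y[y.missing] <- NA
    rbind(x, y[names(x)])                     # put y's columns in x's order
}
d <- rbind.all.columns(a, b)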

Thanks in advance,

Denis Chabot

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] combining dataframes with different numbers of columns

2006-11-07 Thread Denis Chabot
Thank you very much Hadley, Stephen, and Sundar; your suggestions all solve
my problem, even when the narrower dataframe contains variables that are new
to the wider dataframe.

I'm sure glad I took the time to write instead of pursuing my ugly and
time-consuming solution.

Denis





[R] distance between legend title and legend box

2006-10-26 Thread Denis Chabot
Hi,

I've looked at the parameters available for the legend function and cannot
find a way to change the distance between the top of the box surrounding a
legend and the legend's title. I have a math expression that increases the
height of my title.

If you don't mind the nonsensical title I give to the legend for this plot
(Figure 3.20 in R Graphics):

with(iris,
     plot(Sepal.Length, Sepal.Width,
          pch=as.numeric(Species), cex=1.2))
legend(6.5, 4.2, c("setosa", "versicolor", "virginica"),
       cex=1, pch=1:3, title=expression(kg/km^2))

The result depends on the device, but I think any device will show that the
top of the box needs to be raised a bit: in quartz, the top of the box passes
through the middle of the 2; in pdf it is acceptable, but only just
(the top of the box lightly touches the top of the 2).
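
Here is the workaround I am contemplating (only a sketch of my own, not a
documented legend() option): reserve the title line with a blank title, then
place the expression myself a little lower inside the box, using the
coordinates legend() returns.

with(iris,
     plot(Sepal.Length, Sepal.Width, pch=as.numeric(Species), cex=1.2))
lg <- legend(6.5, 4.2, c("setosa", "versicolor", "virginica"),
             cex=1, pch=1:3, title=" ")   # blank title only reserves the line
text(lg$rect$left + lg$rect$w/2, lg$rect$top, expression(kg/km^2),
     adj=c(0.5, 1.5))                     # adj[2] > 1 pushes the title downward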

Sincerely,

Denis Chabot



[R] out.format for chron

2006-10-19 Thread Denis Chabot
Dear R users,

Do you know of a way to specify an out.format for a chron object that would
use numbers for the months and yet 4 digits for the year?

I have tried out.format=c("d-m-year") (note the "m" instead of either "mon"
or "month") but got 27-Feb-1992.

Also, the help for chron tells us how to define an out.format when we  
create a chron object, but how can you change the out.format of an  
existing chron object?

Thanks in advance,

Denis



Re: [R] out.format for chron

2006-10-19 Thread Denis Chabot
Got it Gabor,

Thank you very much.

Denis

On 2006-10-19 at 10:38, Gabor Grothendieck wrote:

 For your second question:

 x <- chron(1)
 x <- chron(x, out.format = ddmm)

 using the ddmm from below.

 On 10/19/06, Gabor Grothendieck [EMAIL PROTECTED] wrote:
 As discussed in the Help Desk article in R News 4/1, the 2- vs 4-digit year
 length is controlled by the chron.year.abb option, e.g.

   options(chron.year.abb = FALSE)
   chron(20)

 however, as also discussed there, it's not really recommended that
 you use this option, so try this instead:

   ddmm <- function(x)
     with(month.day.year(x), sprintf("%02.f-%02.f-%04.f", day, month, year))
   chron(20, out.format = ddmm)






[R] functionality of update in SAS

2006-09-20 Thread Denis Chabot
Dear list,

I've tried to search the archives but found nothing, although I may have used
the wrong wording in my searches. I've also double-checked the upData
function in Hmisc, but it does something else.

I'm wondering if one can update a dataframe by forcing into it a  
shorter dataframe containing the corrections, like the update  
provided in SAS data steps.

In this simple example:
a <- data.frame(id=c(1:5), x=rnorm(5))
b <- data.frame(id=4, x=rnorm(1))
> a
  id          x
1  1  0.6557921
2  2  0.1897523
3  3  0.7976721
4  4  0.2107103
5  5 -0.8855786
> b
  id         x
1  4 0.8369147

I would like the updated dataframe to look like (row names are not  
important to me)

id  x
1   1  0.6557921
2   2  0.1897523
3   3  0.7976721
4   4  0.8369147
5   5 -0.8855786

I thought this could be done with merge, but this never removes the  
old version of a row, it just gives me two rows with id==4.

I thought of this solution:

reject <- a$id %in% b$id
a2 <- a[!reject,]
a3 <- rbind(a2,b)
> a3
   id          x
1   1  0.6557921
2   2  0.1897523
3   3  0.7976721
5   5 -0.8855786
11  4  0.8369147

This works, and obviously it is not the best way to make the  
correction in a simple case like this. But providing a few lines of  
corrected data can be an effective method with large dataframes,  
especially if many identifier (grouping) variables are needed to  
identify each line that needs updating, and in this context my  
solution above rapidly becomes ugly.

Furthermore (but I can live with this constraint) this method removes  
entire rows, so I need to make sure the dataframe used to make  
corrections contains all the Y variables in the original dataframe,  
even those that do not need correcting.

If a method exists to just change one variable in 5 lines for a  
dataframe of 5000 lines and 30 variables, I'd appreciate learning  
about it. But I'll already be thrilled if I can update whole lines at  
a time.
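
For the record, the kind of one-step update I am imagining would look
something like this (my own sketch, not an existing function):

i <- match(b$id, a$id)             # rows of a that b corrects
vars <- setdiff(names(b), "id")    # every non-key variable supplied in b
a[i, vars] <- b[vars]              # overwrite only those rows and columns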

Sincerely,

Denis Chabot



Re: [R] merge gives me too many rows

2006-09-18 Thread Denis Chabot
Hi Phil and Don,
On 2006-09-18 at 05:02, Phil Spector wrote:

 Denis -
    As long as there is one data frame that has exactly 0 or 1 observations
 for each level of the by variables, merge in R will operate identically
 to SAS without the need for any special options.
    The problem arises when there are multiple observations for some of the
 levels of the by variables in both of the data frames. In relational
 databases and R, new observations are created to represent every combination
 of observations in the original data set that share the same level of the by
 variables. In SAS, the last observation in the data set with fewer
 observations having a particular level of the by variable is repeated enough
 times to provide a match for each observation in the other data set.
 You can see this with a tiny example -- look at the observations
 in the output data set corresponding to id==3:

 a = data.frame(id=c(1,2,3,3,4), x=rnorm(5))
 b = data.frame(id=c(1,2,3,3,3,4), y=rnorm(6))
 c = merge(a,b)
 dim(a)
 [1] 5 2
 dim(b)
 [1] 6 2
 dim(c)
 [1] 9 3
 c
   id           x            y
 1  1 -2.15987259 -1.555878070
 2  2  0.96856842 -0.666941824
 3  3  1.68080409 -0.009239307
 4  3  0.06044914 -0.009239307
 5  3  1.68080409 -0.242538473
 6  3  0.06044914 -0.242538473
 7  3  1.68080409 -1.298344557
 8  3  0.06044914 -1.298344557
 9  4  1.76424928 -0.420144744

 The important thing to remember is that there is no unambiguous way  
 to merge two data sets when there are differing numbers of  
 observations for a given level of
 the by variable in the two data sets -- sas and R simply choose  
 different strategies
 in this case.

- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
[EMAIL PROTECTED]



I knew this, but it is good to be reminded and forced to verify one's
assumptions.

Originally, the information about stomach contents and fish size came from
the same, larger database. I made the two smaller databases to reduce the
memory footprint in R when not all variables were needed. The method I used,
actually suggested to me on this very helpful list a long time ago, ensured
that all rows were unique:

fish <- subset(esto, select=c(predateu, origin, navire, nbpc, no_rel,
                              trait, tagno, longueur, masse))
fish <- unique(fish)   # NO NEED TO SORT FIRST

Or at least it would have, had there been NO MISTAKES in my database to start
with. It is this false confidence that all rows had to be unique that made me
pull my hair out last night and assume the problem was with merge.

I just verified and found that some fish have two lengths or masses, either
mistakes made when entering them into the databases or mistakes made when
numbering fish (two fish with the same number). I'll go back to the raw data
to fix this! So thanks for making me verify this assumption.
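
For future merges I will probably run a quick check for duplicated keys
first, along these lines (just a sketch, using the key variables from my own
data):

keys <- c("predateu", "origin", "navire", "nbpc", "no_rel", "trait", "tagno")
dup <- duplicated(fish[keys])
if (any(dup)) fish[dup, keys]   # the key combinations that occur more than once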

On 2006-09-18 at 01:38, Don MacQueen wrote:
 I think you may misunderstand the meaning of all.x = FALSE.

 Setting all.x to false ensures that only rows of x that have  
 matches in y will be included. Equivalently, if a row of x is not  
 matched in y, it will not be in the output. However, if a row in x  
 is matched by more than one row in y, then that row will be  
 repeated as many times as there are matching rows in y. That is,  
 you have a 1 to many match (1 in x to many in y). SAS behaves the  
 same way.

 Are you sure this is not what is happening?

 Also, all.x = FALSE is the default; it is not necessary to specify  
 it. In fact, the default is to output only rows that are found in  
 both x and y (matching on the specified variables, of course).

 -Don

Thank you Don,

Looking at my old programs that use merge, I always used all.x=T. Yesterday,
because I was getting more rows than expected, I assumed that switch caused
it and started setting it to F (i.e., early on, when I got extra lines, I
assumed that fish measured in the second dataframe, but for which I did not
have stomach contents, were being merged in anyway). I never went back to
all.x=T after verifying it was not the cause of the problem; instead, some
lines in the first dataframe were duplicated (multiple matches, it turns
out). But you are right. In fact I ran some tests, and the behaviour I want
in this and most cases is all.x=T and all.y=F: if I ever found a case where
I had a line in the first dataframe with no match in the second, I'd want to
keep that line in the final dataframe.

Again, many thanks,

Denis


[R] merge gives me too many rows

2006-09-17 Thread Denis Chabot
Hi,

I am using merge to add some variables to an existing dataframe. I  
use the option all.x=F so that my final dataframe will only have as  
many rows as the first file I name in the call to merge.

With a large dataframe using a lot of by variables, the number of  
rows of the merged dataframe increases from 177325 to 179690:

> dim(test)
[1] 177325      9
> test2 <- merge(test, fish, by=c("predateu", "origin", "navire",
                 "nbpc", "no_rel", "trait", "tagno"), all.x=F)
> dim(test2)
[1] 179690     11

I tried to make a smaller dataset with R commands that I could post  
here so that other people could reproduce, but merge behaved as  
expected: final number of rows was the same as the number of rows in  
the first file named in the call to merge.

I took a subset of my large dataframe and could mail this to anyone  
interested in verifying the problem.

> test3 <- test[11:16,]
> dim(test3)
[1] 6 9
> test4 <- merge(test3, fish, by=c("predateu", "origin", "navire",
                 "nbpc", "no_rel", "trait", "tagno"), all.x=F)
> dim(test4)
[1] 6004311

I compared test3 and test4 line by line. The first 11419 lines were  
the same (except for added variables, obviously) in both dataframes,  
but then lines 11420 to 11423 were repeated in test4. Then no problem  
for a lot of rows, until rows 45756-45760 in test3. These are offset  
by 4 in test4 because of the first group of extraneous lines just  
reported, and are found on lines 45760 to 45765. But they are also  
repeated on lines 45765 to 45769. And so on a few more times.

Thus merge added lines (repeated a small number of lines) to the  
final dataframe despite my use of all.x=F.

Am I doing something wrong? If not, is there a solution? Not being  
able to merge is a setback! I was attempting to move the last few  
things I was doing with SAS to R...

Please let me know if you want the file test3 (2.3 MB as a csv file,  
but only 352 KB in R (.rda) format).

Sincerely,

Denis Chabot

> R.Version()
$platform
[1] "powerpc-apple-darwin8.6.0"

$arch
[1] "powerpc"

$os
[1] "darwin8.6.0"

$system
[1] "powerpc, darwin8.6.0"

$status
[1] ""

$major
[1] "2"

$minor
[1] "3.1"

$year
[1] "2006"

$month
[1] "06"

$day
[1] "01"

$`svn rev`
[1] "38247"

$language
[1] "R"

$version.string
[1] "Version 2.3.1 (2006-06-01)"



Re: [R] reshaping a dataset

2006-09-13 Thread Denis Chabot
Thank you Gabor,

I'll need to explore a bit the reshape package to see what benefits I  
get compared with the basic reshape function, but I'm glad you made  
me aware of it.

And your solution for fixing the NAs only in the columns I want is exactly
what I needed.

Many thanks,

Denis
On 2006-09-13 at 00:55, Gabor Grothendieck wrote:

 I missed your second question, which was how to set the NAs to zero
 for some of the columns.  Suppose we want to replace the NAs in the
 columns given by ic, and for the sake of example suppose ic specifies
 columns 1 to 8:

 library(reshape)
 testm <- melt(test, id = 1:6)
 out <- cast(testm, nbpc + trip + set + tagno + depth ~ prey, sum)

 # fix up NAs
 ic <- 1:8
 out2 <- out[,ic]
 out2[is.na(out2)] <- 0
 out[,ic] <- out2

 On 9/13/06, Gabor Grothendieck [EMAIL PROTECTED] wrote:
 If I understand this correctly we want to sum the mass over each  
 combination
 of the first 6 variables and display the result with the 6th, prey,
 along the top and the others along the side.

 library(reshape)
 testm <- melt(test, id = 1:6)
 cast(testm, nbpc + trip + set + tagno + depth ~ prey)

 Now fix up the NAs.


[R] reshaping a dataset

2006-09-12 Thread Denis Chabot
Hi,

I'm trying to move to R the last few data handling routines I was  
performing in SAS.

I'm working on stomach content data. In the simplified example I provide
below, there are variables describing the origin of each prey item (nbpc is a
ship number, each ship may have been used on different trips, each trip has
stations, and individual fish (tagno) can be caught at each station).

For each stomach the number of lines corresponds to the number of  
prey items. Thus a variable identifies prey type, and others (here  
only one, mass) provide information on prey abundance or size or  
digestion level.

Finally, there can be accompanying variables that are not used but  
that I need to keep for later analyses (e.g. depth in the example  
below).

At some point I need to transform such a dataset into another format  
where each stomach occupies a single line, and there are columns for  
each prey item.

The reshape function works really well, my program is in fact  
simpler than the SAS equivalent (not shown, don't want to bore you,  
but available on request), except that I need zeros when prey types  
are absent from a stomach instead of NAs, a problem for which I only  
have a shaky solution at the moment:

1) creation of a dummy dataset:
###
nbpc <- rep(c(20,34), c(110,90))
trip <- c(rep(1:3, c(40, 40, 30)), rep(1:2, c(60,30)))
set <- c(rep(1:4, c(10, 8, 7, 15)), rep(c(10,12), c(25,15)), rep(1:3, rep(10,3)),
         rep(10:12, c(20, 10, 30)), rep(7:8, rep(15,2)))
depth <- c(rep(c(100, 150, 200, 250), c(10, 8, 7, 15)), rep(c(100,120), c(25,15)),
           rep(c(75, 50, 200), rep(10,3)),
           rep(c(200, 150, 50), c(20, 10, 30)), rep(c(100, 250), rep(15,2)))
tagno <- rep(round(runif(42,1,200)),
             c(7,3, 4,4, 2,2,3, 5,5,5, 4,6,4,3,5,3, 7,8, 4,6, 5,5, 7,3,
               6,6,4,4, 4,6, 3,3,4,5,5,6,4, 5,5,5, 8,7))
prey.codes <- c(187, 438, 792, 811)
prey <- sample(prey.codes, 200, replace=T)
mass <- runif(200, 0, 10)

test <- data.frame(nbpc, trip, set, depth, tagno, prey, mass)


Because there are often multiple occurrences of the same prey in a single
stomach, I need to sum them for each stomach before using reshape. Here I use
summaryBy because my understanding of the many variants of apply is not very
good:


test2 <- summaryBy(mass~nbpc+trip+set+tagno+prey, data=test, FUN=sum,
                   keep.names=T, id=~depth)

#this messes up sorting order, I fix it
k <- order(test2$nbpc, test2$trip, test2$set, test2$tagno)
test3 <- test2[k,]
result <- reshape(test3, v.names="mass", idvar=c("nbpc", "trip", "set", "tagno"),
                  timevar="prey", direction="wide")
#

I'm quite happy with this, although you may know of better ways of doing it.
But my problem is with prey that are absent from a stomach: in later
analyses, I need them to have zero abundance instead of NA.
My shaky solution is:
#
empties <- is.na(result)
result[empties] <- 0
#

which did the job in this example, but it won't always. For instance  
there could have been NAs for depth, which I do not want to become  
zero.

Is there a way to transform NAs into zeros for multiple columns of a  
dataframe in one step, while ignoring some columns?

Or maybe there is another way to achieve this that would have put  
zeros where I need them (i.e. something else than reshape)?
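
The kind of one-step fix I have in mind would look something like this (only
a sketch, restricted to the mass columns that reshape creates):

mass.cols <- grep("^mass\\.", names(result))       # only the prey-mass columns
result[mass.cols][is.na(result[mass.cols])] <- 0   # depth and the others are left alone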

Thanking you in advance,

Denis Chabot



Re: [R] transparent background for PDF

2006-03-25 Thread Denis Chabot
As far as I know, PowerPoint does not import PDFs. Without telling you, it
converts them to PNG, and it is not very smart about making a high-resolution
PNG at that. Maybe it is also not very clever about taking the transparent
background into account?

When I must use PowerPoint, I either convert the PDF to PNG myself, or
prepare the presentation in Keynote and export to PowerPoint format. This
also translates the PDF figures into PNG figures, but at much better
resolution than what PowerPoint does.

I tried making both PDF and PNG versions from my R program. I had hoped to
use the same settings for both devices (just copy and paste the plot
instructions into a call for each device). Unfortunately it did not work: to
do better than PowerPoint's translation, I had to save the PNG version of my
plots at high resolution, which meant that all the adjustments that made the
PDF version look nice (position of text, size of margins, etc.) had to be
fine-tuned differently for the PNG version, and that was too time-consuming...
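
The idea was roughly the following (a sketch only: here I wrap the plot code
in a function instead of copy-pasting it, and the file names and sizes are
just placeholders):

make.fig <- function() {
    plot(1:10)                     # the real plotting code goes here
}
pdf("fig.pdf", width=5, height=4, bg="transparent"); make.fig(); dev.off()
png("fig.png", width=1500, height=1200, res=300); make.fig(); dev.off()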

Denis
On 2006-03-25 at 06:00, [EMAIL PROTECTED] wrote:

 From: Dennis Fisher [EMAIL PROTECTED]
 Date: 24 March 2006 13:30:24 EST
 To: r-help@stat.math.ethz.ch
 Subject: [R] transparent background for PDF


 Colleagues

 Running R2.2.1 on either a Linux (RedHat 9) or Mac (10.4) platform.

 I created a PDF document using pdf("FILENAME.pdf", bg="transparent",
 version="1.4").  I then imported the graphic into PowerPoint -
 background was set to a non-transparent color.  I was hoping that the
 inserted graphic would be transparent - instead, it had a white
 background.

 The help pages indicate:

   The 'version' argument modifies the sort of PDF code that gets
   produced.  At the moment this only concerns the production of
   transparent output.  The version must be greater than 1.4 for
   transparent output to be produced.  Specifying a lower version
   number may be useful if you want to produce PDF output that  
 can be
   viewed on older PDF viewers.

 The archives reveal the following:
 From: Benno Ptz puetz
 Date: Thu Nov 4 14:20:28 2004
 Hello,

 I have run across the following problem:

 Creating PDF files manually by using

 pdf(version="1.4")

 I can make graphs using the new transparency feature of R2.0.
 Note that I used version 1.4.  Attempts to use larger numbers (1.5,
 2.0, etc.) resulted in error messages:

 Error in pdf("FILENAME.pdf", bg = "transparent",  :
   invalid PDF version

 I also attempted to address this with par(bg="transparent") without
 success.

 Does anybody have any ideas?

 Dennis


 Dennis Fisher MD
 P  (The P Less Than Company)
 Phone: 1-866-PLessThan (1-866-753-7784)
 Fax: 1-415-564-2220
 www.PLessThan.com



[R] legend in bubble plots made with symbols()

2006-03-18 Thread Denis Chabot
Hi,

I have read about the use of symbols() to draw circles of different sizes,
but I have not been able to find out how to add a legend to such a graph, a
legend that would display some specific sizes and their meaning.

Before finding the symbols function in Paul Murrell's book, I had rolled my
own function where the variable I wanted to control circle size was used to
control cex, and I was able to draw a legend afterward. symbols() seems a bit
simpler and I wanted to see if it would be better than my own function, but
without a legend it is less useful. However, I'm sure there is a way, which
I'm not aware of, to draw a legend for a plot drawn with symbols()...
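
What I have in mind is something like this sketch (my own rough idea, not a
built-in legend; the reference values and positions are arbitrary):

z <- runif(50, 0, 100)                      # the variable controlling circle size
x <- runif(50); y <- runif(50)
symbols(x, y, circles=sqrt(z), inches=0.3)

ref <- c(10, 50, 100)                       # reference values for the "legend"
# reuse the main plot's value-to-size scaling so the reference circles match
symbols(rep(0.9, 3), c(0.95, 0.85, 0.75), circles=sqrt(ref),
        inches=0.3*max(sqrt(ref))/max(sqrt(z)), add=TRUE)
text(0.97, c(0.95, 0.85, 0.75), ref)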

Thanks in advance,

Denis Chabot



[R] predicted values in mgcv gam

2006-03-05 Thread Denis Chabot
Hi,

In fitting GAMs to assess environmental preferences, I use the part  
of the fit where the lower confidence interval is above zero as my  
criterion for positive association between the environmental variable  
and species abundance. However I like to plot this on the original  
scale of species abundance. To do so I extract the fit and SE using  
predict.gam.

Lately I compared more carefully the plots I obtain in this way and  
those obtained with plot.gam and noticed differences which I do not  
understand.

To avoid sending a large dataset I took an example from gam Help to  
illustrate this.

Was I wrong to believe that the fit and its confidence band should  
behave the same way on both scales?

Thanks in advance,

Denis Chabot
###
library(mgcv)
set.seed(0)
n <- 400
sig <- 2
x0 <- runif(n, 0, 1)
x1 <- runif(n, 0, 1)
x2 <- runif(n, 0, 1)
x3 <- runif(n, 0, 1)
f0 <- function(x) 2 * sin(pi * x)
f1 <- function(x) exp(2 * x)
f2 <- function(x) 0.2*x^11*(10*(1-x))^6+10*(10*x)^3*(1-x)^10
f <- f0(x0) + f1(x1) + f2(x2)
g <- exp(f/4)
y <- rpois(rep(1,n), g)
mean.y <- mean(y)

gam2 <- gam(y ~ s(x2), poisson)

# to plot on the response scale
val.for.pred <- data.frame(x2=seq(min(x2), max(x2), length.out=100))
pred.2.resp <- predict.gam(gam2, val.for.pred, type="response", se.fit=TRUE)
lower.band <- pred.2.resp$fit - 2*pred.2.resp$se.fit
upper.band <- pred.2.resp$fit + 2*pred.2.resp$se.fit
pred.2.resp <- data.frame(val.for.pred, pred.2.resp, lower.band, upper.band)

# same thing on the term scale
pred.2.term <- predict.gam(gam2, val.for.pred, type="terms", se.fit=TRUE)
lower.band <- pred.2.term$fit - 2*pred.2.term$se.fit
upper.band <- pred.2.term$fit + 2*pred.2.term$se.fit
pred.2.term <- data.frame(val.for.pred, pred.2.term, lower.band, upper.band)
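
# (An aside I am adding for comparison, not part of the original question: a
#  sketch of the same band built on the link scale and then back-transformed.)
pred.2.link <- predict.gam(gam2, val.for.pred, type="link", se.fit=TRUE)
lower.band.link <- exp(pred.2.link$fit - 2*pred.2.link$se.fit)
upper.band.link <- exp(pred.2.link$fit + 2*pred.2.link$se.fit)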

# it is easier to compare two plots instead of looking at these two  
data.frames

plot(gam2, residuals=T, pch=1, cex=0.7)
abline(h=0)

plot(y ~ x2, col=grey(0.5))
lines(fit ~ x2, col="blue", data=pred.2.resp)
lines(lower.band ~ x2, col="red", lty=2, data=pred.2.resp)
lines(upper.band ~ x2, col="red", lty=2, data=pred.2.resp)
abline(h=mean.y, lty=5, col=grey(0.35))



[R] transforming data frame for use with persp

2006-02-13 Thread Denis Chabot
Hi,

This is probably documented, but I cannot find the right words or  
expression for a search. My attempts failed.

I have a data frame of 3 vectors (x, y and z) and would like to  
transform this so that I could use persp. Presently I have y-level  
copies of each x level, and a z value for each x-y pair. I need 2  
columns giving the possible levels of x and y, and then a  
transformation of z from a long vector into a matrix of x-level rows  
and y-level columns. How do I accomplish this?

In this example, I made a set of x and y values to get predictions  
from a GAM, then combined them with the predictions into a data  
frame. This is the one I'd like to transform as described above:

My.data <- expand.grid(Depth=seq(40,220, 20), Temp=seq(-1, 6, 0.5))
predgam <- predict.gam(dxt.gam, My.data, type="response")
pred.data <- data.frame(My.data, predgam)

pred.data has 150 lines and 3 columns.
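
To make the question concrete, this is the kind of reshaping I think I need
(my own sketch; it relies on pred.data still being in the order created by
expand.grid(), with Depth varying fastest):

x.lev <- seq(40, 220, 20)      # the Depth levels
y.lev <- seq(-1, 6, 0.5)       # the Temp levels
z.mat <- matrix(pred.data$predgam, nrow=length(x.lev), ncol=length(y.lev))
persp(x.lev, y.lev, z.mat, xlab="Depth", ylab="Temp", zlab="prediction")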

Thanks for your help,

Denis Chabot



[R] plotting lines that break if data break

2006-02-07 Thread Denis Chabot
Hi,

Sometimes data series (not necessarily time series) suffer breaks where data
were expected but not collected. Often the regular lines command to add such
data to a plot is what I want, but other times I'd like the line to break
where the data series is interrupted, instead of jumping uninterrupted to the
next point in the series. Usually my data file contains a value of x but none
of y, but a break could also appear as an NA value for both x and y.

I have found that I could use the segments command instead of lines, but this
seems to require that I manipulate my data (which may or may not contain
breaks, and the number of breaks can vary if there are any at all).

Is there another command that works like lines but will break the line if
the data series suffers an interruption?
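
To illustrate the kind of manipulation I would rather avoid, here is a sketch
for the case where the break is an absent row rather than an NA:

d    <- data.frame(x=c(1:5, 8:10), y=rnorm(8))   # the rows for x = 6, 7 are absent
step <- 1                                        # expected spacing between x values
gap  <- which(diff(d$x) > step)                  # last row before each interruption
na.rows <- d[gap, ]; na.rows[] <- NA             # one all-NA row per interruption
d2 <- rbind(d, na.rows)[order(c(seq_len(nrow(d)), gap + 0.5)), ]
plot(d2$x, d2$y, type="l")                       # the line now breaks at the gap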

Sincerely,

Denis Chabot



Re: [R] plotting lines that break if data break

2006-02-07 Thread Denis Chabot
I'm sorry, too long without coffee.

Lately I have done numerous plots with superimposed non-linear quantile
regression fits, and those do not break (which is fine), and I wrongly
carried that over to lines in my head.

Denis
On 2006-02-07 at 22:06, Gabor Grothendieck wrote:

 It does break already.  Try this:

 plot(1:10, c(1:5,NA,7:10), type = "l")
 plot(c(1:5,NA,7:10), 1:10, type = "l")





[R] how to extract predicted values from a quantreg fit?

2006-02-04 Thread Denis Chabot
Hi,

I have used package quantreg to estimate a non-linear fit to the  
lowest part of my data points. It works great, by the way.

But I'd like to extract the predicted values. The help for  
predict.qss1 indicates this:

predict.qss1(object, newdata, ...)
and states that newdata is a data frame describing the observations  
at which prediction is to be made.

I used the same technique I used to extract predicted values of GAM  
fits with mgcv package: if I fit a GAM using x as the x variable on  
which I smooth, I then create such a dataframe like this

MyX <- data.frame(x=seq(-1,60))

This works fine with GAM (mgcv) but not with quantreg:

> y <- rnorm(500, 10, 5)
> x <- rep(seq(1,50,1), 10)
> My.data <- data.frame(x, y)
> My.x <- data.frame(x=seq(5,45))

> fit <- rqss(y ~ qss(x, lambda=5), tau=0.05)
> pred <- predict.qss1(fit, My.x)

Could someone please help me create a newdata dataframe that would satisfy
predict.qss1?

Thanks in advance,

Denis Chabot



[R] how to control ticks in plots with yasp or xasp

2005-10-08 Thread Denis Chabot
Hi,

A few times I tried to control the number and position of tick marks  
in plots with the yasp or xasp parameters. For example, a y axis was  
drawn by default with tick marks at 0, 20, 40, 80 and 100. I tried to  
get tick marks every 10 by adding

yasp=(0, 100, 10)

but this had no effect at all. I know I can draw the axis and tick  
marks manually, but often this simple option would suffice if I could  
understand how to make it work.
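
(For reference, the manual route I alluded to is simply something like this
sketch:

plot(1:100, yaxt="n")          # suppress the default y axis
axis(2, at=seq(0, 100, 10))    # then draw ticks every 10 myself

but I was hoping the simple parameter would save me that step.)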

Thanks in advance,

Denis Chabot



Re: [R] how to control ticks in plots with yasp or xasp

2005-10-08 Thread Denis Chabot
Hi, sorry about the bad syntax, though the right syntax would not have worked
either, according to your tests (Marc, Brian, Peter).

Anyway it is too finicky, so I will draw the ticks myself. For instance,

plot(1:100, xaxt="n")
par(xaxp=c(0, 100, 10))  # the value is reset at each plot
axis(1)

This placed tick marks at 0, 10, ..., 100, as expected, but did not place a
label under 100...

Thanks for the answers, useful as always,

Denis


On 2005-10-08 at 11:58, Marc Schwartz wrote:

 On Sat, 2005-10-08 at 16:37 +0100, Prof Brian Ripley wrote:

 On Sat, 8 Oct 2005, Marc Schwartz wrote:


 On Sat, 2005-10-08 at 09:28 -0400, Denis Chabot wrote:

 Hi,

 A few times I tried to control the number and position of tick  
 marks
 in plots with the yasp or xasp parameters. For example, a y axis  
 was
 drawn by default with tick marks at 0, 20, 40, 80 and 100. I  
 tried to
 get tick marks every 10 by adding

 yasp=(0, 100, 10)

 but this had no effect at all. I know I can draw the axis and tick
 marks manually, but often this simple option would suffice if I  
 could
 understand how to make it work.

 Thanks in advance,

 Denis Chabot



  I suspect that one problem you are having is that there is no
  par(xasp) or par(yasp) ... unless these are typos and you are trying
  to use par(xaxp) and par(yaxp)?


 In any case, (0, 100, 10) is invalid syntax, and c(0, 100, 10) is  
 needed.


 Indeed.


 There is an 'asp' argument to some of the plot functions (ie.
 plot.default), but this has a different intention.

 par(xaxp) and par(yaxp) are not listed as read only pars in ? 
 par,
 however, I cannot recall an instance where R does not overwrite  
 the user
 settings during the calculation of the axes, whether passed as  
 arguments
 to a plot function or set a priori via a par() call.


 Really?  Try


  plot(1:100, xaxt="n")
  par(xaxp=c(0, 50, 5))  # the value is reset at each plot
  axis(1)


 for how it works (but not inline, which is probably a bug).


  Ah... I had not thought about that 'par'ticular combination...  ;-)

 Hence, not R.O.  So it must be used _after_ a plot call, which makes
 sense.

 I had reached for my copy of Paul's book and on page 96 (last  
 paragraph
 in section 3.4.5 on Axes), he suggests using the approach I elucidate
 below with axTicks(). I thought he might have some other ideas and  
 that
 I was missing something. This is also referenced on page 70, third
 paragraph in section 3.2.5 on Axes.


  If you want explicit control over the tick marks, you will need to use
  axis(), perhaps in combination with axTicks(), after using 'xaxt = "n"'
  and/or 'yaxt = "n"' in the plot call, depending upon the circumstances.


 That is usually as easy.


 Agreed.

 Thanks,

 Marc






Re: [R] how to control ticks in plots with yasp or xasp

2005-10-08 Thread Denis Chabot
Hi Brian,
On 2005-10-08 at 13:21, Prof Brian Ripley wrote:

 On Sat, 8 Oct 2005, Denis Chabot wrote:


 Hi, sorry about the bad syntax, though the right syntax would not  
 have worked either, according to your tests (Mark, Brian, Peter).


 It DOES work according to my tests!  (Do give us the credit for  
 testing our advice:  we would appreciate your showing equal care.)

I am very sorry, I did not express myself properly. I do appreciate your
help, and I did take the time to test your solution. What I meant was that
even if I had used the proper syntax, i.e. yaxp=c(10,100,10) instead of
yasp=(10,100,10) [actually in my program I had used it, but I was careless
when I composed the message], it would not have worked, because you
demonstrated that it has to be used within a par() statement, which I had not
done. Therefore my attempt was doomed. I did not mean there was no way to
make yaxp work, so I apologize if you think I take your advice lightly; I
cannot state strongly enough that this is not the case at all.


 Anyway it is too finicky, I will draw them myself. For instance,

  plot(1:100, xaxt="n")
  par(xaxp=c(0, 100, 10))  # the value is reset at each plot
  axis(1)

 Placed tick marks at intervals at 0, 10, ..., 100, as expected,  
 but did not place a label under 100...


 Does for me (2.2.0, Windows and X11 on Linux).  Might the text be  
 clipped on your unnamed device, so you need to set xpd?  (If so, it  
 will be clipped however you try to do this, and is an unrelated  
 local problem.)


This is strange.

I work on Mac OS X, so the default device is quartz. The label 100 is shown
without clipping if I plot without changing xaxp. But just in case, I made
more space at the right of the graph with

par(mai=c(0.7, 0.7, 0.5, 0.5))
plot(1:100, xaxt="n")
par(xaxp=c(0, 100, 10))  # the value is reset at each plot
axis(1)

and the 100 does not appear. If I run this again:

plot(1:100, xaxt="n")
axis(1)

I'm back with R's default axis and the 100 is present. Would anyone on a Mac
like to try?

Denis


 -- 
 Brian D. Ripley,  [EMAIL PROTECTED]
 Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford, Tel:  +44 1865 272861 (self)
 1 South Parks Road, +44 1865 272866 (PA)
 Oxford OX1 3TG, UKFax:  +44 1865 272595




[R] testing non-linear component in mgcv:gam

2005-10-05 Thread Denis Chabot
Hi,

I need further help with my GAMs. Most models I test are very  
obviously non-linear. Yet, to be on the safe side, I report the  
significance of the smooth (default output of mgcv's summary.gam) and  
confirm it deviates significantly from linearity.

I do the latter by fitting a second model where the same predictor is  
entered without the s(), and then use anova.gam to compare the two. I  
thought this was the equivalent of the default output of anova.gam  
using package gam instead of mgcv.

I wonder if this procedure is correct because one of my models  
appears to be linear. In fact mgcv estimates df to be exactly 1.0 so  
I could have stopped there. However I inadvertently repeated the  
procedure outlined above. I would have thought in this case the  
anova.gam comparing the smooth and the linear fit would for sure have  
been not significant. To my surprise, P was 6.18e-09!

Am I doing something wrong when I attempt to confirm that the non-parametric
part of a smoother is significant? Here is my example case where the
relationship does appear to be linear:

library(mgcv)
 This is mgcv 1.3-7
Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12,
          1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88,
          4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88,
          7.12, 8.38, 13.62)
N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31,
            22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1,
            1, 1, 1)
wm.sed <- c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949,
            0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505,
            0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469,
            0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837,
            0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0,
            0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0,
            0.0)

sed.gam <- gam(wm.sed ~ s(Temp), weight=N.sets)
summary.gam(sed.gam)
 Family: gaussian
 Link function: identity

 Formula:
 wm.sed ~ s(Temp)

 Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
 (Intercept)  0.08403    0.01347   6.241 3.73e-07 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Approximate significance of smooth terms:
              edf Est.rank     F  p-value
 s(Temp)        1        1 13.95 0.000666 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 R-sq.(adj) =  0.554   Deviance explained = 28.5%
 GCV score = 0.09904   Scale est. = 0.093686  n = 37

# testing non-linear contribution
sed.lin <- gam(wm.sed ~ Temp, weight=N.sets)
summary.gam(sed.lin)
 Family: gaussian
 Link function: identity

 Formula:
 wm.sed ~ Temp

 Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)
 (Intercept)  0.162879   0.019847   8.207 1.14e-09 ***
 Temp        -0.023792   0.006369  -3.736 0.000666 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


 R-sq.(adj) =  0.554   Deviance explained = 28.5%
 GCV score = 0.09904   Scale est. = 0.093686  n = 37
anova.gam(sed.lin, sed.gam, test="F")
 Analysis of Deviance Table

 Model 1: wm.sed ~ Temp
 Model 2: wm.sed ~ s(Temp)
    Resid. Df Resid. Dev         Df  Deviance      F    Pr(>F)
 1 3.5000e+01  3.279
 2 3.5000e+01  3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 ***


Thanks in advance,


Denis Chabot



Re: [R] testing non-linear component in mgcv:gam

2005-10-05 Thread Denis Chabot
Fair enough, Andy. I thought I was getting both predictive ability and
confirmation that the phenomenon I was studying was not linear. I have two
projects: in one, prediction is the goal and I don't really need to test
linearity; in the second, I needed to confirm that a cycle was taking place,
and I thought my procedure did this. There was no theoretical reason to
anticipate the shape of the cycle, so GAM was an appealing methodology.

On 2005-10-05 at 09:38, Liaw, Andy wrote:

 I think you probably should state more clearly the goal of your
 analysis.  In such situation, estimation and hypothesis testing are
 quite different.  The procedure that gives you the `best' estimate is
 not necessarily the `best' for testing linearity of components.  If  
 your
 goal is estimation/prediction, why test linearity of components?   
 If the
 goal is testing linearity, then you probably need to find the  
 procedure
 that gives you a good test, rather than relying on what gam() gives  
 you.

 Just my $0.02...

 Andy



Re: [R] testing non-linear component in mgcv:gam

2005-10-05 Thread Denis Chabot
Hi John,

On 2005-10-05 at 09:45, John Fox wrote:

 Dear Denis,

 Take a closer look at the anova table: The models provide identical  
 fits to
 the data. The differences in degrees of freedom and deviance  
 between the two
 models are essentially zero, 5.5554e-10 and 2.353e-11 respectively.

 I hope this helps,
  John
This is one of my difficulties. In some examples I found on the web, the
difference in deviance is compared directly against the chi-squared
distribution. But my y variable has a very small range (between 0 and 0.5
most of the time), so the difference in deviance is always very small, and if
I compared it against the chi-squared distribution as I have seen done in
those examples, the non-linear component would always come out
non-significant. Yet it is significant (with one exception) when tested with
both mgcv:gam and gam:gam. I think the examples I read were wrong in this
regard; the scale factor seen in the mgcv output seems to intervene. But
exactly how is still mysterious to me, and I hesitate to judge the size of
the deviance difference myself.
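
What I understand so far amounts to something like this sketch (my own guess,
so please correct me): because the fit is gaussian, the deviance is a
residual sum of squares, so the change in deviance should be divided by the
scale estimate before being referred to a chi-squared (or F) distribution:

dev.diff <- sed.lin$deviance - sed.gam$deviance
df.diff  <- sum(sed.gam$edf) - sum(sed.lin$edf)   # change in effective df
pchisq(dev.diff / sed.gam$sig2, df.diff, lower.tail = FALSE)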

I agree it is near zero in my example. I guess I need to have more  
experience with these models to better interpret the output...

Denis


 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot
 Sent: Wednesday, October 05, 2005 8:22 AM
 To: r-help@stat.math.ethz.ch
 Subject: [R] testing non-linear component in mgcv:gam

 Hi,

 I need further help with my GAMs. Most models I test are very
 obviously non-linear. Yet, to be on the safe side, I report
 the significance of the smooth (default output of mgcv's
 summary.gam) and confirm it deviates significantly from linearity.

 I do the latter by fitting a second model where the same
 predictor is entered without the s(), and then use anova.gam
 to compare the two. I thought this was the equivalent of the
 default output of anova.gam using package gam instead of mgcv.

 I wonder if this procedure is correct because one of my
 models appears to be linear. In fact mgcv estimates df to be
 exactly 1.0 so I could have stopped there. However I
 inadvertently repeated the procedure outlined above. I would
 have thought in this case the anova.gam comparing the smooth
 and the linear fit would for sure have been not significant.
 To my surprise, P was 6.18e-09!

 Am I doing something wrong when I attempt to confirm the non-
 parametric part a smoother is significant? Here is my example
 case where the relationship does appear to be linear:

 library(mgcv)

 This is mgcv 1.3-7

 Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12,
           0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12,
           2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12,
           4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12,
           6.38, 6.62, 6.88, 7.12, 8.38, 13.62)
 N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27,
             29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26,
             24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1)
 wm.sed <- c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596,
             0.189082949, 0.054757925, 0.142810440, 0.168005168,
             0.180804428, 0.111439628, 0.128799505, 0.193707937,
             0.105921610, 0.103497845, 0.028591837, 0.217894389,
             0.020535469, 0.080389068, 0.105234450, 0.070213450,
             0.050771363, 0.042074434, 0.102348837, 0.049748344,
             0.019100478, 0.005203125, 0.101711864, 0.0, 0.0,
             0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0)

 sed.gam <- gam(wm.sed ~ s(Temp), weight = N.sets)
 summary.gam(sed.gam)

 Family: gaussian
 Link function: identity

 Formula:
 wm.sed ~ s(Temp)

 Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)
 (Intercept)   0.08403    0.01347   6.241 3.73e-07 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Approximate significance of smooth terms:
          edf Est.rank     F  p-value
 s(Temp)    1        1 13.95 0.000666 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 R-sq.(adj) =  0.554   Deviance explained = 28.5%
 GCV score = 0.09904   Scale est. = 0.093686  n = 37


 # testing non-linear contribution
 sed.lin <- gam(wm.sed ~ Temp, weight = N.sets)
 summary.gam(sed.lin)

 Family: gaussian
 Link function: identity

 Formula:
 wm.sed ~ Temp

 Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)
 (Intercept)  0.162879   0.019847   8.207 1.14e-09 ***
 Temp        -0.023792   0.006369  -3.736 0.000666 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 R-sq.(adj) =  0.554   Deviance explained = 28.5%
 GCV score = 0.09904   Scale est. = 0.093686  n = 37

 anova.gam(sed.lin, sed.gam, test="F")

 Analysis of Deviance Table

 Model 1: wm.sed ~ Temp
 Model 2: wm.sed ~ s(Temp)
    Resid. Df Resid. Dev         Df  Deviance F Pr(>F)
  1 3.5000e+01      3.279
  2 3.5000e+01      3.279 5.5554e-10 2.353e-11

Re: [R] testing non-linear component in mgcv:gam

2005-10-05 Thread Denis Chabot
Thank you everyone for your help, but my introduction to GAM is  
turning my brain to mush. I thought the one part of the output I  
understood the best was r-sq (adj), but now even this is becoming foggy.

In my original message I mentioned a gam fit that turns out to be a 
linear fit. Out of curiosity I analysed it with a linear predictor only, 
using the mgcv package, and then as a linear model with lm. The parameter 
estimates were identical in both, but the r-sq (adj) was 0.55 in mgcv and 
0.26 in lm. For lm I hope my interpretation is valid, i.e. that 26% of the 
variance in y is explained by the linear relationship with x. Then what 
does the r-squared mean in mgcv?

Denis
  summary.gam(lin)

Family: gaussian
Link function: identity

Formula:
wm.sed ~ Temp

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.162879   0.019847   8.207 1.14e-09 ***
Temp        -0.023792   0.006369  -3.736 0.000666 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


R-sq.(adj) =  0.554   Deviance explained = 28.5%
GCV score = 0.09904   Scale est. = 0.093686  n = 37


  summary(sed.true.lin)

Call:
lm(formula = wm.sed ~ Temp, weights = N.sets)

Residuals:
 Min  1Q  Median  3Q Max
-0.6138 -0.1312 -0.0325  0.1089  1.1449

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.162879   0.019847   8.207 1.14e-09 ***
Temp        -0.023792   0.006369  -3.736 0.000666 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3061 on 35 degrees of freedom
Multiple R-Squared: 0.285,     Adjusted R-squared: 0.2646
F-statistic: 13.95 on 1 and 35 DF,  p-value: 0.000666


On 05-10-05, at 09:45, John Fox wrote:

 Dear Denis,

 Take a closer look at the anova table: The models provide identical  
 fits to
 the data. The differences in degrees of freedom and deviance  
 between the two
 models are essentially zero, 5.5554e-10 and 2.353e-11 respectively.

 I hope this helps,
  John

 
 John Fox
 Department of Sociology
 McMaster University
 Hamilton, Ontario
 Canada L8S 4M4
 905-525-9140x23604
 http://socserv.mcmaster.ca/jfox
 



Re: [R] p-level in packages mgcv and gam

2005-10-02 Thread Denis Chabot
Thank you very much Henric, now I see!

Denis
On 05-10-02, at 11:47, Henric Nilsson wrote:

 This test concerns only the non-linear part of the term s(Number,  
 3). In order to simultaneously test both the linear and non-linear  
 part, as mgcv::summary.gam does, you'd

  kyp1.1 <- gam(Kyphosis ~ 1, family=binomial, data=kyphosis)
  anova(kyp1.1, kyp1, test = "Chisq")
 Analysis of Deviance Table

 Model 1: Kyphosis ~ 1
 Model 2: Kyphosis ~ s(Number, 3)
    Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
  1   80.0000     83.234
  2   76.9999     71.997 3.0001   11.237     0.011


 HTH,
 Henric



__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] p-level in packages mgcv and gam

2005-09-29 Thread Denis Chabot
   75.44   <2e-16 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Approximate significance of smooth terms:
         edf Est.rank      F  p-value
 s(x0) 5.173    9.000  3.785 0.000137 ***
 s(x1) 2.357    9.000 34.631   <2e-16 ***
 s(x2) 8.517    9.000 84.694   <2e-16 ***
 s(x3) 1.000    1.000  0.444 0.505797
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 R-sq.(adj) =  0.726   Deviance explained = 73.7%
 GCV score =  4.611   Scale est. = 4.4029    n = 400


 If we assume that the scale is known and fixed at 4.4029, we get

  summary(b, dispersion = 4.4029)

 Family: gaussian
 Link function: identity

 Formula:
 y ~ s(x0) + s(x1) + s(x2) + s(x3)

 Parametric coefficients:
              Estimate Std. Error z value Pr(>|z|)
 (Intercept)    7.9150     0.1049   75.44   <2e-16 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Approximate significance of smooth terms:
         edf Est.rank  Chi.sq p-value
 s(x0) 5.173    9.000  34.067 8.7e-05 ***
 s(x1) 2.357    9.000 311.679  <2e-16 ***
 s(x2) 8.517    9.000 762.255  <2e-16 ***
 s(x3) 1.000    1.000   0.444   0.505
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 R-sq.(adj) =  0.726   Deviance explained = 73.7%
 GCV score =  4.611   Scale est. = 4.4029    n = 400

 Note that t/F changed into z/Chi.sq.


 So as far as I'm concerned, I use the GCV method and fix a 5% threshold
 on the null hypothesis (p-value) to select significant predictors.
 Afterwards, I look at my smooth, and if the parametrization looks fine
 to me, I validate. Generally, for p-values smaller than 0.001 you can be
 confident; above 0.001, you have to check. 2)
 For the difference between packages gam and mgcv, I sent a mail about this


 The underlying algorithms are very different.


 HTH,
 Henric


 From: Liaw, Andy [EMAIL PROTECTED]
 Date: 28 September 2005 14:01:25 EDT
 To: 'Denis Chabot' [EMAIL PROTECTED], Peter Dalgaard
 [EMAIL PROTECTED]
 Cc: Thomas Lumley [EMAIL PROTECTED], R list r-
 [EMAIL PROTECTED]
 Subject: RE: [R] p-level in packages mgcv and gam

 Just change the df in what Thomas described to the degree of  
 polynomial, and
 everything he said still applies.  Any good book on regression that  
 covers
 polynomial regression ought to point this out.

 Andy


 From: Denis Chabot

 But what about another analogy, that of polynomials? You may not be
 sure what degree polynomial to use, and you have not decided before
 analysing your data. You fit different polynomials to your data,
 checking if added degrees increase r2 sufficiently by doing F-tests.

 I thought it was the same thing with GAMs. You can fit a
 model with 4
 df, and in some cases it is of interest to see if this is a better
 fit than a linear fit. But why can't you also check if 7df is better
 than 4df? And if you used mgcv first and it tells you that 7df is
 better than 4df, why bother repeating the comparison 7df
 against 4df,
 why not just take the p-value for the model with 7df (fixed)?

 Denis



 From: Peter Dalgaard [EMAIL PROTECTED]
 Date: 28 September 2005 12:04:58 EDT
 To: Thomas Lumley [EMAIL PROTECTED]
 Cc: Denis Chabot [EMAIL PROTECTED], R list r-
 [EMAIL PROTECTED]
 Subject: Re: [R] p-level in packages mgcv and gam

 Thomas Lumley [EMAIL PROTECTED] writes:

 Bob, on the other hand, chooses the amount of smoothing depending on
 the data. When a 4 df smooth fits best he ends up with the same model
 as Alice and the same p-value.  When some other df fits best he ends
 up with a different model and a *smaller* p-value than Alice.

 This doesn't actually follow, unless the p-value (directly or
 indirectly) found its way into the definition of best fit. It does
 show the danger, though.

 -- 
O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45)  
 35327918
 ~~ - ([EMAIL PROTECTED])  FAX: (+45)  
 35327907


 From: Thomas Lumley [EMAIL PROTECTED]
 Date: 26 September 2005 12:54:43 EDT
 To: Denis Chabot [EMAIL PROTECTED]
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] p-level in packages mgcv and gam

 On Mon, 26 Sep 2005, Denis Chabot wrote:

 But the mgcv manual warns that p-level for the smooth can be
 underestimated when df are estimated by the model. Most of the time
 my p-levels are so small that even doubling them would not result in
 a value close to the P=0.05 threshold, but I have one case with  
 P=0.033.

 I thought, probably naively, that running a second model with fixed
 df, using the value of df found in the first model. I could not
 achieve this with mgcv: its gam function does not seem to accept
 fractional values of df (in my case 8.377).

 No, this won't work.  The problem is the usual one with model  
 selection: the p-value is calculated as if the df had been fixed,  
 when really it was estimated.

 It is likely to be quite hard to get an honest p-value out of
 something that does adaptive smoothing.

Re: [R] p-level in packages mgcv and gam

2005-09-28 Thread Denis Chabot
I only got one reply to my message:

 No, this won't work.  The problem is the usual one with model  
 selection: the p-value is calculated as if the df had been fixed,  
 when really it was estimated.

 It is likely to be quite hard to get an honest p-value out of  
 something that does adaptive smoothing.

 -thomas

I do not understand this: it seems that a lot of people choose df=4 for 
no particular reason, yet their p-levels are correct. But if instead I 
choose df=8 because a previous model estimated this to be the optimal df, 
the P-levels are no good because the df were estimated?

Furthermore, shouldn't packages gam and mgcv give similar results  
when the same data and df are used? I tried this:

library(gam)
data(kyphosis)
kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
anova.gam(kyp1)
anova.gam(kyp2)
anova.gam(kyp3)

detach(package:gam)
library(mgcv)
kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T), family=binomial,
data=kyphosis)
kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,
data=kyphosis)
kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T), family=binomial,
data=kyphosis)
anova.gam(kyp4)
anova.gam(kyp5)
anova.gam(kyp6)


P levels for these models, by pair

kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad)
kyp2 vs kyp5: p= 0.445 and 0.03 (wow!)
kyp3 vs kyp6: p= 0.053 and 0.008 (wow again)

Also, if you plot all of these you find that the mgcv plots are smoother 
than the gam plots, even though the same df are used in every case.

I am really confused now!

Denis

Begin forwarded message:

 From: Denis Chabot [EMAIL PROTECTED]
 Date: 26 September 2005 12:25:04 EDT
 To: r-help@stat.math.ethz.ch
 Subject: p-level in packages mgcv and gam


 Hi,

 I am fairly new to GAM and started using package mgcv. I like the  
 fact that optimal smoothing is automatically used (i.e. df are not  
 determined a priori but calculated by the gam procedure).

 But the mgcv manual warns that p-level for the smooth can be  
 underestimated when df are estimated by the model. Most of the  
 time my p-levels are so small that even doubling them would not  
 result in a value close to the P=0.05 threshold, but I have one  
 case with P=0.033.

 I thought, probably naively, that running a second model with  
 fixed df, using the value of df found in the first model. I could  
 not achieve this with mgcv: its gam function does not seem to  
 accept fractional values of df (in my case 8.377).

 So I used the gam package and fixed df to 8.377. The P-value I  
 obtained was slightly larger than with mgcv (0.03655 instead of  
 0.03328), but it is still < 0.05.

 Was this a correct way to get around the underestimated P-level?

 Furthermore, although the gam.check function of the mgcv package  
 suggests to me that the gaussian family (and identity link) are  
 adequate for my data, I must say the instructions in R help for  
 family and in Hastie, T. and Tibshirani, R. (1990) Generalized  
 Additive Models are too technical for me. If someone knows a  
 reference that explains how to choose model and link, i.e. what  
 tests to run on your data before choosing, I would really  
 appreciate it.

 Thanks in advance,

 Denis Chabot



__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] p-level in packages mgcv and gam

2005-09-28 Thread Denis Chabot
Hi Yves,
On 05-09-28, at 11:05, Yves Magliulo wrote:

 hi,

 I'll try to help you. I sent a mail about this subject last week...
 and I did not get any response...
Sorry, I did not see your message last week.

 I'm using gam from package mgcv.

 1)
 How to interpret the significance of smooth terms is hard for me to
 understand perfectly:
 using UBRE, you fix the df and p-values are estimated from a chi-squared
 distribution;
 using GCV, the best df are estimated by gam (that's what I want) and
 p-values are estimated from an F distribution. But in that case
 ?summary.gam says to use them at your own risk.

 So you can also look at the chi-squared, but I don't know how to choose
 a criterion for it the way you can for p-values... For me, the chi-squared
 shows the best predictor in a model, but it's hard to reject one with it.

 So as far as I'm concerned, I use the GCV method and fix a 5% threshold
 on the null hypothesis (p-value) to select significant predictors.
 Afterwards, I look at my smooth, and if the parametrization looks fine
 to me, I validate.

 Generally, for p-values smaller than 0.001 you can be confident; above
 0.001, you have to check.

I think I follow you, but how do you validate? My fit goes very  
nicely in the middle of the data points and appears fine. In most  
cases p is way smaller than 0.001. I have one case that is bimodal in  
shape and more noisy, and p is only 0.03. How do I validate it, how  
do I check?

 2)
 for difference between package gam and mgcv, i sent a mail about  
 this one year ago, here's the response :

 
 - package gam is based very closely on the GAM approach presented in
 Hastie and Tibshirani's  Generalized Additive Models book.  
 Estimation is
 by back-fitting and model selection is based on step-wise regression
 methods based on approximate distributional results. A particular  
 strength
 of this approach is that local regression smoothers (`lo()' terms)  
 can be
 included in GAM models.

 - gam in package mgcv represents GAMs using penalized regression  
 splines.
 Estimation is by direct penalized likelihood maximization with
 integrated smoothness estimation via GCV or related criteria (there is
 also an alternative `gamm' function based on a mixed model approach).
 Strengths of this approach are that s() terms can be functions of more
 of more
 than one variable and that tensor product smooths are available via  
 te()
 terms - these are useful when different degrees of smoothness are
 appropriate relative to different arguments of a smooth.

 (...)

 Basically, if you want integrated smoothness selection, an underlying
 parametric representation, or want smooth interactions in your models
 then mgcv is probably worth a try (but I would say that). If you  
 want to
 use local regression smoothers and/or prefer the stepwise selection
 approach then package gam is for you.
 

It is hard to evaluate the explanations based on the algorithm used  
to fit the data, but it seems to me that the answer, in terms of  
significance of the smooth, should be at least very similar.  
Otherwise, what do you do when an author cites one package? You  
wonder if the fit would have been significant using the other package?
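
To make the contrast concrete, here is a minimal sketch of the two 
interfaces described in the explanation above (the formulas and the span 
value are only illustrative, not taken from the thread):

library(gam)
data(kyphosis)
# package gam: backfitting, user-chosen df, local regression via lo()
fit.gam <- gam(Kyphosis ~ lo(Age, span = 0.5), family = binomial,
               data = kyphosis)

detach(package:gam)
library(mgcv)
# package mgcv: penalized regression splines, df chosen by GCV;
# smooths of several variables are possible with s(x, z) or te(x, z)
fit.mgcv <- gam(Kyphosis ~ s(Age) + s(Number), family = binomial,
                data = kyphosis)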

 I think the difference in p-values between gam and mgcv is because you
 don't have the same number of iteration steps: mgcv's gam chooses the
 number of steps, while with gam's gam you have to choose it...

 hope it helps and someone gives us more details...

 Yves
Again, I can see that the p-values could differ a bit considering the  
differences between the 2 packages. But when the differences are huge  
and result in contradictory conclusions, I have a problem. Like you I  
hope more help is forthcoming.

Denis

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] p-level in packages mgcv and gam

2005-09-28 Thread Denis Chabot
But what about another analogy, that of polynomials? You may not be  
sure what degree polynomial to use, and you have not decided before  
analysing your data. You fit different polynomials to your data,  
checking if added degrees increase r2 sufficiently by doing F-tests.

I thought it was the same thing with GAMs. You can fit a model with 4  
df, and in some cases it is of interest to see if this is a better  
fit than a linear fit. But why can't you also check if 7df is better  
than 4df? And if you used mgcv first and it tells you that 7df is  
better than 4df, why bother repeating the comparison 7df against 4df,  
why not just take the p-value for the model with 7df (fixed)?
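
The polynomial analogy can be written down directly (simulated data, 
purely illustrative): nested polynomial fits compared with F-tests through 
anova().

set.seed(42)
x <- runif(50, 0, 10)
y <- sin(x) + rnorm(50, sd = 0.3)
fit1 <- lm(y ~ x)            # linear
fit2 <- lm(y ~ poly(x, 2))   # quadratic
fit3 <- lm(y ~ poly(x, 4))   # quartic
anova(fit1, fit2, fit3)      # F-test for each added set of terms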

Denis

On 05-09-28, at 12:04, Peter Dalgaard wrote:

 Thomas Lumley [EMAIL PROTECTED] writes:


 Bob, on the other hand, chooses the amount of smoothing depending on
 the data. When a 4 df smooth fits best he ends up with the same model
 as Alice and the same p-value.  When some other df fits best he ends
 up with a different model and a *smaller* p-value than Alice.


 This doesn't actually follow, unless the p-value (directly or
 indirectly) found its way into the definition of best fit. It does
 show the danger, though.

 -- 
O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45)  
 35327918
 ~~ - ([EMAIL PROTECTED])  FAX: (+45)  
 35327907


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] p-level in packages mgcv and gam

2005-09-26 Thread Denis Chabot
Hi,

I am fairly new to GAM and started using package mgcv. I like the  
fact that optimal smoothing is automatically used (i.e. df are not  
determined a priori but calculated by the gam procedure).

But the mgcv manual warns that p-level for the smooth can be  
underestimated when df are estimated by the model. Most of the time  
my p-levels are so small that even doubling them would not result in  
a value close to the P=0.05 threshold, but I have one case with P=0.033.

I thought, probably naively, that I could run a second model with fixed 
df, using the value of df found in the first model. I could not 
achieve this with mgcv: its gam function does not seem to accept 
fractional values of df (in my case 8.377).

So I used the gam package and fixed df to 8.377. The P-value I  
obtained was slightly larger than with mgcv (0.03655 instead of  
0.03328), but it is still < 0.05.

Was this a correct way to get around the underestimated P-level?

Furthermore, although the gam.check function of the mgcv package  
suggests to me that the gaussian family (and identity link) are  
adequate for my data, I must say the instructions in R help for  
family and in Hastie, T. and Tibshirani, R. (1990) Generalized  
Additive Models are too technical for me. If someone knows a  
reference that explains how to choose model and link, i.e. what tests  
to run on your data before choosing, I would really appreciate it.
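
For readers wondering what that check looks like, here is a minimal sketch 
with made-up data (not the data discussed in this thread):

library(mgcv)
set.seed(2)
x <- runif(200)
y <- x^2 + rnorm(200, sd = 0.1)
fit <- gam(y ~ s(x))   # gaussian family and identity link by default
gam.check(fit)         # residual QQ-plot, residuals vs fitted values, etc.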

Thanks in advance,

Denis Chabot

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] inconsistent decimal marker with write.table

2005-09-13 Thread Denis Chabot
Hi,

My problem does not happen all the time, nor with all files I save to  
csv format. I can send my test file (format rda or csv) to whoever  
would like to replicate (and hopefully explain) my problem.

In short, I have a dataset with mostly numerical variables. One of my 
variables is called pfi2 and is definitely numerical, as shown by this:

  summary(test$pfi2)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.00000 0.00000 0.00000 0.01208 0.01500 0.08900

Yet, when I export to a text file and ask for a comma as the decimal 
separator, I get a period as the separator for this variable, whereas  
all the other variables have the desired comma as separator.

write.table(test, "out1.csv", sep = ";", dec = ",", row.names = F, 
col.names = TRUE)
write.csv2(test, "out2.csv")

First 5 lines of out1.csv:

trait;tfi;sto_wt;pfi2;pfi_st;focc;sed;mob;date;latitude;longitud;hre_deb;temp_fon;strate;lat.km;long.km;temp_.25C
10;0,764;5,1;0.007;0,213;0,143;0,048;0,095;05/08/99;47,49;-58,9267;20:11;4,9;302;-167,7912;230,761439786593;4,88
106;0,566;3,3;0.089;4,762;0,25;0;0,25;14/08/99;49,84;-59,655;23:25;4,4;814;93,34084;168,052052432345;4,38
110;1,517;8,3;0.003;0,172;0,25;0;0;15/08/99;49,88667;-60,2167;04:57;0,3;833;98,526770403;127,674990442535;0,38
111;1,232;7,5;0.016;1,309;0,25;0;0;15/08/99;49,84;-60,1617;06:14;0,1;833;93,34084;131,739909589075;0,12

I'm sorry, I wanted to make this lighter by removing more variables from 
my original dataset, but then the problem disappeared, so I had to keep 
these variables to show you that the problem SOMETIMES happens. 
You'll notice that the 4th variable has periods for decimal markers.  
The write.csv2 command produced the same problem.

With other files I sometimes get commas for all variables, sometimes  
I get more than one variable with periods. It is frustrating.

So let me know if you'd like the data file,

Sincerely,

Denis Chabot
platform powerpc-apple-darwin7.9.0
arch powerpc
os   darwin7.9.0
system   powerpc, darwin7.9.0
status
major    2
minor    1.1
year     2005
month    06
day  20
language R

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] grep help needed

2005-07-26 Thread Denis Chabot
Thanks for your help, the proposed solutions were much more elegant  
than what I was attempting. I adopted a slight modification of Tom  
Mulholland's solution with a piece from John Fox's solution, but many  
of you had very similar solutions.

require(maptools)
nc <- read.shape(system.file("shapes/sids.shp", package = "maptools")[1])
mappolys <- Map2poly(nc, as.character(nc$att.data$FIPSNO))
selected.shapes <- which(nc$att.data$SID74 > 20)
# just to make it a smaller example
submap <- subset(mappolys, nc$att.data$SID74 > 20)

final.data <- NULL
for (j in 1:length(selected.shapes)){
  temp.verts <- matrix(as.vector(submap[[j]]), ncol = 2)
  n <- length(temp.verts[,1])
  temp.order <- 1:n
  temp.data <- cbind(rep(j,n), temp.order, temp.verts)
  final.data <- rbind(final.data, temp.data)
}
colnames(final.data) <- c("PID", "POS", "X", "Y")
final.data
my.data <- as.data.frame(final.data)
class(my.data) <- c("PolySet", "data.frame")
attr(my.data, "projection") <- "LL"

meta <- nc[2]$att.data[selected.shapes,]
PID <- seq(1, length(submap))
meta.data <- cbind(PID, meta)
class(meta.data) <- c("PolyData", "data.frame")
attr(meta.data, "projection") <- "LL"

It would be nice if a variant of this was incorporated into  
PBSmapping to make it easier to import data from shapefiles!

Thanks again for your help,

Denis Chabot
On 05-07-26, at 00:48, Mulholland, Tom wrote:

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Behalf Of Denis Chabot
 Sent: Tuesday, 26 July 2005 10:46 AM
 To: R list
 Subject: [R] grep help needed


 Hi,

 In another thread (PBSmapping and shapefiles) I asked for an easy
 way to read shapefiles and transform them in data that PBSmapping
 could use. One person is exploring some ways of doing this,
 but it is
 possible I'll have to do this manually.

 With package maptools I am able to extract the information I need
 from a shapefile but it is formatted like this:

 [[1]]
 [,1] [,2]
 [1,] -55.99805 51.68817
 [2,] -56.00222 51.68911
 [3,] -56.01694 51.68911
 [4,] -56.03781 51.68606
 [5,] -56.04639 51.68759
 [6,] -56.04637 51.69445
 [7,] -56.03777 51.70207
 [8,] -56.02301 51.70892
 [9,] -56.01317 51.71578
 [10,] -56.00330 51.73481
 [11,] -55.99805 51.73840
 attr(,pstart)
 attr(,pstart)$from
 [1] 1

 attr(,pstart)$to
 [1] 11

 attr(,nParts)
 [1] 1
 attr(,shpID)
 [1] NA

 [[2]]
[,1] [,2]
 [1,] -57.76294 50.88770
 [2,] -57.76292 50.88693
 [3,] -57.76033 50.88163
 [4,] -57.75668 50.88091
 [5,] -57.75551 50.88169
 [6,] -57.75562 50.88550
 [7,] -57.75932 50.88775
 [8,] -57.76294 50.88770
 attr(,pstart)
 attr(,pstart)$from
 [1] 1

 attr(,pstart)$to
 [1] 8

 attr(,nParts)
 [1] 1
 attr(,shpID)
 [1] NA

 I do not quite understand the structure of this data object (list of
 lists I think)
 but at this point I resorted to printing it on the console and
 imported that text into Excel for further cleaning, which is easy
 enough. I'd like to complete the process within R to save
 time and to
 circumvent Excel's limit of around 64000 lines. But I have a hard
 time figuring out how to clean up this text in R.

 What I need to produce for PBSmapping is a file where each block of
 coordinates shares one ID number, called PID, and a variable POS
 indicates the position of each coordinate within a shape.
 All other
 lines must disappear. So the above would become:

 PID POS X Y
 1 1 -55.99805 51.68817
 1 2 -56.00222 51.68911
 1 3 -56.01694 51.68911
 1 4 -56.03781 51.68606
 1 5 -56.04639 51.68759
 1 6 -56.04637 51.69445
 1 7 -56.03777 51.70207
 1 8 -56.02301 51.70892
 1 9 -56.01317 51.71578
 1 10 -56.00330 51.73481
 1 11 -55.99805 51.73840
 2 1 -57.76294 50.88770
 2 2 -57.76292 50.88693
 2 3 -57.76033 50.88163
 2 4 -57.75668 50.88091
 2 5 -57.75551 50.88169
 2 6 -57.75562 50.88550
 2 7 -57.75932 50.88775
 2 8 -57.76294 50.88770

 First I imported this text file into R:
  test <- read.csv2("test file.txt", header=F, sep=";", colClasses =
  "character")

  I used sep=";" to ensure there would be only one variable in this
  file, as it contains no ";".

 To remove lines that do not contain coordinates, I used the
 fact that
 longitudes are expressed as negative numbers, so with my very
 limited
 knowledge of grep searches, I thought of this, which is probably not
 the best way to go:

  a <- rep("-", length(test$V1))
  b <- grep(a, test$V1)

  this gives me a warning (Warning message:
  the condition has length > 1 and only the first element will be used
  in: if (is.na(pattern)) {
  but seems to do what I need anyway

  c <- seq(1, length(test$V1))
  d <- c %in% b

  e <- test$V1[d]

 Partial victory, now I only have lines that look like
 [1,] -57.76294 50.88770

 But I don't know how to go further: the number in square
 brackets can
 be used for variable POS, after removing the square brackets and the
 comma, but this requires a better knowledge of grep than I have.
 Furthermore, I don't know how to add a PID (polygon ID) variable,
 i.e. all lines of a polygon must have the same ID, as in the example
 above (i.e. each

[R] grep help needed

2005-07-25 Thread Denis Chabot
Hi,

In another thread (PBSmapping and shapefiles) I asked for an easy  
way to read shapefiles and transform them in data that PBSmapping  
could use. One person is exploring some ways of doing this, but it is  
possible I'll have to do this manually.

With package maptools I am able to extract the information I need  
from a shapefile but it is formatted like this:

[[1]]
[,1] [,2]
[1,] -55.99805 51.68817
[2,] -56.00222 51.68911
[3,] -56.01694 51.68911
[4,] -56.03781 51.68606
[5,] -56.04639 51.68759
[6,] -56.04637 51.69445
[7,] -56.03777 51.70207
[8,] -56.02301 51.70892
[9,] -56.01317 51.71578
[10,] -56.00330 51.73481
[11,] -55.99805 51.73840
attr(,pstart)
attr(,pstart)$from
[1] 1

attr(,pstart)$to
[1] 11

attr(,nParts)
[1] 1
attr(,shpID)
[1] NA

[[2]]
   [,1] [,2]
[1,] -57.76294 50.88770
[2,] -57.76292 50.88693
[3,] -57.76033 50.88163
[4,] -57.75668 50.88091
[5,] -57.75551 50.88169
[6,] -57.75562 50.88550
[7,] -57.75932 50.88775
[8,] -57.76294 50.88770
attr(,pstart)
attr(,pstart)$from
[1] 1

attr(,pstart)$to
[1] 8

attr(,nParts)
[1] 1
attr(,shpID)
[1] NA

I do not quite understand the structure of this data object (list of  
lists I think)
but at this point I resorted to printing it on the console and  
imported that text into Excel for further cleaning, which is easy  
enough. I'd like to complete the process within R to save time and to  
circumvent Excel's limit of around 64000 lines. But I have a hard  
time figuring out how to clean up this text in R.

What I need to produce for PBSmapping is a file where each block of  
coordinates shares one ID number, called PID, and a variable POS  
indicates the position of each coordinate within a shape. All other  
lines must disappear. So the above would become:

PID POS X Y
1 1 -55.99805 51.68817
1 2 -56.00222 51.68911
1 3 -56.01694 51.68911
1 4 -56.03781 51.68606
1 5 -56.04639 51.68759
1 6 -56.04637 51.69445
1 7 -56.03777 51.70207
1 8 -56.02301 51.70892
1 9 -56.01317 51.71578
1 10 -56.00330 51.73481
1 11 -55.99805 51.73840
2 1 -57.76294 50.88770
2 2 -57.76292 50.88693
2 3 -57.76033 50.88163
2 4 -57.75668 50.88091
2 5 -57.75551 50.88169
2 6 -57.75562 50.88550
2 7 -57.75932 50.88775
2 8 -57.76294 50.88770

First I imported this text file into R:
test <- read.csv2("test file.txt", header=F, sep=";", colClasses = 
"character")

I used sep=";" to ensure there would be only one variable in this 
file, as it contains no ";".

To remove lines that do not contain coordinates, I used the fact that  
longitudes are expressed as negative numbers, so with my very limited  
knowledge of grep searches, I thought of this, which is probably not  
the best way to go:

a <- rep("-", length(test$V1))
b <- grep(a, test$V1)

this gives me a warning (Warning message:
the condition has length > 1 and only the first element will be used 
in: if (is.na(pattern)) {
but seems to do what I need anyway

c <- seq(1, length(test$V1))
d <- c %in% b

e <- test$V1[d]

Partial victory, now I only have lines that look like
[1,] -57.76294 50.88770

But I don't know how to go further: the number in square brackets can  
be used for variable POS, after removing the square brackets and the  
comma, but this requires a better knowledge of grep than I have.  
Furthermore, I don't know how to add a PID (polygon ID) variable,  
i.e. all lines of a polygon must have the same ID, as in the example  
above (i.e. each time POS == 1, a new polygon starts and PID needs to  
be incremented by 1, and PID is kept constant for lines where POS != 1).
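
One way to do that last step with regular expressions (a sketch using a 
small made-up vector e of such lines, not code from the thread's replies):

e <- c("[1,] -55.99805 51.68817", "[2,] -56.00222 51.68911",
       "[1,] -57.76294 50.88770")
POS <- as.numeric(sub("^ *\\[([0-9]+),\\].*$", "\\1", e))
PID <- cumsum(POS == 1)   # a new polygon starts whenever POS restarts at 1
coords <- do.call(rbind, strsplit(sub("^ *\\[[0-9]+,\\] *", "", e), " +"))
data.frame(PID, POS, X = as.numeric(coords[, 1]), Y = as.numeric(coords[, 2]))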

Any help will be much appreciated.

Sincerely,

Denis Chabot

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] PBSmapping and shapefiles

2005-07-22 Thread Denis Chabot
Hi,

I got no reply to this:
On 16-Jul-05, at 2:42 PM, Denis Chabot wrote:

 Hi,

 Is there a way, preferably with R, to read shapefiles and transform  
 them in a format that I could then use with package PBSmapping?

 I have been able to read such files into R with maptools'  
 read.shape and plot it with plot.Map, but I'd like to bring the  
 data to PBSmapping and plot from there. I also looked at the  
 package shapefile, but it does not seem to do what I want either.

 Sincerely,

 Denis Chabot


but I managed to progress somewhat on my own. Although it does not  
allow one to use shapefiles in PBSmapping, maptools at least makes  
it possible to read such files. In some cases I can extract the  
information I want from not-too-complex shapefiles. For instance, to  
extract all the lines corresponding to 60-m isobath in a shapefile, I  
was able to do:

library(maptools)
test <- read.shape("bathy.shp")
test2 <- Map2lines(test)
bathy60 <- subset(test2, test$att.data$Z == 60)

I do not quite understand the structure of bathy60 (list of lists I  
think)
but at this point I resorted to printing bathy60 on the console and  
imported that text into Excel for further cleaning, which is easy  
enough. I'd like to complete the process within R to save time and to  
circumvent Excel's limit of around 64000 lines. But I have a hard  
time figuring out loops in R, coming from a background of  
observation based programs such as SAS.

The output of bathy60 looks like this:

[[1]]
[,1] [,2]
[1,] -55.99805 51.68817
[2,] -56.00222 51.68911
[3,] -56.01694 51.68911
[4,] -56.03781 51.68606
[5,] -56.04639 51.68759
[6,] -56.04637 51.69445
[7,] -56.03777 51.70207
[8,] -56.02301 51.70892
[9,] -56.01317 51.71578
[10,] -56.00330 51.73481
[11,] -55.99805 51.73840
attr(,pstart)
attr(,pstart)$from
[1] 1

attr(,pstart)$to
[1] 11

attr(,nParts)
[1] 1
attr(,shpID)
[1] NA

[[2]]
   [,1] [,2]
[1,] -57.76294 50.88770
[2,] -57.76292 50.88693
[3,] -57.76033 50.88163
[4,] -57.75668 50.88091
[5,] -57.75551 50.88169
[6,] -57.75562 50.88550
[7,] -57.75932 50.88775
[8,] -57.76294 50.88770
attr(,pstart)
attr(,pstart)$from
[1] 1

attr(,pstart)$to
[1] 8

attr(,nParts)
[1] 1
attr(,shpID)
[1] NA

What I need to produce for PBSmapping is a file where each block of  
coordinates shares one ID number, called PID, and a variable POS  
indicating the number of each coordinate within a shape. All other  
lines must disappear. So the above would become:

PID POS X Y
1 1 -55.99805 51.68817
1 2 -56.00222 51.68911
1 3 -56.01694 51.68911
1 4 -56.03781 51.68606
1 5 -56.04639 51.68759
1 6 -56.04637 51.69445
1 7 -56.03777 51.70207
1 8 -56.02301 51.70892
1 9 -56.01317 51.71578
1 10 -56.00330 51.73481
1 11 -55.99805 51.73840
2 1 -57.76294 50.88770
2 2 -57.76292 50.88693
2 3 -57.76033 50.88163
2 4 -57.75668 50.88091
2 5 -57.75551 50.88169
2 6 -57.75562 50.88550
2 7 -57.75932 50.88775
2 8 -57.76294 50.88770

I don't know how to do this in R. My algorithm would involve looking 
at the structure of a line, discarding it if it does not include 
coordinates, and then creating PID and POS for lines with coordinates, 
depending on the content of lines i and i-1. How would I do this in R?
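
One possible way to do this without going through printed text (a sketch, 
not from the thread; it assumes bathy60 is the list of two-column 
coordinate matrices shown above):

pbs <- do.call(rbind, lapply(1:length(bathy60), function(i) {
  xy <- bathy60[[i]]
  data.frame(PID = i, POS = 1:nrow(xy), X = xy[, 1], Y = xy[, 2])
}))
head(pbs)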

Thanks in advance,

Denis Chabot

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] PBSmapping and shapefiles

2005-07-16 Thread Denis Chabot
Hi,

Is there a way, preferably with R, to read shapefiles and transform  
them in a format that I could then use with package PBSmapping?

I have been able to read such files into R with maptools' read.shape  
and plot it with plot.Map, but I'd like to bring the data to  
PBSmapping and plot from there. I also looked at the package  
shapefile, but it does not seem to do what I want either.

Sincerely,

Denis Chabot

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] how to create a variable to rank within subgroups

2005-06-07 Thread Denis Chabot
Thank you very much Andy, this is exactly what I was looking for. I  
did not know this function.


Sincerely,

Denis Chabot
On 6 June 2005, at 21:21, Liaw, Andy wrote:


Try something like:



g <- gl(4, 5)
x <- sample(20)
d <- data.frame(g, x)
d


   g  x
1  1 10
2  1  3
3  1 11
4  1 12
5  1 20
6  2 13
7  2  6
8  2  2
9  2  1
10 2 14
11 3 17
12 3 15
13 3 16
14 3 19
15 3  9
16 4  4
17 4  7
18 4  5
19 4  8
20 4 18


d$xrank <- ave(d$x, d$g, FUN=rank)
d


   g  x xrank
1  1 10 2
2  1  3 1
3  1 11 3
4  1 12 4
5  1 20 5
6  2 13 4
7  2  6 3
8  2  2 2
9  2  1 1
10 2 14 5
11 3 17 4
12 3 15 2
13 3 16 3
14 3 19 5
15 3  9 1
16 4  4 1
17 4  7 3
18 4  5 2
19 4  8 4
20 4 18 5

Andy



From: Denis Chabot

Hi,

I would like to create a new variable that would be the rank of a
continuous variable within each level of a grouping variable.
Say the
first level of the grouping variable has 5 observations, I'd like
them ranked from one to five and a new variable would hold the rank
value (one to five). So forth for each level of the grouping  
variable.


I'm quite new with R and cannot figure out this one by myself.

Thanks in advance,

Denis Chabot

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html













__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] how to create a variable to rank within subgroups

2005-06-07 Thread Denis Chabot

I sure agree the name is not very helpful in guessing what it can do.

May I suggest propagate?

Denis Chabot
On 7 June 2005, at 06:18, Peter Dalgaard wrote:


Denis Chabot [EMAIL PROTECTED] writes:



Thank you very much Andy, this is exactly what I was looking for. I
did not know this function.



It's a horrible misnomer though (ave() is originally for replacing  
values

with averages, but obviously has other uses). Any suggestions for a
better name (or alias).


--
   O__   Peter Dalgaard Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark  Ph: (+45)  
35327918
~~ - ([EMAIL PROTECTED]) FAX: (+45)  
35327907




__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] how to create a variable to rank within subgroups

2005-06-06 Thread Denis Chabot

Hi,

I would like to create a new variable that would be the rank of a  
continuous variable within each level of a grouping variable. Say the  
first level of the grouping variable has 5 observations, I'd like  
them ranked from one to five and a new variable would hold the rank  
value (one to five). So forth for each level of the grouping variable.


I'm quite new with R and cannot figure out this one by myself.

Thanks in advance,

Denis Chabot

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] New user of R on Mac OS X - Please help

2005-03-13 Thread Denis Chabot
From: Depiereux Constant [EMAIL PROTECTED]
Date: 12 March 2005 09:44:08 GMT-05:00
To: r-help@stat.math.ethz.ch
Subject: [R] New user of R on Mac OS X - Please help
A brand new Mac OS X user, I am transferring my R stuff from my Windows 
machine.

When porting some of my functions, I got messages such as  :
2005-03-12 15:37:52.456 R[673] *** NSTimer discarding exception 
'NSRangeException' (reason '*** NSRunStorage, 
_NSBlockNumberForIndex(): index (3607) beyond array bounds (2000)') 
that raised during firing of timer with target 3ba850 and selector 
'runRELP:'

Does anybody have an idea where I should start looking 
for a solution to this?

Thanks in advance for your help.
Constant Depièreux
If this is too specific to the Mac platform to receive help here, you 
might also try posting your message on
[EMAIL PROTECTED]

Denis
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Problem installing Hmisc

2005-02-08 Thread Denis Chabot
Hi,

I do have it installed on 2 Macs as well (OS X 10.2.8 and 10.3.7) and 
what I need does work; however, if you run the R CMD check routine, some 
problems will likely be revealed. At least there were problems for me.

Denis
On 8 February 2005, at 12:23, [EMAIL PROTECTED] wrote:

 From: Don MacQueen [EMAIL PROTECTED]
 Date: 07 February 2005 16:05:14 GMT+01:00
 To: Thomas Lumley [EMAIL PROTECTED], Michael Kubovy
 [EMAIL PROTECTED]
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] Problem installing Hmisc


 At 10:28 AM -0800 2/5/05, Thomas Lumley wrote:
 On Sat, 5 Feb 2005, Michael Kubovy wrote:

 What am I doing wrong?

 You can look at the console log (eg start Console.app in 
 /Applications/Utilities) to see what the problem is, but the fact 
 that Hmisc is not available as a binary package probably means that 
 it does not compile for the Mac (at least without some modification)

  -thomas

 I have Hmisc installed on a Mac without any modification, i.e., using 
 install.packages().
 R itself was installed from source code.

  version
           _
  platform powerpc-apple-darwin6.8.5
  arch     powerpc
  os       darwin6.8.5
  system   powerpc, darwin6.8.5
  status
  major    2
  minor    0.1
  year     2004
  month    11
  day      15
  language R

  library(Hmisc)
 Hmisc library by Frank E Harrell Jr

 Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview')
 to see overall documentation.

 NOTE:Hmisc no longer redefines [.factor to drop unused levels when
 subsetting.  To get the old behavior of Hmisc type dropUnusedLevels().

 Attaching package 'Hmisc':


 The following object(s) are masked from package:stats :

  ecdf




 --- snip ---


 Thomas LumleyAssoc. Professor, Biostatistics
 [EMAIL PROTECTED]University of Washington, Seattle




 -Don
 -- 
 --
 Don MacQueen
 Environmental Protection Department
 Lawrence Livermore National Laboratory
 Livermore, CA, USA


[[alternative text/enriched version deleted]]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] 2 small problems: integer division and the nature of NA

2005-02-05 Thread Denis Chabot
Thanks to the many R users who convinced me that the sum of NAs should 
be zero and gave me a solution if I did not want it to be zero.

Thank you also for the explanations of rounding errors with floating-
point arithmetic. I did not expect it. This small error was a real 
problem for me as I was trying to find a way to recode numeric values 
into intervals. Because I wanted to retain numeric values as a result, 
I tried not to use cut or cut2. Hence to convert a range of 
temperatures into 0.2 degree intervals I had written:

(let's first make a fake temperature variable k for testing)
k <- seq(-5, 5, 0.1)
k1 <- ifelse(k < 0, -0.2*(abs(k) %/% 0.2) - 0.1, 0.2*(k %/% 0.2) + 0.1)
Note that this works well to quickly recode a numeric variable that 
only takes integer values. But it produces the problem that prompted my 
call for help when there are decimals: some values end up in a 
different class than what you'd expect.

Considering your answers, I found 3 solutions:
k2 <- ifelse(k < 0, -0.2*(abs(round(10*k)) %/% 2) - 0.1,
             0.2*(round(10*k) %/% 2) + 0.1)

k3 <- (-0.1+min(k)) + 0.2 * as.numeric(cut(k, seq(min(k), max(k)+0.2, 0.2),
             right=F, labels=F))

k4 <- cut2(k, seq(min(k), max(k)+0.2, 0.2), levels.mean=T)
k5 <- as.numeric(levels(k4))[k4]
I could round to 1 decimal to be even more exact but this is good 
enough. If it can be more elegant, please let me know!
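
A small numeric illustration of the floating-point effect mentioned above 
(these lines are illustrative, not taken from the thread):

48 %/% 2                       # 24: both operands are represented exactly
4.8 %/% 0.2                    # 23: neither 4.8 nor 0.2 is exact in binary
print(4.8 / 0.2, digits = 17)  # 23.999999999999996, slightly below 24
round(4.8 / 0.2)               # 24: rounding first gives the expected count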

Denis
Subject: [R] 2 small problems: integer division and the nature of NA
Hi,
I'm wondering why
48 %/% 2 gives 24
but
4.8 %/% 0.2 gives 23...
I'm not trying to round up here, but to find out how many times
something fits into something else, and the answer should have been the
same for both examples, no?
On a different topic, I like the behavior of NAs better in R than in
SAS (at least they are not considered the smallest value for a
variable), but at the same time I am surprised that the sum of NAs is 0
instead of NA.
The sum of a vector having at least one NA but also valid data gives NA
if we do not specify na.rm=T. But with na.rm=T, we are telling sum to
give the sum of valid data, ignoring NAs that do not tell us anything
about the value of a variable. I found out while getting the sum of
small subsets of my data (such as when subsetting by several
variables), sometimes a cell only contained NAs for my response
variable. I would have expected the sum to be NA in such cases, as I do
not have a single data point telling me the value of my response here.
But R tells me the sum was zero in that cell! Was this behavior
considered desirable when sum was built? If not, any hope it will be
fixed?
Sincerely,
Denis Chabot
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] 2 small problems: integer division and the nature of NA

2005-02-04 Thread Denis Chabot
Hi,
I'm wondering why
48 %/% 2 gives 24
but
4.8 %/% 0.2 gives 23...
I'm not trying to round up here, but to find out how many times 
something fits into something else, and the answer should have been the 
same for both examples, no?

On a different topic, I like the behavior of NAs better in R than in 
SAS (at least they are not considered the smallest value for a 
variable), but at the same time I am surprised that the sum of NAs is 0 
instead of NA.

The sum of a vector having at least one NA but also valid data gives NA 
if we do not specify na.rm=T. But with na.rm=T, we are telling sum to 
give the sum of valid data, ignoring NAs that do not tell us anything 
about the value of a variable. I found out while getting the sum of 
small subsets of my data (such as when subsetting by several 
variables), sometimes a cell only contained NAs for my response 
variable. I would have expected the sum to be NA in such cases, as I do 
not have a single data point telling me the value of my response here. 
But R tells me the sum was zero in that cell! Was this behavior 
considered desirable when sum was built? If not, any hope it will be 
fixed?

Sincerely,
Denis Chabot
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] recoding large number of categories (select in SAS)

2005-01-19 Thread Denis Chabot
Hi,
I have data on stomach contents. Possible prey species are in the 
hundreds, so a list of prey codes has been in used in many labs doing 
this kind of work.

When comes time to do analyses on these data one often wants to regroup 
prey in broader categories, especially for rare prey.

In SAS you can nest a large number of if-else, or do this more 
cleanly with select like this:
select;
  when (149 <= prey <= 150)                      preyGr = 150;
  when (186 <= prey <= 187)                      preyGr = 187;
  when (prey = 438)                              preyGr = 438;
  when (prey = 430)                              preyGr = 430;
  when (prey = 436)                              preyGr = 436;
  when (prey = 431)                              preyGr = 431;
  when (prey = 451)                              preyGr = 451;
  when (prey = 461)                              preyGr = 461;
  when (prey = 478)                              preyGr = 478;
  when (prey = 572)                              preyGr = 572;
  when (692 <= prey <= 695)                      preyGr = 692;
  when (808 <= prey <= 826, 830 <= prey <= 832)  preyGr = 808;
  when (997 <= prey <= 998, 792 <= prey <= 796)  preyGr = 792;
  when (882 <= prey <= 909)                      preyGr = 882;
  when (prey in (999, 125, 994))                 preyGr = 9994;
  otherwise                                      preyGr = 1;
end; *select;

The number of transformations is usually much larger than this short 
example.

What is the best way of doing this in R?
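
One possible translation into R (a sketch, not one of the thread's 
replies; it uses vectorized assignments, which requires that the recoding 
rules do not overlap, unlike SAS's first-match-wins select; the data frame 
name stomach is hypothetical):

recode.prey <- function(prey) {
  preyGr <- rep(1, length(prey))                       # SAS "otherwise"
  preyGr[prey >= 149 & prey <= 150] <- 150
  preyGr[prey >= 186 & prey <= 187] <- 187
  single <- c(438, 430, 436, 431, 451, 461, 478, 572)  # codes kept as-is
  preyGr[prey %in% single] <- prey[prey %in% single]
  preyGr[prey >= 692 & prey <= 695] <- 692
  preyGr[(prey >= 808 & prey <= 826) | (prey >= 830 & prey <= 832)] <- 808
  preyGr[(prey >= 997 & prey <= 998) | (prey >= 792 & prey <= 796)] <- 792
  preyGr[prey >= 882 & prey <= 909] <- 882
  preyGr[prey %in% c(999, 125, 994)] <- 9994
  preyGr
}
stomach$preyGr <- recode.prey(stomach$prey)   # 'stomach' is a hypothetical data frame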
Sincerely,
Denis Chabot
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] subsetting like in SAS

2005-01-17 Thread Denis Chabot
I want to thank Petr Pikal, Robert Balshaw and Na Li for suggesting the 
use of unique or !duplicated on a subset of my data where unwanted 
variables have been removed. This worked perfectly.
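
For the record, the suggested approach boils down to something like this 
(column names other than the five key variables are hypothetical):

key.vars  <- c("origin", "ship_cat", "ship_nb", "trip", "set")
site.vars <- c(key.vars, "latitude", "longitude", "depth")  # assumed names
sites <- alldata[!duplicated(alldata[, key.vars]), site.vars]
# or, since the site descriptors are constant within a site:
sites2 <- unique(alldata[, site.vars])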

Denis Chabot
On 13 Jan 2005 at 11:52, Denis Chabot wrote:
Hi,
Being in the process of translating some of my SAS programs to R, I
encountered one difficulty. I have a solution, but it is not elegant
(and not pleasant to implement).
I have a large dataset with many variables needed to identify the
origin of a sample, many to describe sample characteristics, others to
describe site characteristics.
I want only a (shorter) list of sites and their characteristics.
If origin, ship_cat, ship_nb, trip and set are needed to
identify a site, in SAS you'd sort on those variables, then read the
data with:
data sites;
 set alldata;
 by origin ship_cat ship_nb trip set;
 if first.set;
 keep list-of-variables-detailing-sites;
run;
In R I did this with the Lag function of Hmisc, and the original data
set also needs to be sorted first:
oL <- Lag(origin)
scL <- Lag(ship_cat)
snL <- Lag(ship_nb)
tL <- Lag(trip)
sL <- Lag(set)
same <- origin==oL & ship_cat==scL & ship_nb==snL & trip==tL & set==sL
sites <- subset(alldata, !same,
select=c(list-of-variables-detailing-sites))
Could I do better than this?
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] Re: labels attached to variable names

2005-01-09 Thread Denis Chabot
Thank you to the many helpful list members who answered.
I am grateful to have been told about attr and about Hmisc. Actually 
Hmisc also brings me solutions to many things on my wish list!
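
For readers of the archive, a minimal sketch of the Hmisc approach (the 
data frame and label text are made up):

library(Hmisc)
d <- data.frame(len = c(12.3, 15.1, 9.8, 20.4))
label(d$len) <- "Fork length (cm)"
label(d$len)                        # retrieve the label
plot(d$len, ylab = label(d$len))    # reuse it as an axis title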

Cheers,
Denis
On Friday, 7 January 2005, at 12:11 Europe/Paris, 
[EMAIL PROTECTED] wrote:

From: Denis Chabot [EMAIL PROTECTED]
Date: Thursday 6 January 2005 15:40:07 Europe/Paris
To: r-help@stat.math.ethz.ch
Subject: [R] labels attached to variable names
Hi,
Can we attach a more descriptive label (I may use the wrong 
terminology, which would explain why I found nothing on the FAQ) to 
variable names, and later have an easy way to switch to these labels 
in plots? I fear this is not possible and one must enter this by hand 
as ylab and xlab when making plots.

Thanks in advance,
Denis Chabot
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] how to do by-processing using weighted.mean

2005-01-09 Thread Denis Chabot
Hi,
I'd like to compute the weighted mean of some variables, all using the 
same weight variable, for each combination of 3 factor variables.

I found how I could use summarize (from Hmisc) to do normal means for 
combinations of 3 factors, but I cannot find a way of doing weighted 
means. Is it possible in R?
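
One way to do this with base R (a sketch with made-up data, not a reply 
from the thread): pass both the value and the weight to a function applied 
per group.

set.seed(1)
d <- data.frame(f1 = gl(2, 12), f2 = gl(3, 4, 24), f3 = gl(2, 2, 24),
                y = rnorm(24), w = runif(24))
grp <- interaction(d$f1, d$f2, d$f3, drop = TRUE)
sapply(split(d, grp), function(g) weighted.mean(g$y, g$w))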

Thanks in advance,
Denis Chabot
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] labels attached to variable names

2005-01-06 Thread Denis Chabot
Hi,
Can we attach a more descriptive label (I may use the wrong  
terminology, which would explain why I found nothing on the FAQ) to  
variable names, and later have an easy way to switch to these labels in  
plots? I fear this is not possible and one must enter this by hand as  
ylab and xlab when making plots.

Thanks in advance,
Denis Chabot
 
---
Denis Chabot, Ph.D.
Chercheur en bioénergétique   Bioenergetics researcher
Institut Maurice-Lamontagne Institut Maurice-Lamontagne
Pêches et Océans Canada     Fisheries & Oceans Canada
CP 1000, Mont-Joli, QC      PB 1000, Mont-Joli, QC
G5H 3Z4                     G5H 3Z4
Canada  Canada

(418) 775-0624   (418) 775-0624
http://www.qc.dfo-mpo.gc.ca/iml
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] position of labels in vis.gam or persp

2005-01-06 Thread Denis Chabot
Hi,
Is there a way to control the position of labels in 3D plots such as  
that produced by vis.gam (in mgcv), or persp? Some of my plots have 
axis labels (titles) that collide with the tick labels. I have  
increased the margin around the plot so that there is plenty of space  
to move the axis labels further away from their axis, but I found no  
option for this in the documentation of either vis.gam or persp. The  
latter states that some par options can work, but the one that seemed 
promising (mgp) does not have any effect in vis.gam.

Thanks for any help,
Denis
 
---
Denis Chabot, Ph.D.
Chercheur en bioénergétique   Bioenergetics researcher
Institut Maurice-Lamontagne Institut Maurice-Lamontagne
Pêches et Océans Canada     Fisheries & Oceans Canada
CP 1000, Mont-Joli, QC      PB 1000, Mont-Joli, QC
G5H 3Z4                     G5H 3Z4
Canada  Canada

(418) 775-0624   (418) 775-0624
http://www.qc.dfo-mpo.gc.ca/iml
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html