Re: [Rd] Serial connections?
There is another option I have used for a couple of years to pilot scientific devices, to program my chemostats, etc., and that is platform-independent: Tcl. Tcl's I/O design is excellent and very robust, and Tcl/Tk is integrated in R with the tcltk package. It is really just a matter of a few lines of code in R to communicate through a serial port from R using Tcl. Something like:

require(tcltk)
.Tcl('set R_com1 [open com1 r+]') # Works on Windows too!
# There are many config parameters available here... just an example
.Tcl('fconfigure $R_com1 -mode 9600,n,8,1 -buffering none -blocking 0')
# Send a command to your device through the serial port
.Tcl('puts -nonewline $R_com1 {my_cmd}')
# Read a line of text from the serial port
line <- tclvalue(.Tcl('gets $R_com1'))

With a little bit more code, one can program Tcl to call R code automatically every time new data is pushed by the connected device through the serial port. This is done using something like:

.Tcl('fileevent $R_com1 readable [list Handler_com1 $R_com1]')

Here, Handler_com1 is a Tcl function. So, some care must be taken using tcltk's .Tcl.callback() to trigger the event on the R side. One way to deal with this easily is by using tclFun() from the tcltk2 package. In tcltk2, there is also ?tclTaskSchedule, which can be of interest in the context of serial port communication to trigger an R function in the background regularly and collect data actively from the serial port. All these tools give me a lot of flexibility to communicate through the serial port from R,... and most importantly, to write my code in a portable way (tested on Windows XP and Linux Ubuntu). If there is some interest in this approach, I could initiate a 'tclcom' R package on R-Forge and place there the code I have.

Best,

Philippe

..°})) ) ) ) ) ) ( ( ( ( (Prof. Philippe Grosjean ) ) ) ) ) ( ( ( ( (Numerical Ecology of Aquatic Systems ) ) ) ) ) Mons University, Belgium ( ( ( ( ( ..
On 21/04/10 03:17, shotwelm wrote:

Simon is right of course, there are plenty of sensors that would work just fine at 9600 baud (like a thermistor rigged to an ADC). There's a theorem along these lines (the Nyquist sampling theorem?). I think piping the output to R is a clever solution. I added a few lines to the ttys.c program so that the baud rate is a command line option (i.e. -B9600) http://biostatmatt.com/temp/ttys.c and confirmed it will compile on Linux (2.6.30). Maybe it will save a step. Microcontrollers really are addictive!

For an ioctl package, I was originally thinking of using file descriptors directly. However, I agree this feels like subverting what could be an extension of the connections API. Given that everything is a file on POSIX systems, there may be an argument for an ioctl package that is independent of the connections implementation, say to do things that connections were not designed to do. For example, interfacing with V4L2 devices usually involves many ioctl calls, an mmap call, but rarely read or write calls. But maybe it would just be better to pipe this type of output to R also...

-Matt

On Tue, 2010-04-20 at 16:42 -0400, Simon Urbanek wrote:

On Apr 20, 2010, at 11:51 AM, shotwelm wrote:

I've done some microcontroller work over serial also. Unfortunately, interfacing with a serial port is system dependent, and the mechanisms can be quite different, as you probably know. It appears that Simon has a solution below that will work if you are willing to accept the default baud rate (9600 is way too slow for good sensor data [OT: define good ;) - good doesn't mean fast - besides it won't be any good if it is too fast to be meaningfully processed -- that's a different story, though :P - and it is trivial to change so the solution works in general]), parity, etc., or use external tools. On POSIX systems, you would need access to the termios.h header and the system ioctl function in order to change these settings.
Although I'm not 100% sure, I don't think R has this capability ... yet. I'm new to the list, but I'd be surprised if the R developers who have been around a while haven't already considered adding support for ioctls and the POSIX terminal interface. This makes me wonder why it's not there. If there is no good reason, I'm starting to see a series of R packages (or core extensions) developing. Good luck ;).

The issue is that connections are inherently backend-independent, which implies that packages have no access to connection internals, as they can change at any time. This means that you can't enhance them without putting the enhancements into R itself. This implies that you have to make a strong case, since you need a volunteer in R-core to maintain that code etc.

With a package for ioctls, we could use all sorts of cool stuff, like Video4Linux2 (webcams, HAM radio, tuners)... Ioctls are
[Rd] New CRAN test platform - Sparc Solaris
The CRAN package check page now shows results on Sparc Solaris 10, using a server donated by Sun. This box differs in several aspects from the existing Intel-based test boxes:

- the CPU is big-endian.
- it does not have extended-precision doubles in FPU registers. (It does support long doubles if selected explicitly.)
- the C/C++ header set is from a different tradition.
- the Sun Studio compiler is used (as has been on one of the Linux boxes, and we may phase the latter out in due course).
- the toolset is AT&T-based Unix rather than GNU or BSD. In particular Solaris make, sed and tar and an actual Bourne shell are used.

There is only a limited set of additional software installed. A few things have had to be compiled with gcc, notably the packages using Gtk+ (and Solaris' own installation of Gtk+ is too old for RGtk2) and MCMCpack. The server does not have a graphics card and this is causing some problems with the X11 installation, including bitmap devices and rgl, that we are still working on.

Amongst the package issues shown up are:

- unterminated final lines in src/Makevars[.in] cause failures.
- there are many packages which require GNU make and fail to declare it. These include Cairo JavaGD Rcpp RcppExamples RInside farmR highlight phyclust png rJava.
- the assumption that 'tar' accepts compressed archives (gdata).
- 60-odd C++-using packages are not correct C++, often because they use GNU extensions or have missing headers (and C++ requires headers for prototypes), or use C99 functions that are not in C++ and hence not in the headers when used from C++ -- Solaris is rather strict about the latter.

Brian Ripley

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax: +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Serial connections?
Philippe,

unfortunately that approach has one major drawback - it does not give you a connection. As I said in the previous e-mail it is fairly easy to talk to a tty directly, but the challenge is to turn it into a connection. I don't like the Tcl approach for several reasons, one of them being that it's entirely unnatural, since you have to construct strings of code -- you could as well use rJava and Java serial connections, which has slightly friendlier syntax (but I wouldn't recommend that, either), or any other language R interfaces to. I was thinking about it a bit and I may be tempted to have a stab at a tty connection, but I still would not want to mess with ioctl (the other part Matt mentioned).

Cheers,
Simon

On Apr 21, 2010, at 6:30 AM, Philippe Grosjean wrote: [quoted text snipped]
Re: [Rd] Serial connections?
Simon,

I see what you mean, and I agree it would be nice to have a connection. That would be the natural way in S. Yet, connections are driven by R. You cannot open a connection and let the connected device trigger some R code when data is available, can you? Otherwise, I don't understand your problem with strings of code. A .R file also contains a series of strings of code interpreted by the R parser when you source() it. Just that code happens to be S. Here the code is Tcl. You can source a .tcl file with the same code as well, if you prefer...

Best,

Philippe

On 21/04/10 14:07, Simon Urbanek wrote: [quoted text snipped]
[Rd] suggestion how to use memcpy in duplicate.c
From copyVector in duplicate.c :

void copyVector(SEXP s, SEXP t)
{
    int i, ns, nt;
    nt = LENGTH(t);
    ns = LENGTH(s);
    switch (TYPEOF(s)) {
    ...
    case INTSXP:
        for (i = 0; i < ns; i++)
            INTEGER(s)[i] = INTEGER(t)[i % nt];
        break;
    ...

could that be replaced with :

    case INTSXP:
        for (i = 0; i < ns/nt; i++)
            memcpy((char *)DATAPTR(s) + i*nt*sizeof(int), (char *)DATAPTR(t), nt*sizeof(int));
        break;

and similar for the other types in copyVector. This won't help regular vector copies, since those seem to be done by the DUPLICATE_ATOMIC_VECTOR macro, see next suggestion below, but it should help copyMatrix which calls copyVector, scan.c which calls copyVector on three lines, dcf.c (once) and dounzip.c (once).

For the DUPLICATE_ATOMIC_VECTOR macro there is already a comment next to it :

    FIXME: surely memcpy would be faster here?

which seems to refer to the for loop :

    else { \
        int __i__; \
        type *__fp__ = fun(from), *__tp__ = fun(to); \
        for (__i__ = 0; __i__ < __n__; __i__++) \
            __tp__[__i__] = __fp__[__i__]; \
    } \

Could that loop be replaced by the following ?

    else { \
        memcpy((char *)DATAPTR(to), (char *)DATAPTR(from), __n__*sizeof(type)); \
    } \

In the data.table package, dogroups.c uses this technique, so the principle is tested and works well so far. Are there any road blocks preventing this change, or is anyone already working on it ? If not then I'll try and test it (on Ubuntu 32bit) and submit a patch with timings, as before. Comments/pointers much appreciated.

Matthew
Re: [Rd] suggestion how to use memcpy in duplicate.c
Matt,

On Apr 21, 2010, at 11:54 AM, Matthew Dowle wrote:

From copyVector in duplicate.c : [...] could that be replaced with :

    case INTSXP:
        for (i = 0; i < ns/nt; i++)
            memcpy((char *)DATAPTR(s) + i*nt*sizeof(int), (char *)DATAPTR(t), nt*sizeof(int));
        break;

Won't that miss the last incomplete chunk? (and please don't use DATAPTR on INTSXP even though the effect is currently the same) In general it seems that it depends on nt whether this is efficient or not, since calls to short memcpy are expensive (very small nt, that is). I ran some empirical tests to compare memcpy vs for() (x86_64, OS X) and the results were encouraging - depending on the size of the copied block the difference could be quite big:

- tiny block (ca. n = 32 or less) - for() is faster
- small block (n ~ 1k) - memcpy is ca. 8x faster
- as the size increases the gap closes (presumably due to RAM bandwidth limitations), so for n = 512M it is ~30%.

Of course this is contingent on the implementation of memcpy, compiler, architecture etc. And will only matter if copying is what you do most of the time ...

Cheers,
Simon

[rest of quoted text snipped]
[Rd] Bugs? when dealing with contrasts
R version 2.10.1 (2009-12-14)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale. R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Below are two cases where I don't seem to be getting contr.sum contrasts even though they were specified. Are these bugs? The first case is an interaction between continuous and factor variables. In the second case contrasts= was specified as an arg to lm. The second works ok if we set the contrasts through options but not if we set it through an lm argument.

# 1. In this case I don't seem to be getting contr.sum contrasts:

options(contrasts = c("contr.sum", "contr.poly"))
getOption("contrasts")
[1] "contr.sum"  "contr.poly"
scores <- rep(seq(-2, 2), 3); scores
 [1] -2 -1  0  1  2 -2 -1  0  1  2 -2 -1  0  1  2
fac <- gl(3, 5); fac
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

# I get this:

model.matrix(~ scores:fac)
   (Intercept) scores:fac1 scores:fac2 scores:fac3
1            1          -2           0           0
2            1          -1           0           0
3            1           0           0           0
4            1           1           0           0
5            1           2           0           0
6            1           0          -2           0
7            1           0          -1           0
8            1           0           0           0
9            1           0           1           0
10           1           0           2           0
11           1           0           0          -2
12           1           0           0          -1
13           1           0           0           0
14           1           0           0           1
15           1           0           0           2
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$fac
[1] "contr.sum"

# But I was expecting this since I am using contr.sum

cbind(1, model.matrix(~ fac)[,2:3] * scores)
     fac1 fac2
1  1   -2    0
2  1   -1    0
3  1    0    0
4  1    1    0
5  1    2    0
6  1    0   -2
7  1    0   -1
8  1    0    0
9  1    0    1
10 1    0    2
11 1    2    2
12 1    1    1
13 1    0    0
14 1   -1   -1
15 1   -2   -2

# 2.
# here I don't get contr.sum but rather get contr.treatment

options(contrasts = c("contr.treatment", "contr.poly"))
getOption("contrasts")
[1] "contr.treatment" "contr.poly"

model.matrix(lm(seq(15) ~ fac, contrasts = c("contr.sum", "contr.poly")))
   (Intercept) fac2 fac3
1            1    0    0
2            1    0    0
3            1    0    0
4            1    0    0
5            1    0    0
6            1    1    0
7            1    1    0
8            1    1    0
9            1    1    0
10           1    1    0
11           1    0    1
12           1    0    1
13           1    0    1
14           1    0    1
15           1    0    1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$fac
[1] "contr.treatment"

model.matrix(lm(seq(15) ~ fac)) # same
   (Intercept) fac2 fac3
1            1    0    0
2            1    0    0
3            1    0    0
4            1    0    0
5            1    0    0
6            1    1    0
7            1    1    0
8            1    1    0
9            1    1    0
10           1    1    0
11           1    0    1
12           1    0    1
13           1    0    1
14           1    0    1
15           1    0    1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$fac
[1] "contr.treatment"

# I was expecting the first one to give me this

options(contrasts = c("contr.sum", "contr.poly"))
model.matrix(lm(seq(15) ~ fac))
   (Intercept) fac1 fac2
1            1    1    0
2            1    1    0
3            1    1    0
4            1    1    0
5            1    1    0
6            1    0    1
7            1    0    1
8            1    0    1
9            1    0    1
10           1    0    1
11           1   -1   -1
12           1   -1   -1
13           1   -1   -1
14           1   -1   -1
15           1   -1   -1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$fac
[1] "contr.sum"

R.version.string
[1] "R version 2.10.1 (2009-12-14)"
win.version()
[1] "Windows Vista (build 6002) Service Pack 2"
Re: [Rd] suggestion how to use memcpy in duplicate.c
On 4/21/10 10:45 AM, Simon Urbanek wrote: [quoted text snipped]

Copying of vectors is something that I would expect to happen fairly often in many applications of R. Is for() faster on small blocks by enough that one would want to branch based on size?

+ seth

--
Seth Falcon | @sfalcon | http://userprimary.net/
Re: [Rd] suggestion how to use memcpy in duplicate.c
On 21/04/10 17:54, Matthew Dowle wrote:

From copyVector in duplicate.c : [...]

    case INTSXP:
        for (i = 0; i < ns; i++)
            INTEGER(s)[i] = INTEGER(t)[i % nt];
        break;

[...] could that be replaced with : [...]

or at least with something like this:

    int *p_s = INTEGER(s);
    int *p_t = INTEGER(t);
    for (i = 0; i < ns; i++) {
        p_s[i] = p_t[i % nt];
    }

since expanding the INTEGER macro over and over has a price.

[rest of quoted text snipped]

--
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/9aKDM9 : embed images in Rd documents
|- http://tr.im/OIXN : raster images and RImageJ
|- http://tr.im/OcQe : Rcpp 0.7.7
Re: [Rd] suggestion how to use memcpy in duplicate.c
On Apr 21, 2010, at 2:15 PM, Seth Falcon wrote:

[quoted text snipped] Copying of vectors is something that I would expect to happen fairly often in many applications of R. Is for() faster on small blocks by enough that one would want to branch based on size?

Good question. Given that the branching itself adds overhead, possibly not. In the best case for() can be ~40% faster (for single-digit n), but that means billions of copies to make a difference (since the operation itself is so fast). The break-even point on my test machine is n=32, and when I added the branching it took a 20% hit, so I guess it's simply not worth it. The only case that may be worth branching is n:1, since that is likely a fairly common use (the branching penalty in copy routines is lower than comparing memcpy/for implementations, since the branching can be done before the outer for loop, so this may vary case-by-case).

Cheers,
Simon
Re: [Rd] suggestion how to use memcpy in duplicate.c
On Apr 21, 2010, at 3:32 PM, Romain Francois wrote:

[quoted text snipped] since expanding the INTEGER macro over and over has a price.

... in packages, yes, but not in core R.

Cheers,
Simon

[rest of quoted text snipped]
Re: [Rd] suggestion how to use memcpy in duplicate.c
On 21/04/10 21:39, Simon Urbanek wrote:

[quoted text snipped] since expanding the INTEGER macro over and over has a price.

... in packages, yes, but not in core R.

Hmm. I was not talking about the overhead of the INTEGER function:

    int *(INTEGER)(SEXP x) {
        if (TYPEOF(x) != INTSXP && TYPEOF(x) != LGLSXP)
            error("%s() can only be applied to a '%s', not a '%s'",
                  "INTEGER", "integer", type2char(TYPEOF(x)));
        return INTEGER(x);
    }

but the one related to the macro:

    #define INTEGER(x) ((int *) DATAPTR(x))
    #define DATAPTR(x) (((SEXPREC_ALIGN *) (x)) + 1)

so the loop expands to :

    for (i = 0; i < ns; i++)
        ((int *) (((SEXPREC_ALIGN *) (s)) + 1))[i] = ((int *) (((SEXPREC_ALIGN *) (t)) + 1))[i % nt];

I still believe grabbing the pointer just once for s and once for t is more efficient ...

[rest of quoted text snipped]
Re: [Rd] suggestion how to use memcpy in duplicate.c
On Apr 21, 2010, at 4:13 PM, Romain Francois wrote:
> [...]
> so the loop expands to:
>
> for (i = 0; i < ns; i++)
>     ((int *) (((SEXPREC_ALIGN *) (s)) + 1))[i] =
>         ((int *) (((SEXPREC_ALIGN *) (t)) + 1))[i % nt];
>
> I still believe grabbing the pointer just once for s and once for t is
> more efficient ...

Nope: since everything involved is loop invariant, the pointer values don't change (you'd have to declare s or t volatile to prevent that). Try using gcc -s and you'll see that the code is the same (depending on the version of gcc the order of the first comparison can change, so technically the INTEGER(x) version can save one add instruction in the degenerate case and be faster(!) in old gcc).

Cheers,
Simon

> [remainder of quoted thread elided]
Re: [Rd] Bugs? when dealing with contrasts
On Wed, Apr 21, 2010 at 4:26 PM, Peter Dalgaard <pda...@gmail.com> wrote:
> As for case #1, the rules are tricky in cases where interactions are
> present without main effects, but AFAICS, what you observe is
> essentially the same effect as
>
> > model.matrix(~fac-1, contrasts=list(fac=contr.sum))
>    fac1 fac2 fac3
> 1     1    0    0
> 2     1    0    0
> 3     1    0    0
> 4     1    0    0
> 5     1    0    0
> 6     0    1    0
> 7     0    1    0
> 8     0    1    0
> 9     0    1    0
> 10    0    1    0
> 11    0    0    1
> 12    0    0    1
> 13    0    0    1
> 14    0    0    1
> 15    0    0    1
> attr(,"assign")
> [1] 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$fac
> [1] "contr.sum"
>
> I.e., that R reverts to using indicator variables when the intercept is
> absent.

Is there any nice way of getting contr.sum coding for the interaction, as opposed to the ugly code in my post that I used to force it? i.e.

cbind(1, model.matrix(~ fac)[,2:3] * scores)

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion how to use memcpy in duplicate.c
On Apr 21, 2010, at 4:39 PM, Simon Urbanek wrote:
> [...]
> Try using gcc -s

Sorry, I meant gcc -S of course.

> and you'll see that the code is the same (depending on the version of
> gcc the order of the first comparison can change, so technically the
> INTEGER(x) version can save one add instruction in the degenerate case
> and be faster(!) in old gcc

[the reason being that INTEGER(x) does not need to be evaluated if the loop is not entered, whereas p_s and p_t are populated unconditionally; smarter compilers will notice that p_s/p_t are not used in that case and thus generate identical code in both cases]

Cheers,
Simon

> [remainder of quoted thread elided]
Re: [Rd] suggestion how to use memcpy in duplicate.c
If I were worried about the time this loop takes, I would avoid using i%nt. For the attached C code, compiled with gcc 4.3.3 with -O2, I get

# INTEGER() in loop
> system.time( r1 <- .Call("my_rep1", 1:3, 1e7) )
   user  system elapsed
  0.060   0.012   0.071
# INTEGER() before loop
> system.time( r2 <- .Call("my_rep2", 1:3, 1e7) )
   user  system elapsed
  0.076   0.008   0.086
# replace i%src_length in loop with j=0 before loop and
# if(++j==src_length) j=0 ;  in the loop.
> system.time( r3 <- .Call("my_rep3", 1:3, 1e7) )
   user  system elapsed
  0.024   0.028   0.050
> identical(r1,r2)
[1] TRUE
> identical(r2,r3)
[1] TRUE

The C code is:

#define USE_RINTERNALS /* pretend we are in the R kernel */
#include <R.h>
#include <Rinternals.h>

SEXP my_rep1(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int i ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    if (TYPEOF(s_src) != INTSXP)
        error("src must be integer data") ;
    for (i = 0 ; i < dest_length ; i++) {
        INTEGER(s_dest)[i] = INTEGER(s_src)[i % src_length] ;
    }
    UNPROTECT(1) ;
    return s_dest ;
}

SEXP my_rep2(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int *psrc = INTEGER(s_src) ;
    int *pdest ;
    int i ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    pdest = INTEGER(s_dest) ;
    if (TYPEOF(s_src) != INTSXP)
        error("src must be integer data") ;
    /* end of boilerplate */
    for (i = 0 ; i < dest_length ; i++) {
        pdest[i] = psrc[i % src_length] ;
    }
    UNPROTECT(1) ;
    return s_dest ;
}

SEXP my_rep3(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int *psrc = INTEGER(s_src) ;
    int *pdest ;
    int i, j ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    pdest = INTEGER(s_dest) ;
    if (TYPEOF(s_src) != INTSXP)
        error("src must be integer data") ;
    /* end of boilerplate */
    for (j = 0, i = 0 ; i < dest_length ; i++) {
        *pdest++ = psrc[j++] ;
        if (j == src_length) {
            j = 0 ;
        }
    }
    UNPROTECT(1) ;
    return s_dest ;
}

Bill
Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org]
> On Behalf Of Romain Francois
> Sent: Wednesday, April 21, 2010 12:32 PM
> To: Matthew Dowle
> Cc: r-de...@stat.math.ethz.ch
> Subject: Re: [Rd] suggestion how to use memcpy in duplicate.c
>
> [quoted text elided]
[Rd] RUnit bug?
There appears to be a bug in RUnit. Given a test suite testsuite.math, say, when I run:

runTestSuite(testsuite.math)

this works fine, provided there are no extraneous files in the unit test subdirectory. But if there are any Emacs temp files (with names that end with '~') then runTestSuite gets confused and tries to run functions from the temp files as well.
[Rd] Question of R CMD check
Hi all,

Today I installed the newest R version, 2.10.1, and the other tools necessary for building R packages under Windows (e.g. Rtools, perl), all in their newest versions. After configuring everything correctly under Windows, I used it to re-check my old package. I found the following problem when checking the EXAMPLEs of each function, which did not exist before this re-installation:

* checking examples ... ERROR
Running examples in 'stam-Ex.R' failed.

I used \dontrun{} % enddontrun in all the examples of my functions, so that should be no problem, I think. I checked my package before and did not find errors. I also browsed the checking results in 'stam-Ex.R'. It listed all the example code in that file, something like this:

cleanEx(); nameEx("stcdt")
### * stcdt
flush(stderr()); flush(stdout())
### example codes

I have not met this problem before. Any ideas on solving this? Thanks a lot.

--
Jane Chang
Queen's
Re: [Rd] Bugs? when dealing with contrasts
G'day Gabor,

On Wed, 21 Apr 2010 14:00:33 -0400 Gabor Grothendieck <ggrothendi...@gmail.com> wrote:

> Below are two cases where I don't seem to be getting contr.sum
> contrasts even though they were specified. Are these bugs?

Short answer: no. :)

> The first case is an interaction between continuous and factor
> variables. The second case contrasts= was specified as an arg to lm.
> The second works ok if we set the contrasts through options but not if
> we set it through an lm argument.
>
> # 1. In this case I don't seem to be getting contr.sum contrasts:
>
> > options(contrasts = c("contr.sum", "contr.poly"))
> > getOption("contrasts")
> [1] "contr.sum"  "contr.poly"
> > scores <- rep(seq(-2, 2), 3); scores
>  [1] -2 -1  0  1  2 -2 -1  0  1  2 -2 -1  0  1  2
> > fac <- gl(3, 5); fac
>  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
> Levels: 1 2 3
>
> # I get this:
> > model.matrix(~ scores:fac)
>    (Intercept) scores:fac1 scores:fac2 scores:fac3
> 1            1          -2           0           0
> 2            1          -1           0           0
> 3            1           0           0           0
> 4            1           1           0           0
> 5            1           2           0           0
> 6            1           0          -2           0
> 7            1           0          -1           0
> 8            1           0           0           0
> 9            1           0           1           0
> 10           1           0           2           0
> 11           1           0           0          -2
> 12           1           0           0          -1
> 13           1           0           0           0
> 14           1           0           0           1
> 15           1           0           0           2
> attr(,"assign")
> [1] 0 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$fac
> [1] "contr.sum"

When creating the model matrix, the various levels of a factor are first coded as indicator variables (for the creation of main effects and of interactions). The contrast matrix is then applied to remove redundant columns from the design matrix; that is, columns that are known to be redundant due to the *design* (i.e. the model formula as a whole). Depending on whether data are available for all levels of a factor (or combinations of levels, if several factors are in the model), you can still end up with a design matrix that does not have full column rank. In this case, since there is no main effect fac, the three columns that are put into the design matrix due to the scores:fac term do not create a singularity, hence the contrast matrix is not applied to these three columns.

The above description is probably a bit rough and lacking in detail (if not even wrong in some details). IMHO, the best explanation of how R goes from the model formula to the actual design matrix (for linear models) can still be found in MASS; though, if I am not mistaken, current versions of R seem to proceed slightly differently from that description in more complicated models (i.e. several factors and continuous explanatory variables).

> # But I was expecting this since I am using contr.sum
> > cbind(1, model.matrix(~ fac)[,2:3] * scores)
>      fac1 fac2
> 1  1   -2    0
> 2  1   -1    0
> 3  1    0    0
> 4  1    1    0
> 5  1    2    0
> 6  1    0   -2
> 7  1    0   -1
> 8  1    0    0
> 9  1    0    1
> 10 1    0    2
> 11 1    2    2
> 12 1    1    1
> 13 1    0    0
> 14 1   -1   -1
> 15 1   -2   -2

If you wish to have the design matrix that contains the interaction terms as if the model had the main effects too, but without the columns corresponding to the main effects, then just instruct R that you want such a matrix:

R> model.matrix(~ scores*fac)[,-(2:4)]
   (Intercept) scores:fac1 scores:fac2
1            1          -2           0
2            1          -1           0
3            1           0           0
4            1           1           0
5            1           2           0
6            1           0          -2
7            1           0          -1
8            1           0           0
9            1           0           1
10           1           0           2
11           1           2           2
12           1           1           1
13           1           0           0
14           1          -1          -1
15           1          -2          -2

Though, I am not sure why one would want to fit such a model.

> # 2.
> # here I don't get contr.sum but rather get contr.treatment
> > options(contrasts = c("contr.treatment", "contr.poly"))
> > getOption("contrasts")
> [1] "contr.treatment" "contr.poly"
> > model.matrix(lm(seq(15) ~ fac, contrasts = c("contr.sum", "contr.poly")))
>    (Intercept) fac2 fac3
> 1            1    0    0
> 2            1    0    0
> 3            1    0    0
> 4            1    0    0
> 5            1    0    0
> 6            1    1    0
> 7            1    1    0
> 8            1    1    0
> 9            1    1    0
> 10           1    1    0
> 11           1    0    1
> 12           1    0    1
> 13           1    0    1
> 14           1    0    1
> 15           1    0    1
Re: [Rd] Bugs? when dealing with contrasts
On Wed, Apr 21, 2010 at 11:38 PM, Berwin A Turlach <ber...@maths.uwa.edu.au> wrote:
> [...]
> If you wish to have the design matrix that contains the interaction
> terms as if the model had the main effects too, but without the columns
> corresponding to the main effects, then just instruct R that you want
> such a matrix:
>
> R> model.matrix(~ scores*fac)[,-(2:4)]
> [...]
>
> Though, I am not sure why one would want to fit such a model.

To save on degrees of freedom. I had reduced it to the minimum needed to illustrate, but in fact it is closer to this (except the coding is not the one needed):

options(contrasts = c("contr.sum", "contr.poly"))
tab <- as.table(matrix(1:21, 7))
dimnames(tab) = list(X = letters[1:7], Y = LETTERS[1:3])
rr <- factor(row(tab))
cc <- factor(col(tab))
scores <- rep(seq(-3, 3), 3)
model.matrix( ~ rr + cc + scores:cc)

so the main effects are rr and cc, but scores takes the place of rr in the interaction. Your description of the process seems right, since it would predict that the following gives the required coding, and it does:

model.matrix(~ scores*cc + rr)[,-2]

Thanks.