Re: [R] matching-case sensitivity
Alternatively, you could use casefold(). This would make your code more compatible with S-PLUS. For me, toupper and tolower are easier names to remember and easier to read. However, if you think that someone might want to try using your code with S-PLUS, then casefold might be the better choice. Hope this helps. Spencer Graves

Marc Schwartz wrote:
On Tue, 2003-08-26 at 15:09, Jablonsky, Nikita wrote: Hi All, I am trying to match two character arrays (email lists) using either pmatch(), match() or charmatch() functions. However the function is missing some matches due to differences in the cases of some letters between the two arrays. Is there any way to disable case sensitivity, or is there an entirely better way to match two character arrays that have identical entries but written in different case? Thanks, Nikita

At least two options for case-insensitive matching:
1. Use grep(), which has an 'ignore.case' argument that you can set to TRUE. See ?grep.
2. Use the function toupper() to convert both character vectors to all upper case. See ?toupper. Conversely, tolower() would do the opposite.

A quick solution using the second option would be:

    Vector1[toupper(Vector1) %in% toupper(Vector2)]

which would return the elements that match in both vectors. A more formal example with some data:

    Vector1 <- letters[1:10]
    Vector1
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
    Vector2 <- c(toupper(letters[5:8]), letters[9:15])
    Vector2
     [1] "E" "F" "G" "H" "i" "j" "k" "l" "m" "n" "o"
    Vector1[toupper(Vector1) %in% toupper(Vector2)]
    [1] "e" "f" "g" "h" "i" "j"

HTH, Marc Schwartz

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
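For illustration, a minimal sketch of the casefold()-based version of the same match; casefold() is simply the S-PLUS-compatible wrapper around tolower()/toupper(), and Vector1 and Vector2 are reused from Marc's example:

    ## same case-insensitive match as above, written with casefold()
    Vector1 <- letters[1:10]
    Vector2 <- c(toupper(letters[5:8]), letters[9:15])
    Vector1[casefold(Vector1, upper = TRUE) %in% casefold(Vector2, upper = TRUE)]
    ## [1] "e" "f" "g" "h" "i" "j"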
Re: [R] Package for Numerical Integral?
Did you consider integrate? hope this helps. spencer graves Yao, Minghua wrote: Dear all, Is there any package for numerically calculating an integral? Thanks in advance. -Minghua __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
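A minimal illustration of integrate() (the integrand here is just a made-up example, the standard normal density over the whole real line):

    ## numerical integration of dnorm over (-Inf, Inf); should return ~1
    integrate(dnorm, lower = -Inf, upper = Inf)
    ## 1 with absolute error < about 1e-04

See ?integrate for the full set of arguments (relative tolerance, subdivisions, and so on); contributed packages such as adapt handle multidimensional integrals.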
Re: [R] R on Linux/Opteron?
Dirk Eddelbuettel [EMAIL PROTECTED] writes: On Tue, Aug 26, 2003 at 03:17:19PM -0400, Liaw, Andy wrote: Has anyone tried using R on the AMD Opteron in either 64- or 32-bit mode? If so, any good/bad experiences, comments, etc? We are considering getting this hardware, and would like to know if R can run smoothly on such a beast. Any comment much appreciated. http://buildd.debian.org/build.php?pkg=r-base&arch=ia64&file=log has logs of R builds on ia64 since Nov 2001, incl. the outcome of make check. We do not run the torture tests -- though I guess we could on some of the beefier hardware such as ia64. I don't think that's quite the same beast, though. Opterons are the x86-64 (or amd64) architecture and ia64 is Intel's, aka Itanium. Debian appears to be just warming up to including this architecture: http://lists.debian.org/debian-x86-64/2003/debian-x86-64-200308/threads.html whereas they have had ia64 out for a while. SuSE has an Opteron option and Luke said he tried it. Apparently it has a functioning 64-bit compiler toolchain - I wasn't sure earlier whether they were just running a 64-bit kernel and 32-bit applications, but when Luke says so, I believe it... -- O__ Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] R on Linux/Opteron?
On Tue, Aug 26, 2003 at 03:17:19PM -0400, Liaw, Andy wrote: Has anyone tried using R on the AMD Opteron in either 64- or 32-bit mode? If so, any good/bad experiences, comments, etc? We are considering getting this hardware, and would like to know if R can run smoothly on such a beast. Any comment much appreciated. http://buildd.debian.org/build.php?pkg=r-base&arch=ia64&file=log has logs of R builds on ia64 since Nov 2001, incl. the outcome of make check. We do not run the torture tests -- though I guess we could on some of the beefier hardware such as ia64. Dirk -- Those are my principles, and if you don't like them... well, I have others. -- Groucho Marx __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] matching-case sensitivity
On Tue, 26 Aug 2003, Jablonsky, Nikita wrote: Hi All, I am trying to match two character arrays (email lists) using either pmatch(), match() or charmatch() functions. However the function is missing some matches due to differences in the cases of some letters between the two arrays. Is there any way to disable case sensitivity or is there an entirely better way to match two character arrays that have identical entries but written in different case? You could use tolower() or toupper() to remove case differences. -thomas __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] matching-case sensitivity
On Tue, 2003-08-26 at 15:09, Jablonsky, Nikita wrote: Hi All, I am trying to match two character arrays (email lists) using either pmatch(), match() or charmatch() functions. However the function is missing some matches due to differences in the cases of some letters between the two arrays. Is there any way to disable case sensitivity, or is there an entirely better way to match two character arrays that have identical entries but written in different case? Thanks, Nikita

At least two options for case-insensitive matching:
1. Use grep(), which has an 'ignore.case' argument that you can set to TRUE. See ?grep.
2. Use the function toupper() to convert both character vectors to all upper case. See ?toupper. Conversely, tolower() would do the opposite.

A quick solution using the second option would be:

    Vector1[toupper(Vector1) %in% toupper(Vector2)]

which would return the elements that match in both vectors. A more formal example with some data:

    Vector1 <- letters[1:10]
    Vector1
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
    Vector2 <- c(toupper(letters[5:8]), letters[9:15])
    Vector2
     [1] "E" "F" "G" "H" "i" "j" "k" "l" "m" "n" "o"
    Vector1[toupper(Vector1) %in% toupper(Vector2)]
    [1] "e" "f" "g" "h" "i" "j"

HTH, Marc Schwartz __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
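A rough sketch of option 1 as well, using grep() with ignore.case = TRUE. The two vectors are made-up email addresses for illustration; note that grep() does regular-expression matching (so "." and similar characters are wildcards), which is why the toupper()/%in% approach above is usually the simpler choice for exact matches:

    ## which elements of Vector1 occur, case-insensitively, somewhere in Vector2
    Vector1 <- c("Alice@example.com", "bob@example.com")
    Vector2 <- c("ALICE@EXAMPLE.COM", "carol@example.com")
    Vector1[sapply(Vector1,
                   function(p) length(grep(p, Vector2, ignore.case = TRUE)) > 0)]
    ## [1] "Alice@example.com"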
[R] GWplot
For anyone wishing to see R-Tcl/Tk-MySQL in action (Windows XP)... http://moffett.isis.ucla.edu/gwplot/ Examples were by far the most useful learning tool during my programming endeavors so I hope this may help in your own projects. Thanks, Jason __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Seeking Packaging advice
I have two questions about packaging up code.

1) Weave/tangle advisable? In the course of extending some C code already in S, I had to work out the underlying math. It seems to me useful to keep this information with the code, using Knuth's tangle/weave type tools. I know there is some support for this in R code, but my question is about the wisdom of doing this with C (or Fortran, or other source) code. Against the advantage of having the documentation and code nicely integrated are the drawbacks of added complexity in the build process and portability concerns. Some of this is mitigated by the existing dependence on TeX. An intermediate approach would be to provide both the web (in the Knuth sense) source and the C output; the latter could be used directly by those not wishing to hassle with web. This isn't ideal, since the resulting C is likely to be a bit cryptic, and if someone edits the C without changing the web source confusion will reign. So do people have any thoughts about whether introducing this is a step forward or back?

2) Modifications of existing packages. I modified the survival package (I'm not sure if that's properly called a base package, but it's close). I know in this particular case, if I'm serious, I probably should contact the package maintainer. But this kind of operation will probably be pretty common for me; I imagine many on this list have already done it. In general, is the best thing to do
a) package the new routines as a small additional package, with a dependence on the base package if necessary (the particular change I've made actually produces a few distinct files, slight tweaks of existing ones, that can stand on their own)
b) package the new things in with the old under the same name as the old (obviously requires working with the package maintainer)
c) package the new things with the old and give it a new name.
I'm also curious about what development strategy is best; I did b), and it seemed to work OK. But I kept expecting it to cause disaster (it probably helped that I usually didn't load the baseline survival packages; clearly that wouldn't be an option if working with one of the automatically loaded packages). Thanks. -- Ross Boylan wk: (415) 502-4031 530 Parnassus Avenue (Library) rm 115-4 [EMAIL PROTECTED] Dept of Epidemiology and Biostatistics fax: (415) 476-9856 University of California, San Francisco San Francisco, CA 94143-0840 hm: (415) 550-1062 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Generating routine for Poisson random numbers
You can generate Poisson random numbers from a Poisson process like this:

    rfishy <- function(lambda) {
        t <- 0
        i <- -1
        while (t <= lambda) {
            t <- t - log(runif(1))
            i <- i + 1
        }
        return(i)
    }

This is a nice compact algorithm for generating Poisson random numbers. It appears to work. I am used to seeing a Poisson counting process implemented using a num_frames parameter, a success probability per frame, and counting the uniform numbers in the 0 to 1 range that fall below the success probability over num_frames. It is interesting to see an implementation that bypasses having to use a num_frames parameter, using just the lambda value, and relies instead on incrementing a t value on each iteration and counting the number of times you can do so (while t <= lambda).

The name of this generator is descriptive, not a pun. It is very slow for large lambda, and incorrect for extremely large lambda (and possibly for extremely small lambda). If you only wanted a fairly small number of random variates with, say, 1e-6 < lambda < 100, then it's not too bad.

One could impose a lambda range check so that you can only invoke the function using a lambda range where the Poisson RNG is expected to be reasonably accurate. The range you are giving is probably the most commonly used range where a Poisson random number generator might be used? Brian Ripley also mentioned that the counting-process-based implementation would not work well for large lambdas. Do you encounter such large lambdas in practice? Can't you always, in theory, avoid such large lambdas by changing the size of the time interval you want to consider?

But why would anyone *want* to code their own Poisson random number generator, except perhaps as an interesting student exercise?

Yes, this is meant as an interesting exercise for someone who wants to understand how to implement probability distributions in an object-oriented way (I am writing an article introducing people to probability modelling). I am looking for a compact algorithm that I can easily explain to people and which will be a good enough rpois() approximation in many cases. I don't want to be blown out of the water for suggesting such an algorithm to represent a Poisson RNG, so if you think it is inappropriate to learn how a Poisson RNG works using the above-described generating process, then I would be interested in your views. Thank you for your thoughts on this matter. Regards, Paul Meagher

-thomas __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
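A quick informal check of the generator above, comparing the sample mean and variance of many draws against lambda (both should be close to lambda for a Poisson variable); the particular lambda and number of draws are arbitrary choices for illustration:

    ## crude sanity check of rfishy() against the theoretical moments
    set.seed(1)
    lambda <- 4
    x <- sapply(1:10000, function(i) rfishy(lambda))
    mean(x)   # should be near 4
    var(x)    # should also be near 4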
Re: [R] Viewing function source
On Tue, 26 Aug 2003 [EMAIL PROTECTED] wrote: Thomas Lumley wrote: The name of this generator is descriptive, not a pun. It is very slow for large lambda, and incorrect for extremely large lambda (and possibly for extremely small lambda). Not sure why it should be incorrect. It's certainly not theoretically incorrect. Rounding errors? Problem with runif()? Yes. Both. But it should work as long as lambda is much smaller than 2^53 and much larger than 2^-32 and much smaller than the period of whatever generator you are using, so it's going to be too slow before it's inaccurate. If you only wanted a fairly small number of random variates with, say, 1e-6 < lambda < 100, then it's not too bad. In the above vectorised form, if it's fast for n=1 it stays pretty fast for large n: try it with z <- rfishy(1,5); even rfishy(10,5) only takes a few seconds. If you were going to implement in another language (which I thought was the point) then vectorising won't help. In R, yes. -thomas __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Re: Generating routine for Poisson random numbers
One could impose a lambda range check so that you can only invoke the function using a lambda range where the Poisson RNG is expected to be reasonably accurate. The range you are giving is probably the most commonly used range where a Poisson random number generator might be used? Brian Ripley also mentioned that the counting process based implementation would not work well for large lambdas. Do you encounter such large lambdas in practice? Can't you always, in theory, avoid such large lambdas by changing the size of the time interval you want to consider? Personally, I'd probably change the question by approximating by a Normal for large lambda and a Bernoulli for very small lambda. The algorithm gets slow well before it gets inaccurate, though. But why would anyone *want* to code their own Poisson random number generator, except perhaps as an interesting student exercise? Yes, this is meant as an interesting exercise for someone who wants to understand how to implement probability distributions in an object-oriented way (I am writing an article introducing people to probability modelling). I am looking for a compact algorithm that I can easily explain to people and which will be a good enough rpois() approximation in many cases. I don't want to be blown out of the water for suggesting such an algorithm to represent a Poisson RNG, so if you think it is inappropriate to learn how a Poisson RNG works using the above-described generating process, then I would be interested in your views. No, that's why I gave that as the exception. There are lots of things worth doing as a learning exercise that aren't worth doing otherwise. I do think that in an article you should also point out to people that there is a lot of numerical code available out there, written by people who know a lot more than we do about what they are doing. It's often easier than writing your own code and the results are better. One advantage of an object-oriented approach is that you can just rip out your implementation and slot in a new one if it is better. -thomas __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
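A rough sketch of the hybrid idea mentioned above (this is an illustration only, not the method rpois() actually uses): fall back on a rounded Normal for large lambda and on a Bernoulli for tiny lambda, and use the counting-process generator in between. The cut-offs 1e-4 and 500 are arbitrary values chosen for illustration, and rfishy() is the exponential-waiting-time generator defined earlier in the thread:

    rpois.approx <- function(lambda) {
      if (lambda < 1e-4) {
        ## P(X >= 2) is negligible, so a Bernoulli(lambda) draw is close enough
        as.integer(runif(1) < lambda)
      } else if (lambda > 500) {
        ## Normal approximation, truncated at zero and rounded to an integer
        max(0, round(rnorm(1, mean = lambda, sd = sqrt(lambda))))
      } else {
        rfishy(lambda)
      }
    }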
Re: [R] R tools for large files
Duncan Murdoch [EMAIL PROTECTED] wrote: For example, if you want to read lines 1000 through 1100, you'd do it like this:

    lines <- readLines("foo.txt", 1100)[1000:1100]

I created a dataset thus:

    # file foo.awk:
    BEGIN {
        s = "01"
        for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
        n = (27 * 1024 * 1024) / (length(s) + 1)
        for (i = 1; i <= n; i++) print s
        exit 0
    }
    # shell command:  mawk -f foo.awk </dev/null >BIG

That is, each record contains 41 2-digit integers, and the number of records was chosen so that the total size was approximately 27 megabytes. The number of records turns out to be 230,175.

    system.time(v <- readLines("BIG"))
    [1] 7.75 0.17 8.13 0.00 0.00
    # With BIG already in the file system cache...
    system.time(v <- readLines("BIG", 200000)[199001:200000])
    [1] 11.73 0.16 12.27 0.00 0.00

What's the importance of this? First, experiments I shall not weary you with showed that the time to read N lines grows faster than N. Second, if you want to select the _last_ thousand lines, you have to read _all_ of them into memory.

For real efficiency here, what's wanted is a variant of readLines where n is an index vector (a vector of non-negative integers, a vector of non-positive integers, or a vector of logicals) saying which lines should be kept. The function that would need changing is do_readLines() in src/main/connections.c; unfortunately I don't understand R internals well enough to do it myself (yet). As a matter of fact, that _still_ wouldn't yield real efficiency, because every character would still have to be read by the modified readLines(), and it reads characters using Rconn_fgetc(), which is what gives readLines() its power and utility, but certainly doesn't give it wings. (One of the fundamental laws of efficient I/O library design is to base it on block- or line-at-a-time transfers, not character-at-a-time.)

The AWK program

    NR <= 199000 { next }
    { print }
    NR == 200000 { exit }

extracts lines 199001:200000 in just 0.76 seconds, about 15 times faster. A C program to the same effect, using fgets(), took 0.39 seconds, or about 30 times faster than R. There are two fairly clear sources of overhead in the R code:
(1) the overhead of reading characters one at a time through Rconn_fgetc() instead of a block or line at a time. mawk doesn't use fgets() for reading, and _does_ have the overhead of repeatedly checking a regular expression to determine where the end of the line is, which it is sensible enough to fast-path.
(2) the overhead of allocating, filling in, and keeping, a whole lot of memory which is of no use whatever in computing the final result. mawk is actually fairly careful here, and only keeps one line at a time in the program shown above. Let's change it:

    NR <= 199000 { next }
    { a[NR] = $0 }
    NR == 200000 { exit }
    END { for (i in a) print a[i] }

That takes the time from 0.76 seconds to 0.80 seconds.

The simplest thing that could possibly work would be to add a function skipLines(con, n) which simply read and discarded n lines.

    result <- scan(textConnection(lines), list( ))
    system.time(m <- scan(textConnection(v), integer(41)))
    Read 41000 items
    [1] 0.99 0.00 1.01 0.00 0.00

One whole second to read 41,000 numbers on a 500 MHz machine?

    vv <- rep(v, 240)

Is there any possibility of storing the data in (platform) binary form? Binary connections (R-data.pdf, section 6.5 "Binary connections") can be used to read binary-encoded data. I wrote a little C program to save out the 230175 records of 41 integers each in native binary form.
Then in R I did

    system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
    [1] 0.57 0.52 1.11 0.00 0.00
    system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
    [1] 2.55 0.34 2.95 0.00 0.00

Remember, this doesn't read a *sample* of the data, it reads *all* the data. It is so much faster than the alternatives in R that it just isn't funny. Trying scan() on the file took nearly 10 minutes before I killed it the other day; using readBin() is a thousand times faster than a simple scan() call on this particular data set. There has *got* to be a way of either generating or saving the data in binary form, using only approved Windows tools. Heck, it can probably be done using VBA. By the way, I've read most of the .pdf files I could find on the CRAN site, but haven't noticed any description of the R save-file format. Where should I have looked? (Yes, I know about src/main/saveload.c; I was hoping for some documentation, with maybe some diagrams.) __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
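As a sketch of the "saving the data in binary form" step done from R itself rather than a separate C program: writeBin() can produce a file that readBin() reads back exactly as in the timings above. Here 'm' is assumed to be the 230175 x 41 integer matrix already in memory, and "BIG.bin" is just an illustrative file name:

    ## write the matrix row-major, matching the byrow = TRUE read below
    con <- file("BIG.bin", "wb")
    writeBin(as.integer(t(m)), con, size = 4)
    close(con)

    ## read it back the same way as in the timings above
    con <- file("BIG.bin", "rb")
    m2 <- matrix(readBin(con, integer(), n = 230175 * 41, size = 4),
                 ncol = 41, byrow = TRUE)
    close(con)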
[R] Re: Generating routine for Poisson random numbers
I do think that in an article you should also point out to people that there is a lot of numerical code available out there, written by people who know a lot more than we do about what they are doing. It's often easier than writing your own code and the results are better. One advantage of an object-oriented approach is that you can just rip out your implementation and slot in a new one if it is better.

Exactly. That is another reason I do not feel the need to implement the RNG algorithm perfectly. If someone really wants a more foolproof rpois-like algorithm with better running-time characteristics they can reimplement the method using the rpois.c normal-deviates approach. BTW, here is what my final Poisson RNG method looks like coded in PHP - it is modelled after the JSci library approach. I am only showing the constructor and the RNG method:

    class PoissonDistribution extends ProbabilityDistribution {

      var $lambda;

      function PoissonDistribution($lambda = 1) {
        if ($lambda <= 0.0) {
          die("Lambda parameter should be positive.");
        }
        $this->lambda = $lambda;
      }

      function RNG($num_vals = 1) {
        if ($num_vals < 1) {
          die("Number of random numbers to return must be 1 or greater");
        }
        for ($i = 0; $i < $num_vals; $i++) {
          $temp = 0;
          $count = -1;
          while ($temp <= $this->lambda) {
            $rand_val = mt_rand() / mt_getrandmax();
            $temp = $temp - log($rand_val);
            $count++;
          }
          // record count value(s)
          if ($num_vals == 1) {
            $counts = $count;
          } else {
            $counts[$i] = $count;
          }
        }
        return $counts;
      }
    }

My simple eyeball tests indicate that the algorithm appears to generate unbiased estimates of the expected mean and variance given lambdas in the range of .02 to 900. I guess to confirm the unbiasedness I would need to generate a bunch of sample estimates of the mean and variance from my Poisson random number sequences, plot the relative frequency of these estimates, and see if the central tendency of the estimates corresponds to the mean and variance expected theoretically for a Poisson random variable (i.e., mean and variance = lambda). I see why the performance characteristics get bad when lambda is big - the counting process involves more iterations. Most of the textbook examples never use a lambda this big; often lambda is less than 100 and often not less than .02 or so. In other words, the typical parameter space for the algorithm may be such that areas where it breaks down are not that common in practice. I think this will be a perfectly acceptable RNG for a Poisson random variable provided you don't use unusually large or small lambda values - if I knew the breakdown range, I could implement a range check to disallow usage of the function for that range. Not sure yet exactly what characteristics of the algorithm would lead it to behave incorrectly at extremely small or large lambda values? BTW, is this simple method of generating a Poisson random number discussed in detail in any other books or papers that I might consult? Regards, Paul Meagher __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
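One way to go beyond eyeballing the mean and variance is to compare the observed frequencies of the generated counts with the theoretical dpois() probabilities. A rough sketch in R, using the rfishy() generator from earlier in the thread as a stand-in for the PHP RNG() method; the lambda, sample size, and tail cut-off are arbitrary illustration choices:

    set.seed(42)
    lambda <- 5
    x <- sapply(1:5000, function(i) rfishy(lambda))
    x.cut <- pmin(x, 15)                       # lump counts of 15 or more into one cell
    obs <- table(factor(x.cut, levels = 0:15))
    exp.p <- c(dpois(0:14, lambda),
               ppois(14, lambda, lower.tail = FALSE))   # P(X >= 15)
    chisq.test(obs, p = exp.p)                 # no evidence of lack of fit is reassuring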
[R] How to do leave-n-out cross validation in R?
Seems crossval from library(bootstrap) can only be used for leave-one-out and k-fold cross validation? Here is a dumb question, suppose n=80, how to do exactly leave-50-out cross validation? K-fold cross validation is not eligible for this case since n/ngroup is not an integer. Thanks! __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
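One common reading of "leave-50-out" is repeated random splits (sometimes called Monte Carlo cross-validation): hold out a random 50 of the n = 80 cases, fit on the remaining 30, and average the prediction error over many repetitions. A hand-rolled sketch under that interpretation; fit.fun() and pred.fun() are placeholders for whatever model is being validated, and squared error is just one possible loss:

    mc.cv <- function(x, y, fit.fun, pred.fun, n.out = 50, reps = 200) {
      n <- length(y)
      errs <- numeric(reps)
      for (r in 1:reps) {
        out <- sample(n, n.out)               # indices left out this round
        fit <- fit.fun(x[-out, , drop = FALSE], y[-out])
        pred <- pred.fun(fit, x[out, , drop = FALSE])
        errs[r] <- mean((y[out] - pred)^2)    # mean squared prediction error
      }
      mean(errs)
    }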
RE: [R] R tools for large files
If we are going to use unix tools to create a new dataset before calling into R, why not simply use

    cat my_big_bad_file | tail +1001 | head -100

to read lines 1000-1100 (assuming one header row). Or if you have the shortlisted rownames in one file, you can use join after sort. A working example follows.

    #!/bin/bash
    # match.sh  last modified 10/07/03
    # Does the same thing as egrep 'a|b|c|...' file but in batch mode
    # A script that matches all occurrences of shortlist in data using the
    # first column as common key
    if [ $# -ne 2 ]; then
        echo "Usage: ${0/*\//} shortlist data"
        exit
    fi
    TEMP1=/tmp/temp1.`date +%y%m%d-%H%M%S`
    TEMP2=/tmp/temp2.`date +%y%m%d-%H%M%S`
    TEMP3=/tmp/temp3.`date +%y%m%d-%H%M%S`
    TEMP4=/tmp/temp4.`date +%y%m%d-%H%M%S`
    TEMP5=/tmp/temp5.`date +%y%m%d-%H%M%S`

    grep -n . $1 | cut -f1 -d: | paste - $1 > $TEMP1
    sort -k 2 $TEMP1 > $TEMP2
    tail +2 $2 | sort -k 1 > $TEMP3       # assume data file has header
    headerRow=`head -1 $2`
    join -j1 2 -j2 1 -a 1 -t$'\t' $TEMP2 $TEMP3 > $TEMP4
    sort -n -k 2 $TEMP4 > $TEMP5
    /bin/echo "$headerRow"
    cut -f1,3- $TEMP5                     # column 2 contains orderings
    rm $TEMP1 $TEMP2 $TEMP3 $TEMP4
Re: [R] R tools for large files
I'm bored, but just to point out the obvious fact: to skip n lines in a text file you have to read *all* the characters in between to find the line separators. I have known for 30 years that reading text files of numbers is slow and inefficient. So do it only once and dump the results to a binary format, or an RDBMS, or ...
Re: [R] How to do leave-n-out cross validation in R?
On Tue, 26 Aug 2003, Lily wrote: Seems crossval from library(bootstrap) can only be used for leave-one-out and k-fold cross validation? Here is a dumb question, suppose n=80, how to do exactly leave-50-out cross validation? K-fold cross validation is not eligible for this case since n/ngroup is not an integer. Thanks! First, _you_ have to say exactly what _you_ mean by leave-n-out CV. If you can specify the algorithm, you can program it in R (or we may be able to help). As I have never encountered this, I don't know the definition, nor do I see the point. I suspect it is not really cross-validation at all (the term is widely misused in the machine-learning/neural nets communities to mean the use of a validation set). -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Seeking Packaging advice
On Tue, 26 Aug 2003, Ross Boylan wrote: I have two questions about packaging up code. 1) Weave/tangle advisable? In the course of extending some C code already in S, I had to work out the underlying math. It seems to me useful to keep this information with the code, using Knuth's tangle/weave type tools. I know there is some support for this in R code, but my question is about the wisdom of doing this with C (or Fortran, or other source) code. Against the advantage of having the documentation and code nicely integrated are the drawbacks of added complexity in the build process and portability concerns. Some of this is mitigated by the existing dependence on TeX. There is none. We don't assume a working latex/tex, although some manuals will not be produced without working (pdf)latex (or texinfo-pdf). One quick comment: the pre-compiled packages (for Windows now and MacOS X for the next release) are produced automatically without user intervention. So if you want to have a package on CRAN, it needs to work out of the box, and there is no dependence on TeX, let alone weave/tangle, in the standard procedure. An intermediate approach would be to provide both the web (in the Knuth sense) source and the C output; the latter could be used directly by those not wishing to hassle with web. This isn't ideal, since the resulting C is likely to be a bit cryptic, and if someone edits the C without changing the web source confusion will reign. So do people have any thoughts about whether introducing this is a step forward or back? A useful analogue: we now distribute Fortran code not the original Ratfor. 2) Modifications of existing packages. I modified the survival package (I'm not sure if that's properly called a base package, but it's close). I know in this particular case, if It's a `recommended' package, as the DESCRIPTION file says. There is a base package, and several standard packages bundled with R, which have priority base and are often called `base packages'. I'm serious, I probably should contact the package maintainer. But this kind of operation will probably be pretty common for me; I imagine many on this list have already done it. In general, is the best thing to do a) package the new routines as a small additional package, with a dependence on the base package if necessary (the particular change I've made actually produces a few distinct files, slight tweaks of existing ones, that can stand on their own) b) package the new things in with the old under the same name as the old (obviously requires working with the package maintainer) c) package the new things with the old and give it a new name. I'm also curious about what development strategy is best; I did b), and it seemed to work OK. But I kept expecting it to cause disaster (it probably helped that I usually didn't load the baseline survival packages; clearly that wouldn't be an option if working with one of the automatically loaded packages). I think a) is the best, including changing the names of any R functions you alter, and changing the entry points in any compiled code you alter. Package maintainers may have very good reasons not to go along with b), including their not being the original authors (true for survival), workload, lack of interest in the proposed changes, complications of ownership and copyright. c) is I believe unwise.
It may be allowed by the licence (or may not) but in the couple of cases where I have seen it done it did not give anything like adequate credit to the original authors (who were never consulted) and the modified code distributed was out-of-date when originally released, let alone now. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] selecting by variable
Hi, I'm a recent R convert so I haven't quite figured out the details yet... How do I select one variable by another one? Ie if I want to draw the histogram of variable X only for those individuals that also have a value Y in a certain range? In STATA I would give something like: histogram X if ((Y>=A) & (Y<=B)) (The data is for individuals and each individual has a number of characteristics including X and Y). thanks, eugene. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
RE: [R] selecting by variable
Eugene, R allows indexing with logical vectors, so your example would look like

    hist(X[(Y >= A) & (Y <= B)])

See the manual "An Introduction to R" for details. HTH, Thomas --- Thomas Hotz Research Associate in Medical Statistics University of Leicester United Kingdom Department of Epidemiology and Public Health 22-28 Princess Road West Leicester LE1 6TP Tel +44 116 252-5410 Fax +44 116 252-5423 Division of Medicine for the Elderly Department of Medicine The Glenfield Hospital Leicester LE3 9QP Tel +44 116 256-3643 Fax +44 116 232-2976 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
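A small self-contained version of the same idea, using a made-up data frame 'd' with columns X and Y like the ones in the original question (A and B are arbitrary cut-offs for illustration):

    d <- data.frame(X = rnorm(200), Y = runif(200, 0, 10))
    A <- 2; B <- 7
    hist(d$X[d$Y >= A & d$Y <= B])

    ## subset() is another way to express the same restriction
    hist(subset(d, Y >= A & Y <= B)$X)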
Re: [R] matching-case sensitivity
Hallo On 26 Aug 2003 at 13:09, Jablonsky, Nikita wrote: Hi All, I am trying to match two character arrays (email lists) using either pmatch(), match() or charmatch() functions. However the function is missing some matches due to differences in the cases of some letters

try toupper or tolower

    ttt <- toupper("differences in the cases")
    ttt
    [1] "DIFFERENCES IN THE CASES"
    tolower(ttt)
    [1] "differences in the cases"

between the two arrays. Is there any way to disable case sensitivity or is there an entirely better way to match two character arrays that have identical entries but written in different case? Thanks Nikita

Petr Pikal [EMAIL PROTECTED] __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] selecting by variable
Eugene Salinas [EMAIL PROTECTED] wrote: How do I select one variable by another one? Ie if I want to draw the histogram of variable X only for those individuals that also have a value Y in a certain range? In STATA I would give something like: histogram X if ((Y>=A) & (Y<=B))

    hist(x[Y >= A & Y <= B])

See: ?Subscript -- Philippe Glaziou __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] selecting by variable
On Wed, 27 Aug 2003, Eugene Salinas wrote: I'm a recent R convert so I haven't quite figured out the details yet... Usually it is good to read the manuals when you use unfamiliar software... How do I select one variable by another one? Ie if I want to draw the histogram of variable X only for those individuals that also have a value Y in a certain range? e.g.

    x = rnorm(100)
    y = 1:100
    x[y >= 20 & y <= 50]

will give you the value of x when y is between 20 and 50. To do a histogram, type: ?hist -- Cheers, Kevin -- On two occasions, I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question. -- Charles Babbage (1791-1871) From Computer Stupidities: http://rinkworks.com/stupid/ -- Ko-Kang Kevin Wang Master of Science (MSc) Student SLC Tutor and Lab Demonstrator Department of Statistics University of Auckland New Zealand Homepage: http://www.stat.auckland.ac.nz/~kwan022 Ph: 373-7599 x88475 (City) x88480 (Tamaki) __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] selecting by variable
hist(X[Y >= A & Y <= B])

`An Introduction to R' explains such things, as do (in more detail) the introductory texts (see the R FAQ). On Wed, 27 Aug 2003, Eugene Salinas wrote: I'm a recent R convert so I haven't quite figured out the details yet... How do I select one variable by another one? Ie if I want to draw the histogram of variable X only for those individuals that also have a value Y in a certain range? In STATA I would give something like: histogram X if ((Y>=A) & (Y<=B)) (The data is for individuals and each individual has a number of characteristics including X and Y). -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] GWplot
Jason == Jason Fisher [EMAIL PROTECTED] on Tue, 26 Aug 2003 14:08:26 -0700 writes: Jason For anyone wishing to see R-Tcl/Tk-MySQL in action (Windows XP)... Jason http://moffett.isis.ucla.edu/gwplot/ Jason Examples were by far the most useful learning tool Jason during my programming endeavors so I hope this may Jason help in your own projects. Thank you, Jason. I like this spirit of sharing! This looks very interesting for a project we will start here in a few weeks. From reading the above web page, it's not clear why you say The current version of GWplot is designed for Windows 2000/XP when all the tools you say you are using are typically part of every Linux distribution (and also available probably for every platform R runs apart from classic MacOS). What problems do you see using this outside of Win-Xp? Regards, Martin Maechler [EMAIL PROTECTED] http://stat.ethz.ch/~maechler/ Seminar fuer Statistik, ETH-Zentrum LEO C16Leonhardstr. 27 ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND phone: x-41-1-632-3408 fax: ...-1228 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] R tools for large files
On Wed, 27 Aug 2003 13:03:39 +1200 (NZST), you wrote: For real efficiency here, what's wanted is a variant of readLines where n is an index vector (a vector of non-negative integers, a vector of non-positive integers, or a vector of logicals) saying which lines should be kept. I think that's too esoteric to be worth doing. Most often in cases where you aren't reading every line, you don't know which lines to read until you've read earlier ones. There are two fairly clear sources of overhead in the R code: (1) the overhead of reading characters one at a time through Rconn_fgetc() instead of a block or line at a time. mawk doesn't use fgets() for reading, and _does_ have the overhead of repeatedly checking a regular expression to determine where the end of the line is, which it is sensible enough to fast-path. One complication with reading a block at a time is what to do when you read too far. Not all connections can use seek() to reposition to the beginning, so you'd need to read them one character at a time (or attach a buffer somehow, but then what about rw connections?). The simplest thing that could possibly work would be to add a function skipLines(con, n) which simply read and discarded n lines. result <- scan(textConnection(lines), list( )) That's probably worth doing. Duncan Murdoch __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
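A user-level sketch of the skipLines() idea discussed above, built on readLines() itself (so it gains no speed over the internal approach, it just tidies the pattern of reading and discarding n lines from an open connection); the file name "BIG" and line numbers are those from O'Keefe's example:

    skipLines <- function(con, n, chunk = 10000) {
      while (n > 0) {
        got <- length(readLines(con, n = min(n, chunk)))
        if (got == 0) break                 # hit end of file early
        n <- n - got
      }
      invisible(NULL)
    }

    ## e.g. lines 199001:200000 of the file "BIG":
    con <- file("BIG", "r")
    skipLines(con, 199000)
    v <- readLines(con, n = 1000)
    close(con)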
[R] seeking help with with()
I tried to define a function like:

    fnx <- function(x, by.vars=Month) print(by(x, by.vars, summary))

But this doesn't work (does not find x$Month; unlike other functions, such as subset(), the INDICES argument to by does not look for variables in dataset x. Is fully documented, but I forget every time). So I tried using with:

    fnxx <- function(x, by.vars=Month) print(with(x, by(x, by.vars, summary)))

Still fails to find object x$Month. I DO have a working solution (below) - this post is just to ask: Can anyone explain what happened to the with()? FYI solutions are to call like this: fnx(airquality, airquality$Month) but this will not work generically - e.g. in my real application the dataset gets subsetted and by.vars needs to refer to the subsets. So redefine like this:

    fny <- function(x, by.vars=Month) {
      attach(x)
      print(by(x, by.vars, summary))
      detach(x)
    }

Simon Fear Senior Statistician Syne qua non Ltd Tel: +44 (0) 1379 69 Fax: +44 (0) 1379 65 email: [EMAIL PROTECTED] web: http://www.synequanon.com Number of attachments included with this message: 0 This message (and any associated files) is confidential and\...{{dropped}} __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] seeking help with with()
On Wed, 27 Aug 2003, Simon Fear wrote: I tried to define a function like: fnx <- function(x, by.vars=Month) print(by(x, by.vars, summary)) But this doesn't work (does not find x$Month; unlike other functions, such as subset(), the INDICES argument to by does not look for variables in dataset x. Is fully documented, but I forget every time). So I tried using with: fnxx <- function(x, by.vars=Month) print(with(x, by(x, by.vars, summary))) Still fails to find object x$Month. That's not the actual error message, is it? I DO have a working solution (below) - this post is just to ask: Can anyone explain what happened to the with()? Nothing! by.vars is a variable passed to fnxx, so despite lazy evaluation, it is going to be evaluated in the environment calling fnxx(). If that fails to find it, it looks for the default value, and evaluates that in the environment of the body of fnxx. It didn't really get as far as with. (I often forget where default args are evaluated, but I believe that is correct in R as well as in S.) I think you intended Month to be a name and not a variable. With

    X <- data.frame(z=rnorm(20), Month=factor(rep(1:2, each=10)))

    fnx <- function(x, by.vars="Month") print(by(x, x[by.vars], summary))

will work, as will

    fnx <- function(x, by.vars=Month) print(by(x, x[deparse(substitute(by.vars))], summary))

-- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] seeking help with with()
Simon Fear [EMAIL PROTECTED] writes: I tried to define a function like: fnx <- function(x, by.vars=Month) print(by(x, by.vars, summary)) But this doesn't work (does not find x$Month; unlike other functions, such as subset(), the INDICES argument to by does not look for variables in dataset x. Is fully documented, but I forget every time). So I tried using with: fnxx <- function(x, by.vars=Month) print(with(x, by(x, by.vars, summary))) Still fails to find object x$Month. I DO have a working solution (below) - this post is just to ask: Can anyone explain what happened to the with()? Nothing, but by.vars is evaluated in the function frame where it is not defined. I think you're looking for something like

    function(x, by.vars) {
      if (missing(by.vars)) by.vars <- as.name("Month")
      print(eval.parent(substitute(with(x, by(x, by.vars, summary)))))
    }

(Defining the default arg requires a bit of sneakiness...) -- O__ Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
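A usage sketch of the function above, assuming it has been assigned the name fnxx; airquality is the built-in data set mentioned earlier in the thread, and the subset() call illustrates that the substitute()/eval.parent() trick also copes with a data set that is subsetted at the point of the call:

    fnxx <- function(x, by.vars) {
      if (missing(by.vars)) by.vars <- as.name("Month")
      print(eval.parent(substitute(with(x, by(x, by.vars, summary)))))
    }

    fnxx(airquality)                       # default: summaries grouped by Month
    fnxx(subset(airquality, Temp > 80))    # works on a subsetted data frame too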
RE: [R] seeking help with with()
Thank you so much for that fix (to my understanding). I would be willing to add such an example to the help page for future releases - though I'm sure others would do it better - there are currently no examples where INDICES is a name. In fact in my real application it is more or less essential that INDICES is a name, or at least deparse(substitute())'d as a subscript; in a slight elaboration of my previous fix

    fnz <- function(dframe, by.vars=treat)
      for (pop in 1:2) {
        dframe.pop <- subset(dframe, ITT==pop)
        attach(dframe.pop)
        print(by(dframe.pop, by.vars, summary))
        detach(dframe.pop)
      }

the second call (when pop=2) to by() will crash because by.vars is not re-evaluated afresh - it retains its value from the first loop. So, my fix was wrong and I am happy to stand corrected.

Simon Fear Senior Statistician Syne qua non Ltd Tel: +44 (0) 1379 69 Fax: +44 (0) 1379 65 email: [EMAIL PROTECTED] web: http://www.synequanon.com Number of attachments included with this message: 0 This message (and any associated files) is confidential and\...{{dropped}} __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] how to calculate Rsquare
Can anybody send these articles for me? NagelKerke, N. J. D. (1991) A note on a general definition of the coefficient of determination, Biometrika 78: 691-2. Cox, D. R. and Wermuth, N. (1992) A comment on the coefficient of determination for binary responses, The American Statistician 46: 1-4. Thanks Ronaldo -- Of __course it's the murder weapon. Who would frame someone with a fake? -- | // | \\ [***] | ( õ õ ) [Ronaldo Reis Júnior] | V [UFV/DBA-Entomologia] |/ \ [36571-000 Viçosa - MG ] | /(.''`.)\ [Fone: 31-3899-2532 ] | /(: :' :)\ [EMAIL PROTECTED]] |/ (`. `'` ) \[ICQ#: 5692561 | LinuxUser#: 205366 ] |( `- ) [***] | _/ \_Powered by GNU/Debian Woody/Sarge __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Seeking Packaging advice
On Wed, 27 Aug 2003, Prof Brian Ripley wrote: On Tue, 26 Aug 2003, Ross Boylan wrote: So do people have any thoughts about whether introducing this is a step forward or back? A useful analogue: we now distribute Fortran code not the original Ratfor. As a footnote to Brian's comment, I would just say that those hardy few of us who still write ratfor can and do include it in a subdirectory under src since it tends to be vastly more readable than its automatically produced fortran translation. But we have also learned from hard experience that one can't always rely on the ratfor preprocessing that is provided by systems even when it exists. url:www.econ.uiuc.edu/~roger/my.htmlRoger Koenker email [EMAIL PROTECTED] Department of Economics vox:217-333-4558University of Illinois fax:217-244-6678Champaign, IL 61820 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Seeking Packaging advice
On Tue, 26 Aug 2003, Ross Boylan wrote: 2) Modifications of existing packages. I modified the survival package (I'm not sure if that's properly called a base package, but it's close). I know in this particular case, if I'm serious, I probably should contact the package maintainer. But this kind of operation will probably be pretty common for me; I imagine many on this list have already done it. In general, is the best thing to do a) package the new routines as a small additional package, with a dependence on the base package if necessary (the particular change I've made actually produces a few distinct files, slight tweaks of existing ones, that can stand on their own) I think that's best b) package the new things in with the old under the same name as the old (obviously requires working with package maintainter) The problem in this case is that the package maintainer is not the author. Additional functionality might well be ok, but that could easily be done with method (a). Substantial changes to existing functions are going cause problems when the next few thousand lines of diffs arrive from Mayo Clinic. c) package the new things with the old and give it a new name. Keeping this in sync is hard. -thomas __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
RE: [R] how to calculate Rsquare
I think you've badly misinterpreted the purpose of the R listserv with this request: https://www.stat.math.ethz.ch/mailman/listinfo/r-help says The `main' R mailing list, for announcements about the development of R and the availability of new code, questions and answers about problems and solutions using R, enhancements and patches to the source code and documentation of R, comparison and compatibility with S and S-plus, and for the posting of nice examples and benchmarks. -Original Message- From: Ronaldo Reis Jr. [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 27, 2003 10:04 AM To: R-Help Subject: Re: [R] how to calculate Rsquare Can anybody send these articles for me? NagelKerke, N. J. D. (1991) A note on a general definition of the coefficient of determination, Biometrika 78: 691-2. Cox, D. R. and Wermuth, N. (1992) A comment on the coefficient of determination for binary responses, The American Statistician 46: 1-4. Thanks Ronaldo -- Of __course it's the murder weapon. Who would frame someone with a fake? -- | // | \\ [***] | ( õ õ ) [Ronaldo Reis Júnior] | V [UFV/DBA-Entomologia] |/ \ [36571-000 Viçosa - MG ] | /(.''`.)\ [Fone: 31-3899-2532 ] | /(: :' :)\ [EMAIL PROTECTED]] |/ (`. `'` ) \[ICQ#: 5692561 | LinuxUser#: 205366 ] |( `- ) [***] | _/ \_Powered by GNU/Debian Woody/Sarge __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Exporting R graphs (review)
On Wed, 2003-08-27 at 09:21, [EMAIL PROTECTED] wrote: Hi guys, Yesterday I posted my first couple of questions (see bottom of this message) to this forum and I would like to thank you guys for all the useful feedback I got. I just would like to make some comments: 1. Exporting R graphs as vector graphics: The best answer came from Thomas Lumley [EMAIL PROTECTED] He suggested using the RSvgDevice package. As far as I know SVG graphics can be manipulated with OpenOffice and also with sodipodi, I'll check this package out asap. This should apply to linux and win users. Just a quick heads up that you can export SVG format files from OOo Draw. However, there is no present ability to import them into the OOo apps. According to OOo's IssueZilla, there are no plans to support SVG import prior to version 2.0. There is however a fair amount of pressure to do so as SVG formats become more prevalent as a cross-platform vector format, especially now that web apps like Mozilla/Firebird are building support for it. This is one of the reasons that I have stayed with bitmaps for screen display and EPS for printing when using OOo. Also, you may be aware that OOo V1.1 (which is at RC3 right now) can export PDF files directly. However, if you have EPS images embedded in a document or slide show, they print as you see them on the screen (blank objects with the embedded title). Thus you need to print them to a PS file and then use ps2pdf if you want a proper PDF file generated. Lastly, for those interested, there is a Java-based OOo Writer to LaTeX CLI conversion utility and Writer export filter in development. Amazingly, it is called Writer2LaTex... ;-) Info is available at http://www.hj-gym.dk/~hj/writer2latex/ HTH, Marc Schwartz __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
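For anyone who wants to try the RSvgDevice route mentioned above, here is a minimal sketch; it assumes the package installs on your platform and uses its devSVG() device function (see ?devSVG for the full argument list):

  library(RSvgDevice)
  devSVG(file = "myplot.svg", width = 6, height = 6)
  plot(rnorm(100), rnorm(100), main = "Example scatterplot")
  dev.off()

The resulting myplot.svg can then be opened in an SVG-aware editor such as sodipodi, or imported wherever SVG is supported.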
Re: [R] how to calculate Rsquare
The Battelle Institute surely should have access to a library with such popular and prestigious journals as Biometrika and The American Statistician. If you don't have time for that, you surely should have money to purchase a copy from, e.g., www.lindahall.org/docserv. hope this helps. spencer graves Paul, David A wrote: I think you've badly misinterpreted the purpose of the R listserv with this request: https://www.stat.math.ethz.ch/mailman/listinfo/r-help says "The `main' R mailing list, for announcements about the development of R and the availability of new code, questions and answers about problems and solutions using R, enhancements and patches to the source code and documentation of R, comparison and compatibility with S and S-plus, and for the posting of nice examples and benchmarks." -Original Message- From: Ronaldo Reis Jr. [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 27, 2003 10:04 AM To: R-Help Subject: Re: [R] how to calculate Rsquare Can anybody send these articles for me? NagelKerke, N. J. D. (1991) A note on a general definition of the coefficient of determination, Biometrika 78: 691-2. Cox, D. R. and Wermuth, N. (1992) A comment on the coefficient of determination for binary responses, The American Statistician 46: 1-4. Thanks Ronaldo __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Exporting R graphs (review)
Marc Schwartz [EMAIL PROTECTED] writes: On Wed, 2003-08-27 at 09:21, [EMAIL PROTECTED] wrote: Hi guys, Yesterday I posted my first couple of questions (see bottom of this message) to this forum and I would like to thank you guys for all the useful feedback I got. I just would like to make some comments: 1. Exporting R graphs as vector graphics: The best answer came from Thomas Lumley [EMAIL PROTECTED] He suggested using the RSvgDevice package. As far as I know SVG graphics can be manipulated with OpenOffice and also with sodipodi, I'll check this package out asap. This should apply to linux and win users. Just a quick heads up that you can export SVG format files from OOo Draw. However, there is no present ability to import them into the OOo apps. According to OOo's IssueZilla, there are no plans to support SVG import prior to version 2.0. There is however a fair amount of pressure to do so as SVG formats become more prevalent as a cross-platform vector format, especially now that web apps like Mozilla/Firebird are building support for it. There is also the option of writing a driver specifically for oodraw's format (zipped XML files, mainly). AFAICT, this is mainly a whole lot of red tape, with the actual plotting specified in sections like <draw:polyline draw:style-name="gr6" draw:layer="layout" svg:width="6.272cm" svg:height="5.269cm" draw:transform="rotate (-0.767770337952161) translate (16.15cm 9.809cm)" svg:viewBox="0 0 6272 5269" draw:points="0,3261 325,6 3206,0 3755,5268 6271,3261"/> I.e. it doesn't look impossible, but might require a bit of stamina... In principle you could also try xfig() -> CGM -> oodraw and maybe other routes using fig2dev but I can't vouch for the quality. -- O__ Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] selecting by variable
On Wed, 27 Aug 2003, Petr Pikal wrote: Hallo On 27 Aug 2003 at 1:49, Eugene Salinas wrote: Hi, I'm a recent R convert so I haven't quite figured out the details yet... How do I select one variable by another one? I.e. if I want to draw the histogram of variable X only for those individuals that also have a value Y in a certain range? In STATA I would give something like: histogram X if ((Y>=A) & (Y<=B)) hist(X[(Y>=A) & (Y<=B)]) if A and B are objects storing your limits ?Logic ?[ (The data is for individuals and each individual has a number of characteristics including X and Y). thanks, eugene. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help Cheers Petr Pikal [EMAIL PROTECTED] __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help __ Salvador Alcaraz Carrasco http://www.umh.es Arquitectura y Tecnología de Computadores http://obelix.umh.es Dpto. Física y Arquitectura de Computadores [EMAIL PROTECTED] Universidad Miguel Hernández [EMAIL PROTECTED] Avda. del ferrocarril, s/n Telf. +34 96 665 8495 Elche, Alicante (Spain) __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
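A small self-contained illustration of the indexing idea, with made-up data; A and B stand for whatever limits you need:

  set.seed(1)
  X <- rnorm(200)          # the variable to plot
  Y <- runif(200)          # the variable to select on
  A <- 0.25; B <- 0.75     # keep individuals with Y between A and B
  hist(X[Y >= A & Y <= B], main = "X for individuals with A <= Y <= B")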
[R] How to test a model with two unknown constants
Hi all, suppose I've got a vector y with some data (from a repeated measure design) observed given the conditions in f1 and f2. I've got a model with two unknown fixed constants a and b which tries to predict y with respect to the values in f1 and f2. Here is an example # data y <- c(runif(10, -1,0), runif(10,0,1)) # f1 f1 <- rep(c(-1.4, 1.4), rep(10,2)) # f2 f2 <- rep(c(-.5, .5), rep(10,2)) Suppose my simple model looks like y = a/f1 + b*f2 Is there a function in R which can compute the estimates for a and b? And is it possible to test the model, e.g. how good the fit of the model is? Thanks, Sven __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] how to calculate Rsquare
On Wed, 27 Aug 2003 11:04:21 -0300 Ronaldo Reis Jr. [EMAIL PROTECTED] wrote: Can anybody send these articles for me? NagelKerke, N. J. D. (1991) A note on a general definition of the coefficient of determination, Biometrika 78: 691-2. The fitting functions lrm, psm, cph in the Design package compute Nagelkerke's measures. -F Harrell Cox, D. R. and Wermuth, N. (1992) A comment on the coefficient of determination for binary responses, The American Statistician 46: 1-4. Thanks Ronaldo -- Of __course it's the murder weapon. Who would frame someone with a fake? -- [Ronaldo Reis Júnior, UFV/DBA-Entomologia, 36571-000 Viçosa - MG, Fone: 31-3899-2532, [EMAIL PROTECTED], ICQ#: 5692561, LinuxUser#: 205366, Powered by GNU/Debian Woody/Sarge] __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help --- Frank E Harrell Jr, Prof. of Biostatistics & Statistics, Div. of Biostatistics & Epidem., Dept. of Health Evaluation Sciences, U. Virginia School of Medicine, http://hesweb1.med.virginia.edu/biostat __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
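For readers who mainly want the number rather than the articles, here is a minimal sketch (on simulated data) of Nagelkerke's R-squared computed by hand from a binary-response glm() fit; the Design package's lrm() reports the same quantity in its printed output:

  set.seed(1)
  x <- rnorm(100)
  y <- rbinom(100, 1, plogis(x))
  fit  <- glm(y ~ x, family = binomial)
  null <- glm(y ~ 1, family = binomial)
  n  <- length(y)
  LR <- null$deviance - fit$deviance                 # likelihood-ratio chi-square
  R2.cs  <- 1 - exp(-LR / n)                         # Cox & Snell R-squared
  R2.nag <- R2.cs / (1 - exp(-null$deviance / n))    # Nagelkerke's rescaled R-squared
  R2.nag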
[R] Basic GLM: residuals definition
Dear R Users, I suppose this is a schoolboy question, but here it is anyway. I'm trying to re-create the residuals for a Poisson GLM with simulated data; x <- rpois(1000,5) model <- glm(x~1,poisson) my.resids <- (log(x) - summary(model)$coefficients[1]) plot(my.resids,residuals(model)) This shows that my calculated residuals (my.resids) are not the same as residuals(model). p 65 of Annette Dobson's book says that GLM (unstandardised) residuals are calculated by analogy with the Normal case. So where am I going wrong? Thanks for your attention. Martin. Martin Hoyle, School of Life and Environmental Sciences, University of Nottingham, University Park, Nottingham, NG7 2RD, UK Webpage: http://myprofile.cos.com/martinhoyle __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] How to test a model with two unknown constants
That's the linear model lm(y ~ I(1/f1) + f2), so yes, yes and fuller answers can be found in most of the books and guides mentioned in R's FAQ. Note that how `good' the fit is will have to be relative, unless you really can assume a uniform error with range 1, when you could do a maximum-likelihood fit (and watch out for the non-standard distribution theory). On 27 Aug 2003, Sven Garbade wrote: Hi all, suppose I've got a vector y with some data (from a repeated measure design) observed given the conditions in f1 and f2. I've got a model with two unknown fixed constants a and b which tries to predict y with respect to the values in f1 and f2. Here is an example # data y <- c(runif(10, -1,0), runif(10,0,1)) # f1 f1 <- rep(c(-1.4, 1.4), rep(10,2)) # f2 f2 <- rep(c(-.5, .5), rep(10,2)) Suppose my simple model looks like y = a/f1 + b*f2 Is there a function in R which can compute the estimates for a and b? And is it possible to test the model, e.g. how good the fit of the model is? -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] How to test a model with two unknown constants
Sven Garbade [EMAIL PROTECTED] writes: Hi all, suppose I've got a vector y with some data (from a repeated measure design) observed given the conditions in f1 and f2. I've got a model with two unknown fixed constants a and b which tries to predict y with respect to the values in f1 and f2. Here is an example # data y <- c(runif(10, -1,0), runif(10,0,1)) # f1 f1 <- rep(c(-1.4, 1.4), rep(10,2)) # f2 f2 <- rep(c(-.5, .5), rep(10,2)) Suppose my simple model looks like y = a/f1 + b*f2 Is there a function in R which can compute the estimates for a and b? And is it possible to test the model, e.g. how good the fit of the model is? f2 and 1/f1 are exactly collinear, so no, not in R, nor any other way. Apart from that, the model is linear in a and b so lm() can fit it (with different f1 and f2) if you're not too squeamish about the error distribution. -- O__ Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
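Putting the two replies together, here is a minimal sketch with simulated f1 and f2 that are not exactly collinear (unlike the two-level design above); all values are made up, and the intercept is dropped because the model contains only a and b:

  set.seed(1)
  f1 <- runif(20, 0.5, 2)
  f2 <- runif(20, -1, 1)
  a <- 2; b <- 3
  y  <- a / f1 + b * f2 + rnorm(20, sd = 0.1)
  fit <- lm(y ~ I(1/f1) + f2 - 1)   # coefficients estimate a and b
  summary(fit)                      # t-tests, residual SE and R-squared indicate how good the fit is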
Re: [R] how to calculate Rsquare
Spencer Graves [EMAIL PROTECTED] writes: The Battelle Institute surely should have access to a library with such popular and prestigious journals as Biometrika and The American Statistician. If you don't have time for that, you surely should have money to purchase a copy from, e.g., www.lindahall.org/docserv. Battelle is not the issue, Entomology Dept. at Univ.Fed.de Viçosa is. That is presumably a somewhat poorer place. Still, you (Ronaldo) should check whether there is JSTOR access from somewhere around you, as I'm sure those recipients of r-help who have it will be unsure of what licences they might break by sending you free copies. And David's right: This is outside the scope of r-help. From: Ronaldo Reis Jr. [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 27, 2003 10:04 AM To: R-Help Subject: Re: [R] how to calculate Rsquare Can anybody send these articles for me? NagelKerke, N. J. D. (1991) A note on a general definition of the coefficient of determination, Biometrika 78: 691-2. Cox, D. R. and Wermuth, N. (1992) A comment on the coefficient of determination for binary responses, The American Statistician 46: 1-4. -- O__ Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] read.spss (package foreign) and character columns
Dear R users! I am using R Version 1.7.1, Windows XP, package foreign (Version: 0.6-1), SPSS 11.5.1. There is one thing I noticed with read.spss, and I'd like to ask if this is considered to be a feature, or possibly a bug: When reading character columns, character strings seem to get filled with blanks at the end. Simple example: In SPSS, create a file with one variable called xchar of type A5 (character of length 5), and 3 values ("a", "ab", "abcde"), save it as test.sav. In R: library(foreign) test <- read.spss("test.sav", to.data.frame=T) test XCHAR 1 a 2 ab 3 abcde levels(test$XCHAR) [1] "a    " "ab   " "abcde" Shouldn't it rather be "a" "ab" "abcde" (no blanks)? -Heinrich. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] how to calculate Rsquare
Please excuse my reference to Battelle: I confused who was asking and who was answering the question. Thanks, Peter, for the clarification and for the alternative suggestions. Spencer Graves Peter Dalgaard BSA wrote: Spencer Graves [EMAIL PROTECTED] writes: The Battelle Institute surely should have access to a library with such popular and prestigious journals as Biometrika and The American Statistician. If you don't have time for that, you surely should have money to purchase a copy from, e.g., www.lindahall.org/docserv. Battelle is not the issue, Entomology Dept. at Univ.Fed.de Viçosa is. That is presumably a somewhat poorer place. Still, you (Ronaldo) should check whether there is JSTOR access from somewhere around you, as I'm sure those recipients of r-help who have it will be unsure of what licences they might break by sending you free copies. And David's right: This is outside the scope of r-help. From: Ronaldo Reis Jr. [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 27, 2003 10:04 AM To: R-Help Subject: Re: [R] how to calculate Rsquare Can anybody send these articles for me? NagelKerke, N. J. D. (1991) A note on a general definition of the coefficient of determination, Biometrika 78: 691-2. Cox, D. R. and Wermuth, N. (1992) A comment on the coefficient of determination for binary responses, The American Statistician 46: 1-4. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
RE: [R] Basic GLM: residuals definition
As ?residuals.glm reveals, it's got an argument type: type: the type of residuals which should be returned. The alternatives are: `deviance' (default), `pearson', `working', `response', and `partial'. You calculated residuals on the log (link) scale; R gives deviance residuals by default. The different types are covered by most books on generalized linear models. HTH Thomas -Original Message- From: Martin Hoyle [mailto:[EMAIL PROTECTED] Sent: 27 August 2003 17:01 To: [EMAIL PROTECTED] Subject: [R] Basic GLM: residuals definition Dear R Users, I suppose this is a schoolboy question, but here it is anyway. I'm trying to re-create the residuals for a Poisson GLM with simulated data; x <- rpois(1000,5) model <- glm(x~1,poisson) my.resids <- (log(x) - summary(model)$coefficients[1]) plot(my.resids,residuals(model)) This shows that my calculated residuals (my.resids) are not the same as residuals(model). p 65 of Annette Dobson's book says that GLM (unstandardised) residuals are calculated by analogy with the Normal case. So where am I going wrong? Thanks for your attention. Martin. Martin Hoyle, School of Life and Environmental Sciences, University of Nottingham, University Park, Nottingham, NG7 2RD, UK Webpage: http://myprofile.cos.com/martinhoyle __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help --- Thomas Hotz Research Associate in Medical Statistics University of Leicester United Kingdom Department of Epidemiology and Public Health 22-28 Princess Road West Leicester LE1 6TP Tel +44 116 252-5410 Fax +44 116 252-5423 Division of Medicine for the Elderly Department of Medicine The Glenfield Hospital Leicester LE3 9QP Tel +44 116 256-3643 Fax +44 116 232-2976 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Basic GLM: residuals definition
On Wed, 27 Aug 2003, Martin Hoyle wrote: Dear R Users, I suppose this is a schoolboy question, but here it is anyway. I'm trying to re-create the residuals for a Poisson GLM with simulated data; x <- rpois(1000,5) model <- glm(x~1,poisson) my.resids <- (log(x) - summary(model)$coefficients[1]) plot(my.resids,residuals(model)) This shows that my calculated residuals (my.resids) are not the same as residuals(model). p 65 of Annette Dobson's book says that GLM (unstandardised) residuals are calculated by analogy with the Normal case. So where am I going wrong? Not reading the help page. Hint: what is the default for the type argument for the glm method for residuals? A much better reference for this is Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. Chapter 4 of Hinkley, D. V., Reid, N. and Snell, E. J. (eds) (1991) Statistical Theory and Modelling. In Honour of Sir David Cox, FRS. London: Chapman & Hall. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
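To make the distinction concrete, here is a small sketch comparing hand-computed response and deviance residuals with what residuals() returns; the original my.resids (log(x) minus the fitted intercept) is a difference on the link scale and matches neither type:

  set.seed(1)
  x <- rpois(1000, 5)
  model <- glm(x ~ 1, family = poisson)
  mu <- fitted(model)                                   # exp(intercept), identical for all observations
  r.resp <- x - mu                                      # response residuals
  d <- ifelse(x == 0, 0, x * log(x / mu)) - (x - mu)    # half the unit deviance, with 0*log(0) taken as 0
  r.dev <- sign(x - mu) * sqrt(2 * d)                   # deviance residuals
  all.equal(r.resp, residuals(model, type = "response"), check.attributes = FALSE)
  all.equal(r.dev,  residuals(model, type = "deviance"), check.attributes = FALSE)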
Re: [R] how to calculate Rsquare
Hi, Ronaldo: Have you talked with anyone in the Math Department in the Univ.Fed.de Viçosa? They offer courses in Statistics there, and I would expect that someone there could help you get copies of the articles of interest. I wonder if such contacts might help you with other statistics-related issues as well. hope this helps. Spencer Graves Peter Dalgaard BSA wrote: Spencer Graves [EMAIL PROTECTED] writes: The Battelle Institute surely should have access to a library with such popular and prestigious journals as Biometrika and The American Statistians. If you don't have time for that, you surely should have money to purchase a copy from, e.g., www.lindahall.org/docserv. Battelle is not the issue, Entomology Dept. at Univ.Fed.de Viçosa is. That is presumably a somewhat poorer place. Still, you (Ronaldo) should check whether there is JSTOR access from somewhere around you, as I'm sure those recipients of r-help who have it will be unsure of what licences they might break by sending you free copies. And David's right: This is outside the scope of r-help. From: Ronaldo Reis Jr. [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 27, 2003 10:04 AM To: R-Help Subject: Re: [R] how to calculate Rsquare Can anybody send these articles for me? NagelKerke, N. J. D. (1991) A note on a general definition of the coefficient of determination, Biometrika 78: 691-2. Cox, D. R. and Wermuth, N. (1992) A comment on the coefficient of determination for binary responses, The American Statistician 46: 1-4. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] read.spss (package foreign) and character columns
On Wed, 27 Aug 2003, RINNER Heinrich wrote: Dear R users! I am using R Version 1.7.1, Windows XP, package foreign (Version: 0.6-1), SPSS 11.5.1. There is one thing I noticed with read.spss, and I'd like to ask if this is considered to be a feature, or possibly a bug: When reading character columns, character strings seem to get filled with blanks at the end. Simple example: In SPSS, create a file with one variable called xchar of type A5 (character of length 5), and 3 values ("a", "ab", "abcde"), save it as test.sav. In R: library(foreign) test <- read.spss("test.sav", to.data.frame=T) test XCHAR 1 a 2 ab 3 abcde levels(test$XCHAR) [1] "a    " "ab   " "abcde" Shouldn't it rather be "a" "ab" "abcde" (no blanks)? You said it was a character string of length 5, not <=5. It's easy to strip trailing blanks (?sub has several ways). -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] read.spss (package foreign) and character columns
RINNER Heinrich [EMAIL PROTECTED] writes: Dear R users! I am using R Version 1.7.1, Windows XP, package foreign (Version: 0.6-1), SPSS 11.5.1. There is one thing I noticed with read.spss, and I'd like to ask if this is considered to be a feature, or possibly a bug: When reading character columns, character strings seem to get filled with blanks at the end. Simple example: In SPSS, create a file with one variable called xchar of type A5 (character of length 5), and 3 values ("a", "ab", "abcde"), save it as test.sav. In R: library(foreign) test <- read.spss("test.sav", to.data.frame=T) test XCHAR 1 a 2 ab 3 abcde levels(test$XCHAR) [1] "a    " "ab   " "abcde" Shouldn't it rather be "a" "ab" "abcde" (no blanks)? I believe they are being saved as fixed-length strings in the SPSS file and R is just reading what it was given. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
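Following the pointer to ?sub, here is a minimal sketch of stripping the trailing blanks after import; the values below simply mimic what read.spss returns for an A5 column:

  XCHAR <- factor(c("a    ", "ab   ", "abcde"))    # fixed-width strings as imported
  levels(XCHAR) <- sub(" +$", "", levels(XCHAR))   # drop trailing blanks from the levels
  levels(XCHAR)                                    # "a" "ab" "abcde"

For a data frame produced by read.spss, the same idea applies column by column, e.g. levels(test$XCHAR) <- sub(" +$", "", levels(test$XCHAR)).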
Re: [R] discriminant function
Thank you all for the quick responses. However, I'm not sure I understand the scaling matrix (denoted S henceforth) correctly. An observation x will be transformed by Sx into a new vector space with the properties given by the description. What is now the direction perpendicular to the separating plane as estimated in the process of the lda? That direction is what I'm primarily interested in. When plotting an lda object I see diagrams with observations when the plane (here the line) of separation is chosen canonically to be {ax | a \in R}. Thanks, best wishes, Stefan On Tue, 2003-08-26 at 15:59, Torsten Hothorn wrote: On 26 Aug 2003, Stefan [ISO-8859-1] Böhringer wrote: How can I extract the linear discriminant functions resulting from a LDA analysis? The coefficients are listed as a result from the analysis but I have not found a way to extract these programmatically. No references in the archives were found. ?lda tells you about the object returned by `lda', especially its element: scaling: a matrix which transforms observations to discriminant functions, normalized so that within groups covariance matrix is spherical. Torsten Thank you very much, Stefan __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
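As a minimal two-group sketch of where that direction lives in an lda() fit from MASS (using part of the iris data, purely for illustration): with two groups, $scaling has a single column giving the coefficients of the discriminant function, and that vector is, up to a scale factor, the normal of the separating hyperplane in the original variable space.

  library(MASS)
  ir <- iris[iris$Species != "virginica", ]
  ir$Species <- factor(ir$Species)   # keep only the two remaining groups
  fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = ir)
  fit$scaling                        # column LD1: discriminant coefficients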
[R] Re: diamond graphs
Drs. Harrell and O'Keefe, Thank you for your suggestions. Regarding your comments about the content of the paper, I respectfully disagree that categorizing continuous variables is a fundamental violation of statistical graphics, nor are you to assume that all categorizations are arbitrary. In any case, the discussion section of our paper contains text acknowledging that contour plots are a preferred option when the continuity of variables is desired to be preserved. The hexagons we proposed seem, at first glance, to be unnecessarily complex but they fulfill properties that none of the other considered alternatives do (Table 1 and Figure 1 in paper and Figure 6 using Trellis). It is unfortunate that the comments from Dr. O'Keefe were based on a press release and not on the manuscript itself. I apologize for the press release implying no graphical progress in the 20th century. Many of his points are addressed in the manuscript. Regarding the extension of the methods to outcomes taking negative values (e.g., changes in markers), the use of two colors is an alternative but the plotting of 0.5*[1+(outcome/max(|outcome|)] and using the option E of Figure 1 in the paper will result in negative and positive values having opposite topology (much as the contrast of negative/positive bars in the unidimensional case). I will be happy to expedite a reprint to Dr. O'Keefe. If you so desire, please email the address to which it should be sent. Although it is at odds with your beliefs, University staff working on licensing and technology transfer believe that a patent may be a vehicle to achieve a wide use. The audience of the proposed methods would be the end users who are not sophisticated programmers and, therefore, the hope is that it would be available in widely used software which is not the case of the high end software (e.g., R). The proposed graph of 2D equiponderant display of two predictors is just a display procedure, not an inferential tool. The sophisticated analyst has little or no need for the proposed method. It does overcome the pitfalls of 3D bar graphs and, therefore, has the potential of improving the way we communicate our findings. Needless to say, were the predictions of Dr. Harrell to be on target, we will change course as the staff working on the licensing have planned from the start. We will be happy to share the code we wrote to produce the figures in The American Statistician paper with individuals wanting to use the software for academic purposes. Please send request for it to [EMAIL PROTECTED] In summary, our idea is a simple one (one that I refer as needing only 8th grade geometry) and it is its simplicity which has been fun to peruse. Alvaro Muñoz __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Re: diamond graphs
On Wed, 27 Aug 2003 13:40:59 -0400 Alvaro Muñoz [EMAIL PROTECTED] wrote: Drs. Harrell and O'Keefe, Thank you for your suggestions. Regarding your comments about the content of the paper, I respectfully disagree that categorizing continuous variables is a fundamental violation of statistical graphics, nor are you to assume that all categorizations are arbitrary. In any case, the discussion section of our paper contains text acknowledging that contour plots are a preferred option when the continuity of variables is desired to be preserved. The hexagons we proposed seem, at first glance, to be unnecessarily complex but they fulfill properties that none of the other considered alternatives do (Table 1 and Figure 1 in paper and Figure 6 using Trellis). I appreciate your reply Dr Munoz. I will have to disagree with you about the above although I think you made some good points. I have seen many, many examples where categorization results in low-precision estimates and slight changes in the bins results in a significantly different landscape. I have also seen many examples in epidemiology where stratified estimates have been misinterpreted. Even though thermometer and similar plots have defects that you mentioned in your paper, they much more intuitively and precisely map values into the human brain. The same is true of Cleveland's dot plots although one has to be careful, as you said in your article, about the ordering of stratifiers. It is unfortunate that the comments from Dr. O'Keefe were based on a press release and not on the manuscript itself. I apologize for the press release implying no graphical progress in the 20th century. Many of his points are addressed in the manuscript. Regarding the extension of the methods to outcomes taking negative values (e.g., changes in markers), the use of two colors is an alternative but the plotting of 0.5*[1+(outcome/max(|outcome|)] and using the option E of Figure 1 in the paper will result in negative and positive values having opposite topology (much as the contrast of negative/positive bars in the unidimensional case). I will be happy to expedite a reprint to Dr. O'Keefe. If you so desire, please email the address to which it should be sent. Although it is at odds with your beliefs, University staff working on licensing and technology transfer believe that a patent may be a vehicle to achieve a wide use. The audience of the proposed methods would be the end users who are not sophisticated programmers and, therefore, the hope is that it would be available in widely used software which is not the case of the high end software (e.g., R). The proposed graph of 2D equiponderant display of two predictors is just a display procedure, not an inferential tool. The sophisticated analyst has little or no need for the proposed method. It does overcome the pitfalls of 3D bar graphs and, therefore, has the potential of improving the way we communicate our findings. Needless to say, were the predictions of Dr. Harrell to be on target, we will change course as the staff working on the licensing have planned from the start. Their belief that a patent on an idea may help achieve a wide use is sadly mistaken and is almost comical. The statement it would be available in widely used software which is not the case of the high end software is very difficult to comprehend (especially in view of easy to use GUIs such as Rcmdr now available for R, as well as web interfaces). There are several books I could recommend to your university staff. 
We will be happy to share the code we wrote to produce the figures in The American Statistician paper with individuals wanting to use the software for academic purposes. Please send request for it to [EMAIL PROTECTED] Unfortunately, I think that once the patent announcement was made, the number of individuals interested in the method lessened considerably. In summary, our idea is a simple one (one that I refer as needing only 8th grade geometry) and it is its simplicity which has been fun to peruse. Alvaro Muñoz Again I do thank you for your note. Sincerely, Frank Harrell --- Frank E Harrell Jr Prof. of Biostatistics Statistics Div. of Biostatistics Epidem. Dept. of Health Evaluation Sciences U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] RMySQL crashing R
Hi, There have been a number of reports of RMySQL crashing R when attempting to connect to a MySQL server using dbConnect(). The problem appears to be in some binary versions of the MySQL client library. Known instances include (1) Red Hat MySQL binary RPM client library 3.23.32, but updating to 3.23.56 solved the problem. (2) Debian MySQL binary client library 3.23.49, but updating to 3.23.56 solved the problem. Moreover, the change logs in Appendix D of the MySQL manual (www.mysql.com) indicate that two bugs consistent with the crashes we've seen were fixed in 3.23.50, namely, a buffer overflow problem when reading startup parameters and a memory allocation bug in the glibc library used to build Linux binaries. If you experience this problem, could you let me know the version information (R, MySQL, and the operating system)? Also, I'd like to know if updating the MySQL client library and re-installing RMySQL fix your problem. Thanks, -- David PS Thanks to Deepayan Sarkar, John Heuer, and Matthew Kelly for helping me track this problem. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
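For anyone trying to reproduce or report the problem, a minimal connection sketch follows; the host, database name and credentials are placeholders, and dbConnect() is the call that crashes with the affected client libraries:

  library(RMySQL)
  drv <- dbDriver("MySQL")
  con <- dbConnect(drv, user = "someuser", password = "somepass",
                   dbname = "test", host = "localhost")
  dbGetQuery(con, "SELECT VERSION()")   # server version, useful for the bug report
  dbDisconnect(con)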
[R] Minard's Challenge: Re-Visioning Minard Contest
In a recent talk ('Visions of the Past, Present & Future of Statistical Graphics'), I talked about, among other things, the lessons Minard's March on Moscow graphic had for modern statistical graphics, and illustrated aspects of power and simplicity in several programming languages where this graphic had been recreated. I referred to 'elegance factors' of various programming languages in terms of the power, simplicity and transparency of data representations and procedural or declarative specifications required to program a re-creation (or extension) of this famous graph. It occurred to me that it might be of interest, perhaps fun, and hopefully illuminating to pose this as a formal challenge to the R community and others. Several existing exemplars are shown on my 'Re-visions of Minard' web page http://www.math.yorku.ca/SCS/Gallery/re-minard.html (in the Gallery of Data Visualization, ../) These include programming examples in Mathematica, SAS/IML Workshop, Wilkinson's Grammar of Graphics, images created in other data visualization systems, raw materials (images, data), etc. There are no formal rules for this Re-Visioning Minard Contest, but each entry should ideally include: (a) an image file in web-friendly format (.jpg, .gif, .png, etc), (b) the program and data used to draw the image, (c) a 'what they were thinking' description of the process used in constructing the graph. To save bandwidth on r-help, I'll ask responders to reply to the list only with reactions to this challenge and what they deem useful to share with all readers. Other ways to reply include posting a web URL where readers can view the details or a direct email reply to me. -- Michael Friendly Email: [EMAIL PROTECTED] Professor, Psychology Dept. York University Voice: 416 736-5115 x66249 Fax: 416 736-5814 4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html Toronto, ONT M3J 1P3 CANADA __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] R on Linux/Opteron?
On 26 Aug 2003, Peter Dalgaard BSA wrote: Dirk Eddelbuettel [EMAIL PROTECTED] writes: On Tue, Aug 26, 2003 at 03:17:19PM -0400, Liaw, Andy wrote: Has anyone tried using R on the the AMD Opteron in either 64- or 32-bit mode? If so, any good/bad experiences, comments, etc? We are considering getting this hardware, and would like to know if R can run smoothly on such a beast. Any comment much appreciated. http://buildd.debian.org/build.php?pkg=r-basearch=ia64file=log has logs of R builds on ia64 since Nov 2001, incl. the outcome of make check. We do not run the torture tests -- though I guess we could on some of the beefier hardware such as ia64. I don't think that's quite the same beast, though. Opterons are the x86-64 (or amd64) architecture and ia64 is Intel's, aka Itanium. Debian appears to be just warming up to including this architecture: http://lists.debian.org/debian-x86-64/2003/debian-x86-64-200308/threads.html whereas they have had ia64 out for a while. SuSE has an Opteron option and Luke said he tried it. Apparently it has a functioning 64-bit compiler toolchain - I weren't sure earlier whether they were just running a 64bit kernel and 32bit applications, but when Luke says so, I believe it... I wasn't sure either, especially about default settings, but 'file' says luke/R file bin/R.bin bin/R.bin: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), dynamically linked (uses shared libs), not stripped and in R Sys.info()[machine] machine x86_64 .Machine$sizeof.pointer [1] 8 So it looks like a functional 64-bit setup so far. luke -- Luke Tierney University of Iowa Phone: 319-335-3386 Department of Statistics andFax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: [EMAIL PROTECTED] Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
RE: [R] R on Linux/Opteron?
Thanks to all (and especially Prof. Tierney) for the response. The box we are considering will spend probably over 90% of CPU time in R, so it's comforting to know that R compiles and pass all the test (at least once) on such platform. (I switched my attention from Itanium to Opteron when I read that Itanium is slower than P4 for number crunching...) Best, Andy From: Luke Tierney [mailto:[EMAIL PROTECTED] On 26 Aug 2003, Peter Dalgaard BSA wrote: Dirk Eddelbuettel [EMAIL PROTECTED] writes: On Tue, Aug 26, 2003 at 03:17:19PM -0400, Liaw, Andy wrote: Has anyone tried using R on the the AMD Opteron in either 64- or 32-bit mode? If so, any good/bad experiences, comments, etc? We are considering getting this hardware, and would like to know if R can run smoothly on such a beast. Any comment much appreciated. http://buildd.debian.org/build.php?pkg=r-basearch=ia64file=log has logs of R builds on ia64 since Nov 2001, incl. the outcome of make check. We do not run the torture tests -- though I guess we could on some of the beefier hardware such as ia64. I don't think that's quite the same beast, though. Opterons are the x86-64 (or amd64) architecture and ia64 is Intel's, aka Itanium. Debian appears to be just warming up to including this architecture: http://lists.debian.org/debian-x86- 64/2003/debian-x86-64-200308/thread s.html whereas they have had ia64 out for a while. SuSE has an Opteron option and Luke said he tried it. Apparently it has a functioning 64-bit compiler toolchain - I weren't sure earlier whether they were just running a 64bit kernel and 32bit applications, but when Luke says so, I believe it... I wasn't sure either, especially about default settings, but 'file' says luke/R file bin/R.bin bin/R.bin: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), dynamically linked (uses shared libs), not stripped and in R Sys.info()[machine] machine x86_64 .Machine$sizeof.pointer [1] 8 So it looks like a functional 64-bit setup so far. luke -- Luke Tierney University of Iowa Phone: 319-335-3386 Department of Statistics andFax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: [EMAIL PROTECTED] Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu -- Notice: This e-mail message, together with any attachments,...{{dropped}} __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] testing if two multivariate samples are from the same distribution
Hello, everyone! I wonder if any R package can do the multivariate Smirnov test. Specifically, let x_1,..,x_n and y_1,...,y_m be multivariate vectors. I would like to test if the two samples are from the same underlying multivariate distribution. Thanks in advance. Jason = Jason G. Liao, Ph.D. Division of Biometrics University of Medicine and Dentistry of New Jersey 335 George Street, Suite 2200 New Brunswick, NJ 08903-2688 phone (732) 235-8611, or (732)-235-5429 http://www.geocities.com/jg_liao __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Re: diamond graphs
Alvaro Muñoz wrote: Drs. Harrell and O'Keefe, Although it is at odds with your beliefs, University staff working on licensing and technology transfer believe that a patent may be a vehicle to achieve a wide use. The audience of the proposed methods would be the end users who are not sophisticated programmers and, therefore, the hope is that it would be available in widely used software which is not the case of the high end software (e.g., R). The proposed graph of 2D equiponderant display of two predictors is just a display procedure, not an inferential tool. The sophisticated analyst has little or no need for the proposed method. It does overcome the pitfalls of 3D bar graphs and, therefore, has the potential of improving the way we communicate our findings. Needless to say, were the predictions of Dr. Harrell to be on target, we will change course as the staff working on the licensing have planned from the start. Perhaps I can add some personal experience, as opposed to belief. After Robert Gentleman and I had made some initial progress in implementing R, we had to make some decisions about what we would do with it. We looked at a number of options ranging from something commercial to free software. After some research, personal introspection and prompting from others (hi Martin :-) we decided to release under GPL. For me personally this turned out to be far harder than I thought it would be. My institution has a particularly diabolical policy on intellectual property, especially on software. While we could have quietly released the software and just said oops later on, I chose to get approval for free release of my work. This took a number of years, several threats of resignation and a couple of salary cuts. The reason I mention this is not as a part of a personal campaign for sainthood, but rather because it has ultimately turned out to have been far more than worth the effort. The effect of making R free has been to see it picked up and vastly improved and extended by a very talented group of researchers. We've now reached a point which Robert and I and other early R adopters and contributors couldn't have anticipated in our wildest imaginings. It's truly amazing to see this software being used for all sorts of cool things. What we are seeing represents the best of what being an academic is all about - the free exchange of ideas with researchers collaborating and building on each other's work. On the other hand, I'm currently writing what will possibly become a book on visualization and graphics (publication mechanism uncertain). The techniques discussed in the book are implemented in a certain dialect of a particular computer language developed at Bell Labs. I intend to include code libraries for all the graphical techniques discussed. The fact that you have sought to patent your idea means that, whatever its merits, it's pointless for me to even mention it because I can't distribute code for it. I'm sure the licensing gnomes at your institution have expounded on how patenting will help achieve wider use, but in reality they are simply thinking revenue stream. The likely real effect of constraining access to your work in this way will be to have it sink into obscurity. Take it from one who's been there, the payoff from free dissemination is much higher.
-- Ross Ihaka Email: [EMAIL PROTECTED] Department of Statistics Phone: (64-9) 373-7599 x 85054 University of Auckland Fax:(64-9) 373-7018 Private Bag 92019, Auckland New Zealand __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Newbie graphing questions
Hi everyone. R is new to me and I'm very impressed with its capabilities but still cannot figure out how to do some basic things. There seems to be no lack of documentation but finding what I need has proven difficult. Perhaps you can help. Here's what I'm after: 1. How do I create a new plot without erasing the prior one i.e., have a new window pop up with the new graph? I'm on MacOSX using the Carbon port. 2. How do I pause between plot renderings i.e., in such a way that it will draw the subsequent graph after pressing the space bar (or any other key). 3. Illustrating critical regions. Say I wanted to illustrate the critical region of a standard normal. I would need to draw a vertical line from the critical point to the curve and then shade the critical region. How do I do this in R? Thanks! -Francisco __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help