Re: [Rd] Question re: NA, NaNs in R

2014-02-10 Thread Tim Hesterberg
This isn't quite what you were asking, but might inform your choice.

R doesn't try to maintain the distinction between NA and NaN when
doing calculations, e.g.:
> NA + NaN
[1] NA
> NaN + NA
[1] NaN
So for the aggregate package, I didn't attempt to treat them differently.

The aggregate package is available at
http://www.timhesterberg.net/r-packages

Here is the inst/doc/missingValues.txt file from that package:

--
Copyright 2012 Google Inc. All Rights Reserved.
Author: Tim Hesterberg roc...@google.com
Distributed under GPL 2 or later.


Handling of missing values and not-a-numbers.


Here I'll note how this package handles missing values.
I do it the way R handles them, rather than the more strict way that S+ does.

First, for terminology,
  NaN = not-a-number, e.g. the result of 0/0
  NA  = missing value, or "true" missing value, e.g. survey non-response
  xx  = I'll use this for the union of the two: a missing value of any kind.

For background, at the hardware level there is an IEEE standard that
specifies that certain bit patterns are NaN, and specifies that
operations involving an NaN result in another NaN.

That standard doesn't say anything about missing values, which are
important in statistics.

So what R and S+ do is to pick one of the bit patterns and declare
that to be NA.  In other words, the NA bit pattern is one of
the NaN bit patterns.

At the user level, the reverse seems to hold.
You can assign either NA or NaN to an object.
But:
is.na(x) returns TRUE for both
is.nan(x) returns TRUE for NaN and FALSE for NA
Based on that, you'd think that NaN is a subset of NA.
To tell whether something is a true missing value do:
(is.na(x) & !is.nan(x))
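
For example:

> x <- c(1, NA, NaN)
> is.na(x)
[1] FALSE  TRUE  TRUE
> is.nan(x)
[1] FALSE FALSE  TRUE
> is.na(x) & !is.nan(x)   # true missing values only
[1] FALSE  TRUE FALSE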

The S+ convention is that any operation involving NA results in an NA;
otherwise any operation involving NaN results in NaN.

The R convention is that any operation involving xx results in an xx;
a missing value of any kind results in another missing value of any
kind.  R considers NA and NaN equivalent for testing purposes:
all.equal(NA_real_, NaN)
gives TRUE.

Some R functions follow the S+ convention, e.g. the Math2 functions
in src/main/arithmetic.c use this macro:
#define if_NA_Math2_set(y,a,b)  \
if  (ISNA (a) || ISNA (b)) y = NA_REAL; \
else if (ISNAN(a) || ISNAN(b)) y = R_NaN;

Other R functions, like the basic arithmetic operations +-/*^,
do not (search for PLUSOP in src/main/arithmetic.c).
They just let the hardware do the calculations.
As a result, you can get odd results like
> is.nan(NA_real_ + NaN)
[1] FALSE
> is.nan(NaN + NA_real_)
[1] TRUE

The R help files help(is.na) and help(is.nan) suggest that
computations involving NA and NaN are indeterminate.

It is faster to use the R convention; most operations are just
handled by the hardware, without extra work.

In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
and NaN are removed.
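
For example:

> x <- c(1, NA, NaN)
> sum(x)                # a missing value; whether NA or NaN is indeterminate
[1] NA
> sum(x, na.rm = TRUE)  # both the NA and the NaN are removed
[1] 1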




There is one NA but multiple NaNs.

And please re-read 'man memcmp': your cast is wrong.

On 10/02/2014 06:52, Kevin Ushey wrote:
 Hi R-devel,

 I have a question about the differentiation between NA and NaN values
 as implemented in R. In arithmetic.c, we have

 int R_IsNA(double x)
 {
  if (isnan(x)) {
 ieee_double y;
 y.value = x;
 return (y.word[lw] == 1954);
  }
  return 0;
 }

 ieee_double is just used for type punning so we can check the final
 bits and see if they're equal to 1954; if they are, x is NA, if
 they're not, x is NaN (as defined for R_IsNaN).

 My question is -- I can see a substantial increase in speed (on my
 computer, in certain cases) if I replace this check with

 int R_IsNA(double x)
 {
  return memcmp(
  (char*)(&x),
  (char*)(&NA_REAL),
  sizeof(double)
  ) == 0;
 }

 IIUC, there is only one bit pattern used to encode R NA values, so
 this should be safe. But I would like to be sure:

 Is there any guarantee that the different functions in R would return
 NA as identical to the bit pattern defined for NA_REAL, for a given
 architecture? Similarly for NaN value(s) and R_NaN?

 My guess is that it is possible some functions used internally by R
 might encode NaN values differently; ie, setting the lower word to a
 value different than 1954 (hence being NaN, but potentially not
 identical to R_NaN), or perhaps this is architecture-dependent.
 However, NA should be one specific bit pattern (?). And, I wonder if
 there is any guarantee that the different functions used in R would
 return an NaN value as identical to R_NaN (which appears to be the
 'IEEE NaN')?

 (interested parties can see + run a simple benchmark from the gist at
 https://gist.github.com/kevinushey/8911432)

 Thanks,
 Kevin

 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel



--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor

[Rd] Suggest adding a testing keyword

2013-06-13 Thread Tim Hesterberg
I suggest adding this to R_HOME/doc/KEYWORDS.db:

Programming|testing: Software testing

and add a corresponding entry in R_HOME/doc/KEYWORDS.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] R 3.0.0 memory use

2013-04-14 Thread Tim Hesterberg
I did some benchmarking of data frame code, and
it appears that R 3.0.0 is far worse than earlier versions of R
in terms of how many large objects it allocates space for,
for data frame operations - creation, subscripting, subscript replacement.
For a data frame with n rows, it makes either 2 or 4 extra copies of
all of:
8n bytes (e.g. double precision)
24n bytes
32n bytes
E.g., for as.data.frame(numeric vector), instead of allocations
totalling ~8n bytes, it allocates 33 times that much.

Here, compare columns 3 and 5
(columns 2 and 4 are with the dataframe package).

# Summary
#                            R-2.14.2        R-2.15.3        R-3.0.0
#                            w/o     with    w/o     with    w/o
#   as.data.frame(y)         3       1       1       1       5;4;4
#   data.frame(y)            7       3       4       2       6;2;2
#   data.frame(y, z)         7 each  3 each  4       2       8;4;4
#   as.data.frame(l)         8       3       5       2       9;4;4
#   data.frame(l)            13      5       8       3       12;4;4
#   d$z <- z                 3,2     1,1     3,1     2,1     7;4;4,1
#   d[["z"]] <- z            4,3     1,1     3,1     2,1     7;4;4,1
#   d[, "z"] <- z            6,4,2   2,2,1   4,2,2   3,2,1   8;4;4,2,2
#   d["z"] <- z              6,5,2   2,2,1   4,2,2   3,2,1   8;4;4,2,2
#   d["z"] <- list(z=z)      6,3,2   2,2,1   4,2,2   3,2,1   8;4;4,2,2
#   d["z"] <- Z  # list(z=z) 6,2,2   2,1,1   4,1,2   3,1,1   8;4;4,1,2
#   a <- d["y"]              2       1       2       1       6;4;4
#   a <- d[, "y", drop=F]    2       1       2       1       6;4;4

# Where two numbers are given, they refer to:
#   (copies of the old data frame),
#   (copies of the new column)
# A third number refers to numbers of
#   (copies made of an integer vector of row names)

# For R 3.0.0, I'm getting astounding results - many more copies,
# and also some copies of larger objects; in addition to the data
# vectors of size 80K and 160K, also 240K and 320K.
# Where three numbers are given in form a;c;d, they refer to
#   (copies of 80K; 240K; 320K)

The benchmarks are at
http://www.timhesterberg.net/r-packages/memory.R

I'm using versions of R I installed from source on a Linux box, using e.g.
./configure --prefix=(my path) --enable-memory-profiling --with-readline=no
make
make install

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R 3.0.0 memory use

2013-04-14 Thread Tim Hesterberg
When I change the data set size, the extra allocations do
not change in size. This supports Luke and Martin's diagnosis.

The extra allocations are either 2 or 4 allocations each of size
 80040
240048
320040

Details (you may skip):

(Fresh session of R 3.0.0)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :80040 :"as.data.frame.numeric" 
"as.data.frame" 
320040 :80040 :240048 :320040 :80040 :240048 : 
> # Try increasing size by a factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :800040 :"as.data.frame.numeric" 
"as.data.frame" 
320040 :80040 :240048 :320040 :80040 :240048 : 

The number of allocations shown, of different sizes:

size    3.0.0   3.0.0   2.15.3  2.15.3
        first   second  first   second
240048  4       4       0       0
320040  4       4       0       0
 80040  5       4       1       0
800040  0       1       0       1

So it looks like both R 2.15.3 and R 3.0.0 are making
one copy of the data, plus extra allocations.

(Fresh session of R 2.15.3)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
80040 :"as.data.frame.numeric" "as.data.frame" 
> # Increase size by factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
800040 :"as.data.frame.numeric" "as.data.frame" 




On Sun, 14 Apr 2013 19:15:45 -0700 Martin Morgan mtmor...@fhcrc.org wrote:
On 04/14/2013 07:11 PM, luke-tier...@uiowa.edu wrote:
 There were a couple of bug fixes to somewhat obscure compound
 assignment related bugs that required bumping up internal reference
 counts. It's possible that one or more of these are responsible. If so
 it is unavoidable for now, but it's worth finding out for sure. With
 some stripped down test examples it should be possible to identify
 when things changed. I won't have time to look for some time, but if
 someone else wanted to nail this down that would be useful.

I can't quite tell from Tim's script what he's documenting. In R-2.15.3 I have

> Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
character(0)

(or sometimes [1] "new page:new page:\"Rprofmem\"")

whereas in R-3.0.0

> Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
[1] "320040 :80040 :240048 :320040 :80040 :240048 :"

I think these are the allocations Tim is seeing. They're from the parser (see
below) rather than as.data.frame. For Tim's example

> y <- 1:10^4 + 0.0
> Rprofmem(); d <- as.data.frame(y); Rprofmem(NULL); readLines("Rprofmem.out")

[1] "320040 :80040 :240048 :320040 :80040 :240048 :80040
:\"as.data.frame.numeric\" \"as.data.frame\" "
[2] "320040 :80040 :240048 :320040 :80040 :240048 :"

only the allocation 80040 is from as.data.frame (from the call stack output).

Under R -d gdb

   (gdb) b R_OutputStackTrace
   (gdb) r
Rprofmem(); Rprofmem(NULL)

   Breakpoint 1, R_OutputStackTrace (file=0xbd43f0) at
/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
   3434{
   (gdb) bt
   #0  R_OutputStackTrace (file=0xbd43f0) at
/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
   #1  0x7792ff83 in R_ReportAllocation (size=320040) at
/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3456
   #2  Rf_allocVector (type=13, length=8) at
/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:2478
   #3  0x7790bedf in growData () at gram.y:3391

and the memory allocations are from these lines in the parser gram.y

   PROTECT( bigger = allocVector( INTSXP, data_size * DATA_ROWS ) ) ;
   PROTECT( biggertext = allocVector( STRSXP, data_size ) );

I'm not sure why these show up under R 3.0.0, though.

$ R-2-15-branch/bin/R --version
R version 2.15.3 Patched (2013-03-13 r62579) -- Security Blanket
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R-3-0-branch$ bin/R --version
R version 3.0.0 Patched (2013-04-14 r62579) -- Masked Marvel
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

Martin




 Best,

 luke

 On Sun, 14 Apr 2013, Tim Hesterberg wrote:

 I did some benchmarking of data frame code, and
 it appears that R 3.0.0 is far worse than earlier versions of R
 in terms of how many large objects it allocates space for,
 for data frame operations - creation, subscripting, subscript replacement.
 For a data frame with n rows, it makes either 2 or 4 extra copies of
 all of:
8n bytes (e.g. double precision)
24n bytes
32n bytes
 E.g., for as.data.frame(numeric vector), instead of allocations
 totalling ~8n bytes, it allocates 33 times that much.

 Here, compare columns 3 and 5
 (columns 2 and 4 are with the dataframe package).

 # Summary

Re: [Rd] Suggest adding a 'pivot' argument to qr.R

2012-09-11 Thread Tim Hesterberg
On Sep 11, 2012, at 16:02 , Warnes, Gregory wrote:


 On 9/7/12 2:42 PM, peter dalgaard pda...@gmail.com wrote:


 On Sep 7, 2012, at 17:16 , Tim Hesterberg wrote:

 I suggest adding a 'pivot' argument to qr.R, to obtain columns in the
 same order as the original x, so that
 a <- qr(x)
 qr.Q(a) %*% qr.R(a, pivot=TRUE)
 returns x.

 That would come spiraling down in flames the first time someone tried to
 use backsolve on it, wouldn't it? I mean, a major point of QR is that R
 is triangular; doesn't make much sense to permute the columns without
 retaining the pivoting permutation.

 As I understand Tim's proposal, the pivot argument defaults to FALSE, so
 the new behavior would only be activated at the user's request.

Sure. I'm just saying that I see little use for the un-pivoted qr.R because, 
generically, the first thing you want to do with qr.R is to invert it, which 
is easier when it is triangular.

Greg Warnes is correct, I propose keeping the default FALSE, for backward
compatibility.

My use for the pivoted R is in computing a covariance matrix, using
  R <- qr.R(QR, pivot = TRUE)
  Rinv <- ginverse(R)
  covTerm <- Rinv %*% t(Rinv)
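
For comparison, the same reconstruction is possible today by inverting
the pivot permutation; a sketch, assuming QR <- qr(x):

  R.pivoted <- qr.R(QR)[, order(QR$pivot)]  # columns back in the order of x
  all.equal(x, qr.Q(QR) %*% R.pivoted)      # TRUE even when qr() pivoted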

But I see that lm() and glm() use chol2inv; that may be preferable.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Need to tell R CMD check that a function qr.R is not a method

2012-09-07 Thread Tim Hesterberg
When creating a package, I would like a way to tell R that
a function with a period in its name is not a method.

I'm writing a package now with a modified version of qr.R.
R CMD check gives warnings:

* checking S3 generic/method consistency ... WARNING
qr:
  function(x, ...)
qr.R:
  function(qr, complete, pivot)

See section ‘Generic functions and methods’ of the ‘Writing R
Extensions’ manual.

* checking Rd \usage sections ... NOTE
S3 methods shown with full name in documentation object 'QR.Auxiliaries':
  ‘qr.R’

The \usage entries for S3 methods should use the \method markup and
not their full name.
See the chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] suggest that as.double( something double ) not make a copy

2012-06-06 Thread Tim Hesterberg
I've been playing with passing arguments to .C(), and found that replacing
as.double(x)
with
if(is.double(x)) x else as.double(x)
saves time and avoids one copy, in the case that x is already double.

I suggest modifying as.double to avoid the extra copy and just
return x, when x is already double. Similarly for as.integer, etc.
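
A quick way to see the effect (a sketch; exact timings vary, and this
assumes the copying behavior of R at the time of writing):

x <- rnorm(10^7)
system.time(for(i in 1:100) as.double(x))              # copies x each time
system.time(for(i in 1:100)
            if(is.double(x)) x else as.double(x))      # no copy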

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Add DUP = FALSE when tabulate() calls .C(R_tabulate

2012-04-08 Thread Tim Hesterberg
In base/R/tabulate.R, tabulate() calls .C("R_tabulate", ...);
I suggest adding DUP = FALSE to that call.
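
The shape of the suggested change (a sketch of the call, not the
verbatim source; the actual argument list is in base/R/tabulate.R):

.C("R_tabulate", as.integer(bin), as.integer(length(bin)),
   as.integer(nbins), ans = integer(nbins),
   DUP = FALSE, PACKAGE = "base")$ans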

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Suggested improvement for src/library/base/man/qraux.Rd

2011-11-21 Thread Tim Hesterberg
Here is a modified version of qraux.Rd, an edited version of
R-2.14.0/src/library/base/man/qraux.Rd

This gives some details and an example for the case of pivoting.
In this case, it is not true that X = QR; rather X[, pivot] = QR.
It may save some other people bugs and time to have this information.

Tim Hesterberg

--
% File src/library/base/man/qraux.Rd
% Part of the R package, http://www.R-project.orghttp://www.r-project.org/
% Copyright 1995-2007 R Core Development Team
% Distributed under GPL 2 or later

\name{QR.Auxiliaries}
\title{Reconstruct the Q, R, or X Matrices from a QR Object}
\usage{
qr.X(qr, complete = FALSE, ncol =)
qr.Q(qr, complete = FALSE, Dvec =)
qr.R(qr, complete = FALSE)
}
\alias{qr.X}
\alias{qr.Q}
\alias{qr.R}
\arguments{
 \item{qr}{object representing a QR decomposition.  This will
   typically have come from a previous call to \code{\link{qr}} or
   \code{\link{lsfit}}.}
 \item{complete}{logical expression of length 1.  Indicates whether an
   arbitrary  orthogonal completion of the \eqn{\bold{Q}} or
   \eqn{\bold{X}} matrices is to be made, or whether the \eqn{\bold{R}}
   matrix is to be completed  by binding zero-value rows beneath the
   square upper triangle.}
 \item{ncol}{integer in the range \code{1:nrow(qr$qr)}.  The number
   of columns to be in the reconstructed \eqn{\bold{X}}.  The default
   when \code{complete} is \code{FALSE} is the first
   \code{min(ncol(X), nrow(X))} columns of the original \eqn{\bold{X}}
   from which the qr object was constructed.  The default when
   \code{complete} is \code{TRUE} is a square matrix with the original
   \eqn{\bold{X}} in the first \code{ncol(X)} columns and an arbitrary
   orthogonal completion (unitary completion in the complex case) in
   the remaining columns.}
 \item{Dvec}{vector (not matrix) of diagonal values.  Each column of
   the returned \eqn{\bold{Q}} will be multiplied by the corresponding
   diagonal value.  Defaults to all \code{1}s.}
}
\description{
 Returns the original matrix from which the object was constructed or
 the components of the decomposition.
}
\value{
 \code{qr.X} returns \eqn{\bold{X}}, the original matrix from
 which the qr object was constructed, provided \code{ncol(X) <= nrow(X)}.
 If \code{complete} is \code{TRUE} or the argument \code{ncol} is greater than
 \code{ncol(X)}, additional columns from an arbitrary orthogonal
 (unitary) completion of \code{X} are returned.

 \code{qr.Q} returns part or all of \bold{Q}, the order-nrow(X)
 orthogonal (unitary) transformation represented by \code{qr}.  If
 \code{complete} is \code{TRUE}, \bold{Q} has \code{nrow(X)} columns.
 If \code{complete} is \code{FALSE}, \bold{Q} has \code{ncol(X)}
 columns.  When \code{Dvec} is specified, each column of \bold{Q} is
 multiplied by the corresponding value in \code{Dvec}.

 \code{qr.R} returns \bold{R}.  This may be pivoted, e.g. if
 \code{a <- qr(x)} then \code{x[, a$pivot]} = \bold{QR}.
 The number of rows of \bold{R} is
 either \code{nrow(X)} or \code{ncol(X)} (and may depend on whether
 \code{complete} is \code{TRUE} or \code{FALSE}).
}
\seealso{
 \code{\link{qr}},
 \code{\link{qr.qy}}.
}
\examples{
p <- ncol(x <- LifeCycleSavings[, -1]) # not the 'sr'
qrstr <- qr(x)   # dim(x) == c(n,p)
qrstr $ rank # = 4 = p
Q <- qr.Q(qrstr) # dim(Q) == dim(x)
R <- qr.R(qrstr) # dim(R) == ncol(x)
X <- qr.X(qrstr) # X == x
range(X - as.matrix(x))# ~  6e-12
## X == Q \%*\% R if there has been no pivoting, as here.
Q \%*\% R

# example of pivoting
x <- cbind(int=1,
  b1=rep(1:0, each=3), b2=rep(0:1, each=3),
  c1=rep(c(1,0,0), 2), c2=rep(c(0,1,0), 2), c3=rep(c(0,0,1),2))
# singular, columns b2 and c3 are extra
a <- qr(x)
qr.R(a) # columns are int b1 c1 c2 b2 c3
a$pivot
all.equal(x, qr.Q(a) \%*\% qr.R(a))# no
all.equal(x[, a$pivot], qr.Q(a) \%*\% qr.R(a)) # yes
}
\keyword{algebra}
\keyword{array}

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] median and data frames

2011-04-30 Thread Tim Hesterberg
I also favor deprecating mean.data.frame.

One possible exception would be for a single-column data frame.
But even here I'd say no, lest people expect the same behavior for
median, var, ...

Pat's suggestion of using stop() would work nicely for mean.
(but omit paste; stop() pastes its arguments itself).
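
A minimal sketch of such a method (hypothetical; mean shown, median
would be analogous):

mean.data.frame <- function(x, ...)
  stop("you probably mean to use the command: sapply(",
       deparse(substitute(x)), ", mean)")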

Tim Hesterberg

If Martin's proposal is accepted, does
that mean that the median method for
data frames would be something like:

function (x, ...)
{
 stop(paste("you probably mean to use the command: sapply(",
 deparse(substitute(x)), ", median)", sep=""))
}

Pat


On 29/04/2011 15:25, Martin Maechler wrote:
 Paul Johnson pauljoh...@gmail.com
  on Thu, 28 Apr 2011 00:20:27 -0500 writes:

On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
pbu...@pburns.seanet.com  wrote:
Here are some data frames:
  
df3.2 <- data.frame(1:3, 7:9)
df4.2 <- data.frame(1:4, 7:10)
df3.3 <- data.frame(1:3, 7:9, 10:12)
df4.3 <- data.frame(1:4, 7:10, 10:13)
df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
  
Now here are some commands and their answers:

> median(df4.4)
[1]  8.5 11.5
> median(df3.2[c(1,2,3),])
[1] 2 8
> median(df3.2[c(1,3,2),])
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA
  
  
  
The sessionInfo is below, but it looks
to me like the present behavior started
in 2.10.0.
  
Sometimes it gets the right answer.  I'd
be grateful to hear how it does that -- I
can't figure it out.
  

Hello, Pat.

Nice poetry there!  I think I have an actual answer, as opposed to 
 the
usual crap I spew.

I would agree if you said median.data.frame ought to be written to
work columnwise, similar to mean.data.frame.

apply and sapply  always give the correct answer

apply(df3.3, 2, median)
X1.3   X7.9 X10.12
2  8 11

  [...]

 exactly

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

 exactly.

 My personal opinion is that  mean.data.frame() should never have
 been written.
 People should know, or learn, to use apply functions for such a
 task.

 The unfortunate fact that mean.data.frame() exists makes people
 think that median.data.frame() should too,
 and then

var.data.frame()
 sd.data.frame()
mad.data.frame()
min.data.frame()
max.data.frame()
...
...

 all just in order *not* to have to know  sapply()
 

 No, rather not.

 My vote is for deprecating  mean.data.frame().

 Martin


--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] matrixStats: Extend to arrays too (Was: Re: Suggestion: Adding quick rowMin and rowMax functions to base package)

2011-02-16 Thread Tim Hesterberg
For consistency with rowSums colSums rowMeans etc., the names should be
colMins colMaxs
rowMins rowMaxs
This is also consistent with S+.

FYI, the rowSums naming convention was chosen to avoid conflict
with rowsum (which computes column sums!).

Tim Hesterberg

 A well-designed API generalized to work with arrays should probably
 borrow ideas from how argument 'MARGIN' of apply() works, and how argument
 'dims' in rowSums() works (though I must say the latter seems a bit ad hoc
 at first sight given the name of the function).  There may also be
 something to learn from the 'reshape' package and so on.

I'd also recommend looking at plyr::aaply, which fixes a few things
that have always annoyed me about apply - namely that it is not
idempotent/identical to aperm when the summary function is the
identity.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] aperm() should retain class of input object

2010-12-28 Thread Tim Hesterberg
Having aperm() return an object of the same class is dangerous, there
are undoubtedly classes for which that is not appropriate, producing an
illegal object for that class or quietly giving incorrect results.

Three alternatives are to:
* add the keep.class option but with default FALSE
* make aperm a generic function 
  - without a keep.class argument
  - with a ... argument
  - methods for classes like table could have keep.class = TRUE
* make aperm a generic function 
  - without a keep.class argument
  - with a ... argument
  - default method have keep.class = TRUE

The third option would give the proposed behavior by default, but
allow a way out for classes where the behavior is wrong.  This puts
the burden on a class author to realize the potential problem with
aperm, so my preference is one of the first two options.
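
A minimal sketch of the second option (generic aperm with a table
method; assuming the internal call keeps its current form):

aperm <- function(a, perm, ...) UseMethod("aperm")
aperm.default <- function(a, perm, resize = TRUE, ...) {
 if (missing(perm))
 perm <- integer(0L)
 .Internal(aperm(a, perm, resize))
}
aperm.table <- function(a, perm, resize = TRUE, keep.class = TRUE, ...) {
 result <- aperm.default(a, perm, resize)
 if (keep.class) class(result) <- class(a)
 result
}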

aperm() was designed for multidimensional arrays, but is also useful for
table objects, particularly
with the lattice, vcd and vcdExtra packages.  But aperm() was designed
and implemented before other
related object classes were conceived, and I propose a small tune-up to
make it more generally useful.

The problem is that  aperm() always returns an object of class 'array',
which causes problems for methods
designed for table objects. It also requires some package writers to
implement both .array and .table
methods for the same functionality, usually one in terms of the other.
Some examples of unexpected, and initially perplexing results (when only
methods for one class are implemented)
are shown below.


> library(vcd)
> pairs(UCBAdmissions, shade=TRUE)
> UCB <- aperm(UCBAdmissions, c(2, 1, 3))

> # UCB is now an array, not a table
> pairs(UCB, shade=TRUE)
There were 50 or more warnings (use warnings() to see the first 50)

> # fix it, to get pairs.table
> class(UCB) <- "table"
> pairs(UCB, shade=TRUE)
 



Of course, I can define a new function, tperm() that does what I think
should be the expected behavior:

# aperm, for table objects

tperm <- function(a, perm, resize = TRUE) {
 result <- aperm(a, perm, resize)
 class(result) <- class(a)
 result
}

But I think it is more natural to include this functionality in aperm()
itself.  Thus, I propose the following
revision of base::aperm(), at the R level:

aperm <- function (a, perm, resize = TRUE, keep.class=TRUE)
{
 if (missing(perm))
 perm <- integer(0L)
 result <- .Internal(aperm(a, perm, resize))
 if(keep.class) class(result) <- class(a)
 result
}


I don't think this would break any existing code, except where someone
depended on coercion to an array.
The drop-in replacement for aperm would set keep.class=FALSE by default,
but I think TRUE is more natural.

FWIW, here are the methods for table and array objects
from my current (non-representative) session.

> methods(class="table")
  [1] as.data.frame.table barchart.table* cloud.table*
contourplot.table*  dotplot.table*
  [6] head.table* levelplot.table*pairs.table*
plot.table* print.table
[11] summary.table   tail.table*

Non-visible functions are asterisked
 
> methods(class="array")
[1] anyDuplicated.array as.data.frame.array as.raster.array*
barchart.array* contourplot.array*  dotplot.array*
[7] duplicated.arraylevelplot.array*unique.array


--
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University  Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele StreetWeb:http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Using sample() to sample one value from a single value?

2010-11-04 Thread Tim Hesterberg
On Wed, Nov 3, 2010 at 3:54 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:

 Hi, consider this one as an FYI, or a seed for further discussion.

 I am aware that many traps on sample() have been reported over the
 years.  I know that these are also documents in help(sample).  Still
 I got bitten by this while writing
...
 All of the above makes sense when one study the code of sample(), but
 sample() is indeed dangerous, e.g. imagine how many bootstrap
 estimates out there quietly gets incorrect.

Nonparametric bootstrapping from a sample of size 1 is always incorrect.
If you draw a single observation from a sample of size 1, you get that
same observation back.  This implies zero sampling variability, which
is wrong.  If this single sample represents one stratum or sample in
a larger problem, this would contribute zero variability to the overall
result, again wrong.

In general, the ordinary bootstrap underestimates variability in
small samples.  For a sample mean, the ordinary bootstrap corresponds
to using an estimate of variance equal to (1/n) sum((x - mean(x))^2),
instead of a divisor of n-1.  In stratified and multi-sample applications
the downward bias is similarly (n-1)/n.

Three remedies are:
* draw bootstrap samples of size n-1
* bootknife sampling - omit one observation (a jackknife sample), then
  draw a bootstrap sample of size n from that
* bootstrap from a kernel density estimate, with kernel covariance equal
  to empirical covariance (with divisor n-1) / n.
The latter two are described in 
Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling vs. 
Smoothing, Proceedings of the Section on Statistics and the Environment, 
American Statistical Association, 2924-2930.
http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf
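
A minimal sketch of bootknife sampling (hypothetical function name;
assumes a sample of size n >= 2):

bootknifeSample <- function(x) {
 n <- length(x)
 jack <- x[-sample(n, 1)]         # jackknife: omit one observation at random
 sample(jack, n, replace = TRUE)  # bootstrap sample of size n from the n-1 left
}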

All three are undefined for samples of size 1.  You need to go to some
other bootstrap, e.g. a parametric bootstrap with variability estimated
from other data.

Tim Hesterberg

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] suggest enhancement to segments and arrows to facilitate horizontal and vertical segments

2009-10-02 Thread Tim Hesterberg
I suggest a simple enhancement to segments() and arrows() to
facilitate drawing horizontal and vertical segments --
set default values for the second x and y arguments equal to the first set.
This is handy, especially when the expressions for coordinates are long.

Compare:

Segments:
 function (x0, y0, x1 = x0, y1 = y0, col = par("fg"), lty = par("lty"),
---
 function (x0, y0, x1, y1, col = par("fg"), lty = par("lty"),

Arrows:
 function (x0, y0, x1 = x0, y1 = y0, length = 0.25, angle = 30, code = 2,
---
 function (x0, y0, x1, y1, length = 0.25, angle = 30, code = 2,
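
For example, vertical segments (such as error bars) would then need only
the coordinates that change; a sketch, with hypothetical x, lower, upper:

segments(x0 = x, y0 = lower, y1 = upper)          # with the proposed defaults
segments(x0 = x, y0 = lower, x1 = x, y1 = upper)  # what is needed today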

Tim Hesterberg

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Faster as.data.frame save copy by doing names(x) - NULL only if needed

2009-07-14 Thread Tim Hesterberg
A number of as.data.frame methods do

names(x) <- NULL

Replacing that with

if(!is.null(names(x)))
  names(x) <- NULL

appears to save making one copy of the data
(based on tracemem and Rprofmem in a copy of R compiled
with --enable-memory-profiling)
and gives a modest but consistent boost in speed, e.g.:

#   old new
#   user  system elapseduser  system elapsed
# integer   3.412   0.060   3.472   2.788   0.020   2.809
# numeric   6.212   0.160   6.374   4.852   0.080   5.132
# logical   3.484   0.052   3.699   2.808   0.028   2.834
# factor4.433   0.020   4.547   2.929   0.020   2.964

These visible methods can be modified as noted above:
  as.data.frame.Date
  as.data.frame.POSIXct
  as.data.frame.complex
  as.data.frame.difftime
  as.data.frame.factor
  as.data.frame.integer
  as.data.frame.logical
  as.data.frame.numeric
  as.data.frame.numeric_version
  as.data.frame.ordered
  as.data.frame.raw
  as.data.frame.vector


Here's the timing code (run in a copy of R without memory profiling):

x <- 1:10^4 # integer
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)
x <- x + 0.0    # numeric
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)
x <- rep(c(TRUE,FALSE), length = 10^4)  # logical
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)
x <- factor(rep(letters[1:10], length=10^4))    # factor
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)

I have not done timings where the inputs have names;
that is rare in my experience.
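
One way to see the extra copy directly (assumes a copy of R compiled
with --enable-memory-profiling, as above):

x <- 1:10^6 + 0.0
tracemem(x)
names(x) <- NULL   # reports a duplication even though x has no names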

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] (PR#8192) [ subscripting sometimes loses names

2009-02-01 Thread Tim Hesterberg
...
Simon, no, the drop=FALSE argument has nothing to do with what
Christian was talking about.  The kind of thing he meant is PR# 8192,
Subject: [ subscripting sometimes loses names:

  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192

In R, subscripting with [ USUALLY retains names, but R has various
edge cases where it (IMNSHO) inappropriately discards them.  This
occurs with both .Primitive("[") and [.data.frame.  This has been
known for years, but I have not yet tried digging into R's
implementation to see where and how the names are actually getting
lost.

Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
in 2001 show similar buggy edge case behavior.  Older versions of
S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
behavior.  I presume that the original Bell Labs S had correct
name-preserving behavior, and then the S-Plus developers broke it
sometime along the way.

(Later comments on the thread pointed out the difference between
x[,1] for matrices and data frames.)

I rewrote the S-PLUS data frame code around then, to fix
various inconsistencies and improve efficiency.
This was probably my change, and I would do it again.

Note that the components of a data frame do not have names
attached to them; the row names are a separate object.
Extracting a component vector or matrix from a data frame should not
attach names to the result, because of:
* memory (attaching row names to an object can more than double the
  size of the object),
* speed
* some objects cannot take names, and attaching them could change
  the class and other behavior of an object, and
* the names are usually/often (depending on the user) meaningless,
  artifacts of an early design decision that all data frames have row names.

Data frames differ from matrices in two ways that matter here:
* columns in matrices are all the same kind, and are simple objects
  (numeric, etc.), whereas components of data frames can be nearly
  arbitrary objects, and
* row names get added to a data frame whether a user wants them or not,
  whereas row names on a matrix have to be specified.

A historical note - unique row names on data frame were a design
decision made when people worked with small data frames, and are
convenient for small data frames.  But they are a problem for large
data frames.  I was writing for all users, not just those with small
data frames and meaningful names.

I like R's 'automatic' row names.  This is a big help working with
huge data frames (and I do this often, at Google).  But this doesn't
go far enough; subscripting and other operations sometimes convert the
automatic names to real names, and check/enforce uniqueness, which is
a big waste of time when working with large data frames.  I'll comment
more on this in a new thread.

Tim Hesterberg

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] non-duplicate names in data frames

2009-02-01 Thread Tim Hesterberg
I wrote on another thread
(with subject [ subscripting sometimes loses names):
I like R's 'automatic' row names.  This is a big help working with
huge data frames (and I do this often, at Google).  But this doesn't
go far enough; subscripting and other operations sometimes convert the
automatic names to real names, and check/enforce uniqueness, which is
a big waste of time when working with large data frames.  I'll comment
more on this in a new thread.

I propose (and have begun writing, in my copious spare time):
* an optional argument to data.frame and other data frame creation code
* resulting in an attribute added to the data.frame
* so that subscripting and other operations on the data frame
  * always keep artificial row names
  * do not have to check for unique row names in the result.

My current thoughts, comments welcome:

Argument name and component name 'dup.row.names'
0 or FALSE or NULL - current, require unique names
1 or TRUE  - duplicates allowed (when subscripting etc.)
2  - always automatic   (when subscripting etc.)

Option maxRowNames, default say 10^4
Any data frames with more than this have dup.row.names default to 2.

The name 'dup.row.names' is for consistency with S+; there the options
are NULL, F or T.
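
Hypothetical usage under this proposal (the argument does not exist yet):

d <- data.frame(x = rnorm(10^6), dup.row.names = 2)  # always automatic row names
d2 <- d[sample(nrow(d), replace = TRUE), ]   # no row-name uniqueness check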

Tim Hesterberg

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] ifelse

2008-08-25 Thread Tim Hesterberg
Others have commented on why this holds.

There is an alternative, 'ifelse1', part of the splus2R package, that
does what you'd like here.
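
The core of ifelse1 is just an unvectorized if/else, so the chosen
argument is returned unchanged; roughly:

ifelse1 <- function(test, x, y) if (test) x else y
ifelse1(TRUE, character(0), "")   # character(0), not NA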

Tim Hesterberg

I find it slightly surprising, that
   ifelse(TRUE, character(0), "")
returns NA instead of character(0). 

-- 
Heikki Kaskelma

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [.data.frame speedup

2008-07-03 Thread Tim Hesterberg
I made a couple of a changes from the previous version:
 - don't use functions anyMissing or notSorted (which aren't in base R)
 - don't check for dup.row.names attribute (need to modify other functions
   before that is useful)
I have not tested this with a wide variety of inputs; I'm assuming that
you have some regression tests.

Here are the file differences.  Let me know if you'd like a different
format.

$ diff -c dataframe.R dataframe2.R
*** dataframe.R   Thu Jul  3 15:48:12 2008
--- dataframe2.R  Thu Jul  3 16:36:46 2008
***
*** 530,535 
--- 530,541 
  x <- .Call("R_copyDFattr", xx, x, PACKAGE="base")
  oldClass(x) <- attr(x, "row.names") <- NULL

+ # Do not want to check for duplicates if don't need to
+ noDuplicateRowNames <- (is.logical(i) ||
+ length(i) < 2 ||
+ (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
+ (!any(is.na(i)) && all(i[-length(i)] < i[-1])))
+
  if(!missing(j)) { # df[i, j]
  x <- x[j]
  cols <- names(x)  # needed for 'drop'
***
*** 579,592 
  ## row names might have NAs.
  if(is.null(rows)) rows <- attr(xx, "row.names")
  rows <- rows[i]
! if((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
! ## both will coerce integer 'rows' to character:
! if (!dup && is.character(rows)) dup <- NA %in% rows
! if(ina)
! rows[is.na(rows)] <- NA
! if(dup)
! rows <- make.unique(as.character(rows))
! }
  ## new in 1.8.0  -- might have duplicate columns
  if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
  if(is.null(rows)) rows <- attr(xx, "row.names")[i]
--- 585,594 
  ## row names might have NAs.
  if(is.null(rows)) rows <- attr(xx, "row.names")
  rows <- rows[i]
! if(any(is.na(rows)))
!   rows[is.na(rows)] <- NA # coerces to integer
! if(!noDuplicateRowNames && any(duplicated(rows)))
!   rows <- make.unique(as.character(rows)) # coerces to integer
  ## new in 1.8.0  -- might have duplicate columns
  if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
  if(is.null(rows)) rows <- attr(xx, "row.names")[i]



Here's some code for testing, and timings

# Use:
# R --no-init-file --no-site-file

x <- data.frame(a=1:4, b=2:5)

# Run these commands with the default and new versions of [.data.frame
trace(duplicated)
trace(make.unique)
x[2:1]
x[1]
x[1:2]
x[1:3, ]# save one call to duplicated(rows)
x[c(T,F,F,T), ] # save one call to duplicated(rows)
x[-1,]  # save one call to duplicated(rows)
x[-(1:2),]  # save one call to duplicated(rows)
x[3:1, ]
x[c(1,3,2,4,3), ]
untrace(duplicated)
untrace(make.unique)


# Timings
# Run one of these lines, then everything afterward
n <- 10^5
n <- 10^6
n <- 10^7

y <- data.frame(a=1:n, b=1:n)

i <- 1:n
system.time(temp <- y[i, ])
#   n   old new
#   10^5.128.052
#   10^6.237.591
#   10^73.102.882

i <- rep(TRUE, n)
system.time(temp <- y[i, ])
#   n   old new
#   10^5.157.053
#   10^6.787.449
#   10^73.799   2.138

i <- -1
system.time(temp <- y[i, ])
#   n   old new
#   10^5.157.051
#   10^6.614.497
#   10^74.163   2.482

i <- rep(1:(n/2), 2) # expect no speedup for this case
system.time(temp <- y[i, ])
#   n   old new
#   10^5.559.782
#   10^66.066   6.078

# Times shown are the user times reported by system.time

# The time savings are mostly quite substantial in the
# cases I expect a savings.

# I've noticed a lot of variability in results from system.time,
# so I don't view these as very accurate, and I don't worry
# much about the cases where the time appears worse.


On Thu, Jul 3, 2008 at 1:08 PM, Martin Maechler [EMAIL PROTECTED]
wrote:

  TH == Tim Hesterberg [EMAIL PROTECTED]
  on Tue, 1 Jul 2008 15:23:53 -0700 writes:

TH There is a bug in the standard version of [.data.frame;
TH it mixes up handling duplicates and NAs when subscripting rows.

TH x <- data.frame(x=1:3, y=2:4, row.names=c("a","b",NA))
TH y <- x[c(2:3, NA),]
TH y

TH It creates a data frame with duplicate rows, but won't print.

 and that's a bug, indeed
 (introduced to R version 2.5.0, when the [.data.frame  code was much
 optimized for speed, with quite some care), and I have commited
 a fix (and a regression test) to both R-devel and R-patched.

 Thanks a lot for the bug report, Tim!

 Now about your newly proposed code:
 I'm sorry to say that it looks so much different from the source
 code in
  https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
 that I don't think we would accept it as a substitute, easily.

 Could you try to provide a minimal patch against the source code
 and also a selfcontained example

[Rd] [.data.frame speedup

2008-07-01 Thread Tim Hesterberg
Below is a version of [.data.frame that is faster
for subscripting rows of large data frames; it avoids calling
duplicated(rows)
if there is no need to check for duplicate row names, when:
i is logical
attr(x, "dup.row.names") is not NULL (S+ compatibility)
i is numeric and negative
i is strictly increasing


"[.data.frame" <-
function (x, i, j,
  drop = if (missing(i)) TRUE else length(cols) == 1)
{
  # This version of [.data.frame avoids wasting time enforcing unique
  # row names.
  mdrop <- missing(drop)
  Narg <- nargs() - (!mdrop)
  if (Narg < 3) {
    if (!mdrop)
      warning("drop argument will be ignored")
    if (missing(i))
      return(x)
    if (is.matrix(i))
      return(as.matrix(x)[i])
    y <- NextMethod("[")
    cols <- names(y)
    if (!is.null(cols) && any(is.na(cols)))
      stop("undefined columns selected")
    if (any(duplicated(cols)))
      names(y) <- make.unique(cols)
    return(structure(y, class = oldClass(x),
                     row.names = .row_names_info(x, 0L)))
  }
  if (missing(i)) {
    if (missing(j) && drop && length(x) == 1L)
      return(.subset2(x, 1L))
    y <- if (missing(j))
      x
    else .subset(x, j)
    if (drop && length(y) == 1L)
      return(.subset2(y, 1L))
    cols <- names(y)
    if (any(is.na(cols)))
      stop("undefined columns selected")
    if (any(duplicated(cols)))
      names(y) <- make.unique(cols)
    nrow <- .row_names_info(x, 2L)
    if (drop && !mdrop && nrow == 1L)
      return(structure(y, class = NULL, row.names = NULL))
    else return(structure(y, class = oldClass(x),
                          row.names = .row_names_info(x, 0L)))
  }
  xx <- x
  cols <- names(xx)
  x <- vector("list", length(x))
  x <- .Call("R_copyDFattr", xx, x, PACKAGE = "base")
  oldClass(x) <- attr(x, "row.names") <- NULL
  # Do not want to check for duplicates if don't need to
  noDuplicateRowNames <- (is.logical(i) ||
                          (!is.null(attr(x, "dup.row.names"))) ||
                          (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
                          (!notSorted(i, strict = TRUE)))
  if (!missing(j)) {
    x <- x[j]
    cols <- names(x)
    if (drop && length(x) == 1L) {
      if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
      }
      xj <- .subset2(.subset(xx, j), 1L)
      return(if (length(dim(xj)) != 2L) xj[i] else xj[i,
             , drop = FALSE])
    }
    if (any(is.na(cols)))
      stop("undefined columns selected")
    nxx <- structure(seq_along(xx), names = names(xx))
    sxx <- match(nxx[j], seq_along(xx))
  }
  else sxx <- seq_along(x)
  rows <- NULL
  if (is.character(i)) {
    rows <- attr(xx, "row.names")
    i <- pmatch(i, rows, duplicates.ok = TRUE)
  }
  for (j in seq_along(x)) {
    xj <- xx[[sxx[j]]]
    x[[j]] <- if (length(dim(xj)) != 2L)
      xj[i]
    else xj[i, , drop = FALSE]
  }
  if (drop) {
    n <- length(x)
    if (n == 1L)
      return(x[[1L]])
    if (n > 1L) {
      xj <- x[[1L]]
      nrow <- if (length(dim(xj)) == 2L)
        dim(xj)[1L]
      else length(xj)
      drop <- !mdrop && nrow == 1L
    }
    else drop <- FALSE
  }
  if (!drop) {
    if (is.null(rows))
      rows <- attr(xx, "row.names")
    rows <- rows[i]
    if ((ina <- any(is.na(rows))) | (dup <- !noDuplicateRowNames &&
        any(duplicated(rows)))) {
      if (ina)
        rows[is.na(rows)] <- NA
      if (dup)
        rows <- make.unique(as.character(rows))
    }
    if (any(duplicated(nm <- names(x))))
      names(x) <- make.unique(nm)
    if (is.null(rows))
      rows <- attr(xx, "row.names")[i]
    attr(x, "row.names") <- rows
    oldClass(x) <- oldClass(xx)
  }
  x
}

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [.data.frame speedup

2008-07-01 Thread Tim Hesterberg
There is a bug in the standard version of [.data.frame;
it mixes up handling duplicates and NAs when subscripting rows.
  x <- data.frame(x=1:3, y=2:4, row.names=c("a","b",NA))
  y <- x[c(2:3, NA),]
  y
It creates a data frame with duplicate rows, but won't print.

In the previous message I included a version of [.data.frame;
it fails for the same example, for a different reason.  Here
is a fix.


subscript.data.frame <-
function (x, i, j,
  drop = if (missing(i)) TRUE else length(cols) == 1)
{
  # This version of [.data.frame avoids wasting time enforcing unique
  # row names if possible.
  mdrop <- missing(drop)
  Narg <- nargs() - (!mdrop)
  if (Narg < 3) {
    if (!mdrop)
      warning("drop argument will be ignored")
    if (missing(i))
      return(x)
    if (is.matrix(i))
      return(as.matrix(x)[i])
    y <- NextMethod("[")
    cols <- names(y)
    if (!is.null(cols) && any(is.na(cols)))
      stop("undefined columns selected")
    if (any(duplicated(cols)))
      names(y) <- make.unique(cols)
    return(structure(y, class = oldClass(x),
                     row.names = .row_names_info(x, 0L)))
  }
  if (missing(i)) {
    if (missing(j) && drop && length(x) == 1L)
      return(.subset2(x, 1L))
    y <- if (missing(j))
      x
    else .subset(x, j)
    if (drop && length(y) == 1L)
      return(.subset2(y, 1L))
    cols <- names(y)
    if (any(is.na(cols)))
      stop("undefined columns selected")
    if (any(duplicated(cols)))
      names(y) <- make.unique(cols)
    nrow <- .row_names_info(x, 2L)
    if (drop && !mdrop && nrow == 1L)
      return(structure(y, class = NULL, row.names = NULL))
    else return(structure(y, class = oldClass(x),
                          row.names = .row_names_info(x, 0L)))
  }
  xx <- x
  cols <- names(xx)
  x <- vector("list", length(x))
  x <- .Call("R_copyDFattr", xx, x, PACKAGE = "base")
  oldClass(x) <- attr(x, "row.names") <- NULL
  # Do not want to check for duplicates if don't need to
  noDuplicateRowNames <- (is.logical(i) ||
                          (!is.null(attr(x, "dup.row.names"))) ||
                          (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
                          (!anyMissing(i) && !notSorted(i, strict = TRUE)))
  if (!missing(j)) {
    x <- x[j]
    cols <- names(x)
    if (drop && length(x) == 1L) {
      if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
      }
      xj <- .subset2(.subset(xx, j), 1L)
      return(if (length(dim(xj)) != 2L) xj[i] else xj[i,
             , drop = FALSE])
    }
    if (any(is.na(cols)))
      stop("undefined columns selected")
    nxx <- structure(seq_along(xx), names = names(xx))
    sxx <- match(nxx[j], seq_along(xx))
  }
  else sxx <- seq_along(x)
  rows <- NULL
  if (is.character(i)) {
    rows <- attr(xx, "row.names")
    i <- pmatch(i, rows, duplicates.ok = TRUE)
  }
  for (j in seq_along(x)) {
    xj <- xx[[sxx[j]]]
    x[[j]] <- if (length(dim(xj)) != 2L)
      xj[i]
    else xj[i, , drop = FALSE]
  }
  if (drop) {
    n <- length(x)
    if (n == 1L)
      return(x[[1L]])
    if (n > 1L) {
      xj <- x[[1L]]
      nrow <- if (length(dim(xj)) == 2L)
        dim(xj)[1L]
      else length(xj)
      drop <- !mdrop && nrow == 1L
    }
    else drop <- FALSE
  }
  if (!drop) {
    if (is.null(rows))
      rows <- attr(xx, "row.names")
    rows <- rows[i]
    if(any(is.na(rows)))
      rows[is.na(rows)] <- NA
    if(!noDuplicateRowNames && any(duplicated(rows)))
      rows <- make.unique(as.character(rows))
    if (any(duplicated(nm <- names(x))))
      names(x) <- make.unique(nm)
    if (is.null(rows))
      rows <- attr(xx, "row.names")[i]
    attr(x, "row.names") <- rows
    oldClass(x) <- oldClass(xx)
  }
  x
}

# That requires anyMissing from the splus2R package,
# plus notSorted (or a version of is.unsorted with argument 'strict' added).

notSorted <- function(x, decreasing = FALSE, strict = FALSE, na.rm = FALSE){
  # return TRUE if x is not sorted
  # If decreasing=FALSE, check for sort in increasing order
  # If strict=TRUE, ties correspond to not being sorted
  n <- length(x)
  if(length(n) < 2)
    return(FALSE)
  if(!is.atomic(x) || (!na.rm && any(is.na(x))))
    return(NA)
  if(na.rm && any(ii <- is.na(x)))
    x <- x[!ii]
  if(decreasing){
    ifelse1(strict,
            any(x[-1] >= x[-n]),
            any(x[-1] >  x[-n]))
  } else { # check for sort in increasing order
    ifelse1(strict,
            any(x[-1] <= x[-n]),
            any(x[-1] <  x[-n]))
  }
}


On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED]
wrote:

 Below is a version of [.data.frame that is faster
 for subscripting rows of large data frames; it avoids calling
 duplicated(rows)
 if there is no need to check for duplicate row names, when:
 i is logical
 attr(x, "dup.row.names") is not NULL (S+ compatibility)
 i is numeric and negative
 i is strictly increasing


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch

Re: [Rd] [.data.frame speedup

2008-07-01 Thread Tim Hesterberg
Here is a revised version of notSorted; change argument order (to be more
like
is.unsorted) and fix blunder.

notSorted <- function(x, na.rm = FALSE, decreasing = FALSE, strict = FALSE){
  # return TRUE if x is not sorted
  # If decreasing=FALSE, check for sort in increasing order
  # If strict=TRUE, ties correspond to not being sorted
  n <- length(x)
  if(n < 2)
    return(FALSE)
  if(!is.atomic(x) || (!na.rm && any(is.na(x))))
    return(NA)
  if(na.rm && any(ii <- is.na(x))){
    x <- x[!ii]
    n <- length(x)
  }
  if(decreasing){
    ifelse1(strict,
            any(x[-1] >= x[-n]),
            any(x[-1] >  x[-n]))
  } else { # check for sort in increasing order
    ifelse1(strict,
            any(x[-1] <= x[-n]),
            any(x[-1] <  x[-n]))
  }
}


On Tue, Jul 1, 2008 at 3:23 PM, Tim Hesterberg [EMAIL PROTECTED]
wrote:

 There is a bug in the standard version of [.data.frame;
 it mixes up handling duplicates and NAs when subscripting rows.
    x <- data.frame(x=1:3, y=2:4, row.names=c("a","b",NA))
    y <- x[c(2:3, NA),]
    y
 It creates a data frame with duplicate rows, but won't print.

 In the previous message I included a version of [.data.frame;
 it fails for the same example, for a different reason.  Here
 is a fix.


 (quoted copy of subscript.data.frame omitted; identical to the version
 in the previous message)

 (first version of notSorted is omitted)


 On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED]
 wrote:

 Below is a version of [.data.frame that is faster
 for subscripting rows of large data frames; it avoids calling

Re: [Rd] (PR#11537) help (using ?) does not handle trailing whitespace

2008-05-31 Thread Tim Hesterberg
By whitespace, I mean either a space or tab (preceding the newline).

I'm using ESS:
ess-version's value is 5.3.6
GNU Emacs 21.4.1 (i486-pc-linux-gnu, X toolkit, Xaw3d scroll bars) of
2007-08-28 on terranova, modified by Debian

I have the following in my .emacs:
(load "ess-5.3.6/lisp/ess-site")
(setq ess-tab-always-indent nil)
(setq ess-fancy-comments nil)

I have not edited ess-site.el


On Fri, May 30, 2008 at 12:26 PM, Prof Brian Ripley
[EMAIL PROTECTED] wrote:
 We don't know how to reproduce this: 'whitespace' is not specific enough.

 R's tokenizer breaks input at spaces, so a space would never be part of that
 expression.  And tabs don't even get to the parser in interactive use, and
 you cannot mean a newline.  So exactly what do you mean by 'whitespace'?

 The character in your email as received here is an ASCII space, and that is
 used to end the token on all my systems.  That's not to say that you didn't
 type something else that looks like a space (e.g. a nbspace) since email
 systems are fickle.

 None of my guesses worked, so we need precise reproduction instructions.

 On Thu, 29 May 2008, [EMAIL PROTECTED] wrote:

 ?agrep


 Results in:

 No documentation for 'agrep ' in specified packages and libraries:
 you could try 'help.search("agrep ")'

 There is white space after agrep, that ? doesn't ignore.


 --please do not edit the information below--

 Version:
 platform = i486-pc-linux-gnu
 arch = i486
 os = linux-gnu
 system = i486, linux-gnu
 status =
 major = 2
 minor = 7.0
 year = 2008
 month = 04
 day = 22
 svn rev = 45424
 language = R
 version.string = R version 2.7.0 (2008-04-22)

 Locale:

 LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

 Search Path:
 .GlobalEnv, package:stats, package:graphics, package:grDevices,
 package:utils, package:datasets, package:showStructure, package:Rcode,
 package:splus2R, package:methods, Autoloads, package:base

 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel


 --
 Brian D. Ripley,  [EMAIL PROTECTED]
 Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford, Tel:  +44 1865 272861 (self)
 1 South Parks Road, +44 1865 272866 (PA)
 Oxford OX1 3TG, UKFax:  +44 1865 272595


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Standard method for S4 object

2008-02-25 Thread Tim Hesterberg
It depends on what the object is to be used for.

If you want users to be able to operate with the object as if it
were a normal vector, to do things like mean(x), cos(x), etc.
then the list would be very long indeed; for example, there are
225 methods for the S4 'bdVector' class (in S-PLUS), plus additional
methods defined for inheriting classes.

In cases like this you might prefer using an S3 class, using
attributes rather than slots for auxiliary information, so that
you don't need to write so many methods.
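
A quick illustration of why (hypothetical class, not from any package):

x <- structure(c(21, 65, 43), units = "years", class = "agevec")
mean(x)   # works with no method written
cos(x)    # also works; only print, summary, plot, etc. need methods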

Tim Hesterberg

I am defining a new class. Shortly, I will submit a package with it.
Before then, I would like to know if there is a kind of unofficial list
of the methods a new S4 object should have.
More precisely, personally, I use 'print', 'summary' and 'plot' a lot. 
So for my new class, I define these 3 methods and of course, a get and a 
set for each slot. What else? Is there some other methods that a R user 
can reasonably expect? Some minimum basic tools...

Thanks

Christophe

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Standard method for S4 object

2008-02-25 Thread Tim Hesterberg
Tim Hesterberg wrote:
 It depends on what the object is to be used for.
 
 If you want users to be able to operate with the object as if it
 were a normal vector, to do things like mean(x), cos(x), etc.
 then the list would be very long indeed; for example, there are
 225 methods for the S4 'bdVector' class (in S-PLUS), plus additional
 methods defined for inheriting classes.

This somehow undermines the whole idea of inheritance. If you do not 
inherit, then you are just implementing a class that mimics another one 
from scratch. However, the question then is not about standard methods 
any more, it's about the methods of the class that you mimic.

My experience with S4 classes is primarily with classes that had
to be implemented from scratch; there was nothing one could inherit
from - bdFrame and bdVector in library(bigdata), miVariable in
library(missing) (sorry, these are S-PLUS only).

Actually, for miVariable we considered S3 class + attributes, but
in this case we decided that we did NOT want operations like mean(x)
to work without going through a method specifically for the class.

... example of Image class omitted here 

 In cases like this you might prefer using an S3 class, using
 attributes rather than slots for auxiliary information, so that
 you don't need to write so many methods.

The reasoning here is not really clear. Could you please explain why
this is better?

Three examples.  First is bs:
> library(splines)
> bsx <- bs(1:99, knots = 10 * 2:6)
> showStructure(bsx)
numeric[99,8]  S3 class: bs basis
  attributes: dimnames
  degree          scalar             class: integer
  knots           numeric[length 5]  class: numeric
  Boundary.knots  numeric[length 2]  class: integer
  intercept       logical[length 1]  class: logical

(I plan to add showStructure to library(splus2R) shortly.)

This is an S3 class, a matrix plus some additional attributes.
Everything that works for a matrix works for this object,
without needing additional classes.

A second example is label:
> library(Hmisc)
> age <- c(21, 65, 43)
> label(age) <- "Age in Years"
> showStructure(age)
numeric[length 3]  S3 class: labelled
  label  character[length 1]  class: character
> cos(age)
Age in Years
[1] -0.5477293 -0.5624539  0.5551133

Another S3 class: basically any object plus a label attribute.
There are a few methods for this class; otherwise it works
out of the box.

The third is lm - a list with an S3 class.  Functions that
operate on lists work fine without extra methods.  And you can
add extra components without needing to define a new class
(I've done this in library(resample)).
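
For example (the extra component name "note" is arbitrary):

fit <- lm(dist ~ speed, data = cars)
fit$note <- "fit on the cars data"  # extra list component, no new class
coef(fit)      # existing "lm" methods are unaffected
summary(fit)   # extra components are simply ignored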



Re: [Rd] Standard method for S4 object

2008-02-25 Thread Tim Hesterberg
Hi Oleg,

If there is a class to inherit from, then my point about an S4 class
requiring lots of methods is moot.  I think it would then come down to
whether one prefers flexibility (advantage S3) or a definite structure
for use with C/C++ (advantage S4).

Tim

Well, I am not disputing that there are situations where one needs to
rewrite everything from scratch. However, it is always worth at least
considering inheritance if there is a candidate to inherit from. It
saves a lot of work.

Anyway, your examples of S3 class usage are obviously valid in the sense
that they are indeed S3 methods providing the desired functionality.
However, I still do not see WHY using attributes with S3 is better than
slots and S4 for structures like those inherited from 'array' or
similar. S3 gives more freedom in assigning new attributes, but this
freedom also means that one has little control over the structure of an
object, making it, for example, more difficult to use with C/C++ code.
Are there any specific benefits in not using S4 and slots (apart from
some known performance issues)?



[Rd] Sampling with unequal probabilities

2008-02-07 Thread Tim Hesterberg
This is in follow-up to a thread on R-help with subject "Sampling".
I claim that R does the wrong thing by default when
sampling with unequal probabilities without replacement:
the selection probabilities are not proportional to 'prob'
for any draw after the first. I suggest that R do what S-PLUS now does
(though you're free to choose a better implementation).

What S-PLUS now does is:
If 'replace==TRUE' then sample with replacement.
Otherwise, sample without replacement or with minimal replacement,
according to the value of argument 'minimal'.
The default is 'minimal = (length(prob) > 1)'.
One can specify 'minimal = FALSE' for backward compatibility.

In the case of sampling with minimal replacement, duplicates
may occur whenever 'max(size*prob) > 1', and are guaranteed if
'max(size*prob) >= 2'.  You can think of drawing
'trunc(size*prob)' observations deterministically,
then drawing the remaining 'size - sum(trunc(size*prob))' observations
without replacement, with an adjusted prob vector.

The algorithm I used is relatively simple.  It is one of the
Brewer and Hanif algorithms (though I don't recall if they used
the final random shuffle).  Here's one description, or you
may prefer the description in
Pedro J. Saavedra (2005)
Comparison of Two Weighting Schemes for Sampling with Minimal Replacement 
http://www.amstat.org/Sections/Srms/Proceedings/y2005/Files/JSM2005-000882.pdf

In the case of minimal = TRUE (sampling with minimal replacement),
with unequal probabilities (see the R sketch below):
* scale prob to sum to 1,
* randomly sort the observations along with prob,
* let cprob = cumsum(prob),
* draw a systematic sample of size 'size' in (0,1):
  uniformVector <- (1:size - runif(1))/size
* observation i is selected if cprob[i-1] < uniformVector[j] <= cprob[i]
  for any j.
  In the case (size*max(prob) > 1), the number of times the observation
  is selected is the number of j's for which the inequalities hold.
* the selected observations are randomly sorted again.
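
Here is one sketch of that algorithm in R; the function name
sampleMinimal is made up, and this is not S-PLUS's actual
implementation:

sampleMinimal <- function(size, prob) {
  n <- length(prob)
  prob <- prob / sum(prob)                # scale prob to sum to 1
  perm <- sample.int(n)                   # randomly sort observations
  cprob <- cumsum(prob[perm])
  u <- (seq_len(size) - runif(1)) / size  # systematic sample in (0,1)
  # position i is selected once for each u[j] in (cprob[i-1], cprob[i]];
  # ties occur with probability 0
  pos <- findInterval(u, cprob) + 1
  counts <- tabulate(pos, nbins = n)
  out <- rep(perm, counts)                # map back to original labels
  out[sample.int(length(out))]            # randomly sort again
}

sampleMinimal(5, prob = c(.4, .3, .2, .05, .05))
# 5*0.4 = 2, so observation 1 is guaranteed to appear exactly twice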

Tim Hesterberg
Disclaimer - these are my opinions, not Insightful's.



Re: [Rd] S3 vs S4 for a simple package

2008-01-07 Thread Tim Hesterberg
Would you like existing functions such as mean, range, sum,
colSums, dim, apply, length, and many more to operate on the array of
numbers?  If so, use an S3 class.

If you would like to effectively disable such functions, to prevent
them from working on the object unless you write a method that specifies
exactly how the function should operate on the class, then either
use an S4 class, or an S3 class where the array is one component of
a list.
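
For instance (hypothetical class name "shielded"):

obj <- list(coef = array(1:8, c(2, 2, 2)))
class(obj) <- "shielded"
mean(obj)   # no longer touches the numbers: mean.default warns
            # and returns NA for a list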

An S3 class also allows for flexibility - you can add attributes,
or list components, without breaking things.
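
A minimal sketch of this S3 route for an array-backed class; the class
name multipol is borrowed from the question below, and the "+" rule is
only a placeholder, not real polynomial arithmetic:

as.multipol <- function(a) structure(a, class = "multipol")

print.multipol <- function(x, ...) {
  cat("multipol with coefficient array:\n")
  print(unclass(x), ...)
  invisible(x)
}

Ops.multipol <- function(e1, e2) {
  # .Generic holds the operator name ("+", "*", ...)
  switch(.Generic,
    "+" = as.multipol(unclass(e1) + unclass(e2)),  # placeholder rule
    stop(gettextf("'%s' not implemented for multipol", .Generic)))
}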

As for reassurance - I use S3 classes for almost everything, happily.
The one time I chose to use an S4 class I later regretted it.  This
was for objects containing multiple imputations, where I wanted to
prevent functions like mean() from working on the original data,
without filling in imputations.  The regret was because we later
realized that in some cases we wanted to add a call attribute or
component/slot so that update() would work.  If it had been an S3
object we could have done so, but as an S4 object we would have broken
existing objects of the class.
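
With an S3 list class the fix would have been easy; a sketch, with a
made-up constructor and class name:

makeMI <- function(data, m = 5) {
  obj <- list(data = data, m = m)
  obj$call <- match.call()       # update() looks for this component
  structure(obj, class = "mi")
}

Old "mi" objects without the call component remain valid; methods that
do not use it never notice the addition.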

Tim Hesterberg
Disclaimer - this is my personal opinion, not my employer's.

I am writing a package and need to decide whether to use S3 or S4.

I have a single class, multipol; this needs methods for "[" and "[<-",
and I also need a print (or show) method and methods for the arithmetic
operators +, -, *, /, ^.

In S4, an object of class multipol has one slot that holds an array.

Objects of class multipol require specific arithmetic operations;
a, b being multipols means that a+b and a*b are defined in peculiar
ways that make sense in the context of the package. I can also add
and multiply by scalars (vectors of length one).

My impression is that S3 is perfectly adequate for this task, although
I've not yet finalized the coding.

S4 seems to be overkill for such a simple system.

Can anyone give me some motivation for persisting with S4?

Or indeed reassure me that S3 is a good design decision?



Re: [Rd] hasNA() / anyNA()?

2007-08-14 Thread Tim Hesterberg
S-PLUS has an anyMissing() function, for which the default is:

anyMissing.default <- function(x) {
  length(which.na(x)) > 0
}

This is more efficient than any(is.na(x)) in the usual case that there
are few or no missing values.  There are methods for vectors that drop
to C code, and methods for data frames and other classes.
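
A sketch of how such methods can short-circuit (not the actual S-PLUS
code; anyMissing.default here stands in for the C version):

anyMissing <- function(x) UseMethod("anyMissing")
anyMissing.default <- function(x) any(is.na(x))
anyMissing.data.frame <- function(x) {
  for (col in x)                 # stop at the first column with an NA
    if (anyMissing(col)) return(TRUE)
  FALSE
}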

The code below seems to presume a list, and would be very slow for vectors.

For reasons of consistency between S-PLUS and R, I would ask that an R
function be called anyMissing rather than hasNA or anyNA.

Tim Hesterberg

is there a hasNA() / an anyNA() function in R?  Of course,

hasNA <- function(x) {
  any(is.na(x));
}

would do, but that would scan all elements in 'x' and then do the
test.  I'm looking for a more efficient implementation that returns
TRUE at the first NA, e.g.

hasNA <- function(x) {
  for (kk in seq(along=x)) {
    if (is.na(x[kk]))
      return(TRUE);
  }
  FALSE;
}

Cheers

Henrik



Re: [Rd] comment causes browser() to exit (PR#9063)

2006-07-07 Thread Tim Hesterberg
If I'm not mistaken, this works as documented.  ...

Thanks for the response.

The behavior with return is as documented -- hence my earlier
enhancement request.
The behavior with a comment is contrary to the documentation -- hence
this bug report.

On the second point -- help(browser) says:
 Anything else entered at the browser prompt is interpreted as an R
 expression to be evaluated in the calling environment: ...
whereas the actual behavior is to interpret a comment as equivalent to
'c', and to exit the browser.

On the first point -- I would like to argue that the behavior of
return (and comments) should be changed, at least as a user option.
Here is a typical piece of code from a function; note the use of
comments and blank lines to improve readability:

  # If statistic isn't a function or name of function or expression,
  # store it as an expression to pass to fit.func.
  substitute.stat <- substitute(statistic)
  if(!is.element(mode(substitute.stat), c("name", "function")))
    statistic <- substitute.stat

  # Get name of data.
  data.name <- substitute(data)
  if(!is.name(data.name))
    data.name <- "data"
  is.df.data <- is.data.frame(data)

  # How many observations, or subjects if sampling by subject?
  n <- nObservations <- numRows(data)  # n will change later if by subject

  # Save group or subject arguments?
  if(is.null(save.subject)) save.subject <- (n <= 1)
  if(is.null(save.group))   save.group   <- (n <= 1)

If I now stick a browser() in that function, and throw a line at a
time from the source file to R, it exits whenever I throw a blank line
or comment.  I try to remember to skip the blank lines and comments,
but I sometimes forget, and get very annoyed when I have to start over.

I could use c in some contexts, but not others:
* I often want to evaluate code that is not part of the
  defined function.
* I sometimes change objects and want to go evaluate some lines
  that were previously evaluated.

In the enhancement request I requested an option to turn off the current
behavior of return.  I personally would just change the default behavior,
and have both blank lines and comments do nothing.  This is simpler, and
people can always use c to quit the browser.

This behavior of browser() is the most annoying thing I've found
about using R.  As I anticipate using R a lot in the future, I would
appreciate very much if it is changed.  I spent a fair amount of time
trying to see if I could change it myself, but gave up.

Tim Hesterberg

Andy Liaw wrote:
If I'm not mistaken, this works as documented.  As an example (typed
directly into the Rgui console on WinXP):

R> f <- function() {
+ browser()
+ cat("I'm here!\n")
+ cat("I'm still here!\n")
+ }
R> f()
Called from: f()
Browse[1]> ## where to?
I'm here!
I'm still here!

which I think is what you saw.  However:

R> f()
Called from: f()
Browse[1]> n
debug: cat("I'm here!\n")
Browse[1]> ##
I'm here!
debug: cat("I'm still here!\n")
Browse[1]>
I'm still here!

From ?browser:

c
(or just return) exit the browser and continue execution at the next
statement. 
cont
synonym for c. 
n
enter the step-through debugger. This changes the meaning of c: see the
documentation for debug. 

My interpretation of this is that, if the first thing typed (or pasted) in
is something like a null statement (e.g., return, empty line, or comment),
it's the same as 'c', but if the null statement follows 'n', then it behaves
differently.

platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)


Andy


From: [EMAIL PROTECTED]
 
 I'm trying to step through some code using browser(), 
 executing one line at a time.
 Unfortunately, whenever I execute a comment line, the browser exits.
 
 I previously reported a similar problem with blank lines.
 
 These problems are a strong incentive to write poor code
 -- uncommented code with no blank lines to improve 
 readability -- so that I can use browser() without it exiting 
 at inconvenient times.
 
 Additional detail:  
 (1) I'm running R inside emacs, with R in one buffer and a 
 file containing code in another, using a macro to copy and 
 paste a line at a time from the file to the R buffer.
 (2) The browser() call is inside a function.  
 Right now the lines I'm sending to the browser are not part 
 of the function, though usually they are.
 
 
 --please do not edit the information below--
 
 Version:
  platform = i386-pc-mingw32
  arch = i386
  os = mingw32
  system = i386, mingw32
  status =
  major

Re: [Rd] Open .ssc .S ... files in R (PR#8690)

2006-03-17 Thread Tim Hesterberg
I think it would be good to make the change in the Mac gui too.
This would help people on the Mac who work on multiple platforms,
or try scripts from other people.


I forgot to mention one other extension, .t, an extension often
used for tests to be processed using do.test().  However, this one
is less common and could easily be excluded; people can use "All
Files" for it.

Thanks,
Tim

On 3/17/2006 2:19 PM, [EMAIL PROTECTED] wrote:
 - Quick summary:  
 
 In the File:Open dialog, please change
 "S files (*.q)"
 to
 "S files (*.q, *.ssc, *.S)"
 and show the corresponding files (including .SSC and .s files).

I'll make this change in the Windows Rgui.  Is this an issue in the Mac 
gui too?

Duncan Murdoch

 
 - Background
 This is motivated by the following query to R-help:
 
Date: Thu, 16 Mar 2006 22:44:11 -0600
From: xpRt.wannabe [EMAIL PROTECTED]
Subject: [R] Is there a way to view S-PLUS script files in R
To: r-help@stat.math.ethz.ch

Dear List,

I have some S-PLUS script files (.ssc).  Is there an R
function/command that can read such files?  I simply want to view the
code and practice in R to help me learn the subject matter.

Any help would be greatly appreciated.

platform i386-pc-mingw32
arch i386
os   mingw32
system   i386, mingw32
status
major2
minor2.1
year 2005
month12
day  20
svn rev  36812
language R
 
 I responded:
You can open them in R.  On Windows, File:Open Script, change
"Files of type" to "All Files", then open the .ssc file.
 
 So there is a workaround.  But it is odd that the "S files" option
 doesn't actually include what are probably the most common S files.
 
 Thanks,
 Tim Hesterberg
 
 
 --please do not edit the information below--
 
 Version:
  platform = i386-pc-mingw32
  arch = i386
  os = mingw32
  system = i386, mingw32
  status = 
  major = 2
  minor = 2.1
  year = 2005
  month = 12
  day = 20
  svn rev = 36812
  language = R
 
 Windows XP Professional (build 2600) Service Pack 2.0
 
 Locale:
 LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
 States.1252;LC_MONETARY=English_United 
 States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
 
 Search Path:
  .GlobalEnv, package:glmpath, package:survival, package:splines, 
 package:methods, package:stats, package:graphics, package:grDevices, 
 package:utils, package:datasets, Autoloads, package:base
 

