Re: [Rd] [R] "[.data.frame" and lapply

2009-03-28 Thread Wacek Kusnierczyk
Romain Francois wrote:
> Wacek Kusnierczyk wrote:
>> redirected to r-devel, because there are implementational details of
>> [.data.frame discussed here.  spoiler: at the bottom there is a fairly
>> interesting performance result.
>>
>> Romain Francois wrote:
>>  
>>> Hi,
>>>
>>> This is a bug I think. [.data.frame treats its arguments differently
>>> depending on the number of arguments.
>>> 
>>
>> you might want to hesitate a bit before you say that something in r is a
>> bug, if only because it drives certain people mad.  r is a carefully
>> tested software, and [.data.frame is such a basic function that if what
>> you talk about were a bug, it wouldn't have persisted until now.
>>   
> I did hesitate, and would be prepared to look the other way if someone
> shows me proper evidence that this makes sense.
>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > d[ j=1 ]
>    x  y  z
> 1   1  1  1
> 2   2  2  2
> 3   3  3  3
> 4   4  4  4
> 5   5  5  5
> 6   6  6  6
> 7   7  7  7
> 8   8  8  8
> 9   9  9  9
> 10 10 10 10
>
> "If a single index is supplied, it is interpreted as indexing the list
> of columns". Clearly this does not happen here, and this is because
> NextMethod gets confused.

obviously.  it seems that there is a bug here, and that it results from
the lack of clear design specification.

>
> I have not looked at your implementation in detail, but it misses array
> indexing, as in:

yes;  i didn't take it into consideration, but (still without detailed
analysis) i guess it should not be difficult to extend the code to
handle this.



>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > m <- cbind( 5:7, 1:3 )
> > m
>      [,1] [,2]
> [1,]    5    1
> [2,]    6    2
> [3,]    7    3
> > d[m]
> [1] 5 6 7
> > subdf( d, m )
> Error in subdf(d, m) : undefined columns selected

this should be easy to handle by checking if i is a matrix and then
indexing by its first column as i and the second as j.

>
> "Matrix indexing using '[' is not recommended, and barely
> supported.  For extraction, 'x' is first coerced to a matrix. For
> replacement a logical matrix (only) can be used to select the
> elements to be replaced in the same way as for a matrix."

yes, here's how it's done (original comment):

if(is.matrix(i))
    return(as.matrix(x)[i])  # desperate measures

and i can easily add this to my code, at virtually no additional expense.

it's probably not a good idea to convert x to a matrix, x would often be
much more data than the index matrix m, so it's presumably much more
efficient, on average, to fiddle with i instead.
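to make the "fiddle with i" idea concrete, here is a minimal sketch.  the helper name extract_cells is hypothetical; it is not part of subdf or of base r.  it assumes a two-column index matrix (row, column) and, as a design choice of this sketch only, stops on any other shape instead of flattening the matrix.

```r
# hypothetical helper, not part of subdf or base R
extract_cells <- function(d, m) {
  stopifnot(is.data.frame(d), is.matrix(m))
  if (ncol(m) != 2L)
    stop("index matrix must have exactly two columns (row, column)")
  # pick each cell from its own column, so d is never coerced to a matrix
  cells <- Map(function(r, k) d[[k]][[r]], m[, 1], m[, 2])
  # unlist() still coerces to a common type if the picked cells differ in type
  unlist(cells, use.names = FALSE)
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(5:7, 1:3)
extract_cells(d, m)
```

only the columns actually named in m are touched, so a character column elsewhere in d does not change the type of values picked from integer columns.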

there are some potentially confusing issues here:

m = cbind(8:10, 1:3)
   
d[m]
# 3-element vector, as you could expect

d[t(m)]
# 6-element vector

t(m) has dimensionality inappropriate for matrix indexing (it has 3
columns), so it gets flattened into a vector;  however, it does not work
like in the case of a single vector index where columns would be selected:

d[as.vector(t(m))]
# error: undefined columns selected

i think it would be more appropriate to raise an error in a case like
d[t(m)].

furthermore, if a matrix is used in a two-index form, the matrix is
flattened again and is used to select rows (not elements, as in
d[t(m)]).  note also that the help page says that "for extraction, 'x'
is first coerced to a matrix".  it fails to explain that if *two*
indices are used of which at least one is a matrix, no coercion is
done.  that is, the matrix is again flattened into a vector, but here
[.data.frame forgets that it was a matrix (unlike in d[t(m)]):

is(d[m])
# a character vector, matrix indexing

is(d[t(m)])
# a character vector, vector indexing of elements, not columns

is(d[m,])
# a data frame, row indexing
   
and finally, the fact that d[m] converts x (i.e., d) to a matrix before
the indexing means that the values in some columns of d may get coerced
to another type:

d[,2] = as.character(d[,2])
is(d[,1])
# integer vector
is(d[,2])
# character vector

is(d[1:2, 1])
# integer vector
is(d[cbind(1:2, 1)])
# character vector


for what it's worth, i think matrix indexing of data frames should be
dropped:

d[m]
# error: ...

and if one needs it, it's as simple as

as.matrix(d)[m]

where the conversion of d to a matrix is explicit.

on the side, [.data.frame is able to index matrices:

'[.data.frame'(as.matrix(d), m)
# same as as.matrix(d)[m]

which is, so to speak, nonsense, since '[.data.frame' is designed
specifically to handle data frames;  i'd expect an error to be raised
here (or a warning, at the very least).

to summarize, the fact that subdf does not handle matrix indices is not
an issue.  anyway, thanks for the comment!

best,
vQ

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [R] "[.data.frame" and lapply

2009-03-28 Thread Romain Francois

Wacek Kusnierczyk wrote:
> redirected to r-devel, because there are implementational details of
> [.data.frame discussed here.  spoiler: at the bottom there is a fairly
> interesting performance result.
>
> Romain Francois wrote:
>> Hi,
>>
>> This is a bug I think. [.data.frame treats its arguments differently
>> depending on the number of arguments.
>
> you might want to hesitate a bit before you say that something in r is a
> bug, if only because it drives certain people mad.  r is a carefully
> tested software, and [.data.frame is such a basic function that if what
> you talk about were a bug, it wouldn't have persisted until now.

I did hesitate, and would be prepared to look the other way if someone
shows me proper evidence that this makes sense.


> d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> d[ j=1 ]
   x  y  z
1   1  1  1
2   2  2  2
3   3  3  3
4   4  4  4
5   5  5  5
6   6  6  6
7   7  7  7
8   8  8  8
9   9  9  9
10 10 10 10

"If a single index is supplied, it is interpreted as indexing the list 
of columns". Clearly this does not happen here, and this is because 
NextMethod gets confused.


I have not looked at your implementation in detail, but it misses array
indexing, as in:


> d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> m <- cbind( 5:7, 1:3 )
> m
     [,1] [,2]
[1,]    5    1
[2,]    6    2
[3,]    7    3
> d[m]
[1] 5 6 7
> subdf( d, m )
Error in subdf(d, m) : undefined columns selected

"Matrix indexing using '[' is not recommended, and barely
supported.  For extraction, 'x' is first coerced to a matrix. For
replacement a logical matrix (only) can be used to select the
elements to be replaced in the same way as for a matrix."

You might also want to look at `[<-.data.frame`.

> d[j=2] <- 1:10
Error in `[<-.data.frame`(`*tmp*`, j = 2, value = 1:10) :
 element 1 is empty;
  the part of the args list of 'is.logical' being evaluated was:
  (i)
> d[2] <- 10:1
> d
   x  y  z
1   1 10  1
2   2  9  2
3   3  8  3
4   4  7  4
5   5  6  5
6   6  5  6
7   7  4  7
8   8  3  8
9   9  2  9
10 10  1 10

This is probably less of an issue, because there is very little chance 
for people to use this construct, but for the first one, if not used 
directly, it still has good chances to be used within some fooapply 
call, as in the original post. Although it might have been preferable to 
use subset as the applied function.


Romain


Re: [Rd] [R] "[.data.frame" and lapply

2009-03-27 Thread Wacek Kusnierczyk
redirected to r-devel, because there are implementational details of
[.data.frame discussed here.  spoiler: at the bottom there is a fairly
interesting performance result.

Romain Francois wrote:
>
> Hi,
>
> This is a bug I think. [.data.frame treats its arguments differently
> depending on the number of arguments.

you might want to hesitate a bit before you say that something in r is a
bug, if only because it drives certain people mad.  r is a carefully
tested software, and [.data.frame is such a basic function that if what
you talk about were a bug, it wouldn't have persisted until now.

treating the arguments differently depending on their number is actually
(if clearly...) documented:  if there is one index (the 'i'), it selects
columns.  if there are two, 'i' selects rows.

however, not all seems fine, there might be a design flaw:

# dummy data frame
d = structure(names=paste('col', 1:3, sep='.'),
data.frame(row.names=paste('row', 1:3, sep='.'),
   matrix(1:9, 3, 3)))

d[1:2]
# correctly selects two first columns
# 1:2 passed to [.data.frame as i, no j given

d[,1:2]
# correctly selects two first columns
# 1:2 passed to [.data.frame as j, i given the missing argument
# value (note the comma)

d[,i=1:2]
# correctly selects two first rows
# 1:2 passed to [.data.frame as i, j given the missing argument
# value (note the comma)

d[j=1:2,]
# correctly selects two first columns
# 1:2 passed to [.data.frame as j, i given the missing argument
# value (note the comma)

d[i=1:2]
# correctly (arguably) selects the first two columns
# 1:2 passed to [.data.frame as i, no j given
  
d[j=1:2]
# wrong: returns the whole data frame
# does not recognize the index as i because it is explicitly named 'j'
# does not recognize the index as j because there is only one index

i say this *might* be a design flaw because it's hard to judge what the
design really is.  the r language definition (!) [1, sec. 3.4.3 p. 18] says:

"   The most important example of a class method for [ is that used for
data frames. It is not be described in detail here (see the help page for
[.data.frame, but in broad terms, if two indices are supplied (even if
one is empty) it creates matrix-like indexing for a structure that is
basically a list of vectors of the same length. If a single index is
supplied, it is interpreted as
indexing the list of columns—in that case the drop argument is ignored,
with a warning."

it does not say what happens when only one *named* index argument is
given.  from the above, it would indeed seem that there is a *bug*
here:  in the last example above only one index is given, and yet
columns are not selected, even though the *language definition* says
they should.  (so it's not a documented feature, it's a
contra-definitional misfeature -- a bug?)

somewhat on the side, the 'matrix-like indexing' above is fairly
misleading;  just try the same patterns of indexing -- one index, two
indices, named indices -- on a data frame and a matrix of the same shape:

m = matrix(1:9, 3, 3)
md = data.frame(m)

md[1]
# the first column
m[1]
# the first element (i.e., m[1,1])

md[,i=3]
# third row
m[,i=3]
# third column


the quote above refers to ?'[.data.frame' for details.
unfortunately, the help page is a lump of explanations for various
'['-like operators, and it is *not* a definition of any sort.  it does
not provide much more detail on '[.data.frame' -- it hardly serves as a
design specification.  in particular, it does not explain the issue of
named arguments to '[.data.frame' at all.


> `[.data.frame` only is called with two arguments in the second case, so
> the following condition is true:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>
> And then, the function assumes the argument it has been passed is i, and
> eventually calls NextMethod("[") which I think calls
> `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
> passed to `[.listof`, so you have something equivalent to as.list(d)[].
>
> I think we can replace the condition with this one:
>
> if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing
>
> or this:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>    if(has.j) i <- j
>


indeed, for a moment i thought a trivial fix somewhere there would
suffice.  unfortunately, the code for [.data.frame [2, lines 500-641] is
so clean and readable that i had to give up reading it, forget fixing. 
instead, i wrote a new version of '[.data.frame' from scratch.  it
fixes (or at least seems to fix, as far as my quick assessment goes) the
problem.  the function subdf (see the attached dataframe.r) is the new
version of '[.data.frame':

# dummy data frame
d = structure(names=paste('col', 1:3, sep='.'),
data.frame(row.names=paste('row', 1:3, sep='.'),
   matrix(1:9, 3, 3)))

d[j=1:2]
# incorrect: 

Re: [Rd] [R] "[.data.frame" and lapply

2009-03-26 Thread Romain Francois

[moving this from R-help to R-devel]

Hi,

Right, so when you call `[`, the dispatch is made internally :

> d <- data.frame( x = 1:5, y = rnorm(5), z = rnorm(5) )
> trace( `[.data.frame` )
> d[ , 1:2]   # ensuring the 1:2 is passed to j and the i is passed as missing

Tracing `[.data.frame`(d, , 1:2) on entry
 x   y
1 1  0.98946922
2 2  0.05323895
3 3 -0.21803664
4 4 -0.47607043
5 5  1.23366151

> d[ 1:2] # only one argument, so it goes in i
Tracing `[.data.frame`(d, 1:2) on entry
 x   y
1 1  0.98946922
2 2  0.05323895
3 3 -0.21803664
4 4 -0.47607043
5 5  1.23366151

But that does not explain why this is happening:

> d[ i = 1:2]
Tracing `[.data.frame`(d, i = 1:2) on entry
 x   y
1 1  0.98946922
2 2  0.05323895
3 3 -0.21803664
4 4 -0.47607043
5 5  1.23366151

> d[ j = 1:2]
Tracing `[.data.frame`(d, j = 1:2) on entry
 x   y  z
1 1  0.98946922 -0.5233134
2 2  0.05323895  1.3646683
3 3 -0.21803664 -0.4998344
4 4 -0.47607043 -1.8849618
5 5  1.23366151  0.6723562

Arguments are dispatched to `[.data.frame` with their names, and
`[.data.frame` gets confused. I'm not suggesting that named arguments be
allowed, because that already works; what does not work is how
`[.data.frame` treats them, and that needs to be changed. This is a bug.


Romain

> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status         Under development (unstable)
major          2
minor          9.0
year           2009
month          03
day            09
svn rev        48093
language       R
version.string R version 2.9.0 Under development (unstable) (2009-03-09 r48093)





baptiste auguie wrote:

Hi,

I got an off-line clarification from Martin Morgan which makes me 
believe it's not a bug (admittedly, I was close to suggesting it before).


Basically, "[" is a .Primitive, for which the help page says,


    The advantage of .Primitive over .Internal functions is the
    potential efficiency of argument passing. However, this is done by
    ignoring argument names and using positional matching of arguments
    (unless arranged differently for specific primitives such as rep),
    so this is discouraged for functions of more than one argument.


This explains why in my tests the argument names i and j were 
completely ignored and only the number and order of arguments changed 
the result. 
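As a small illustration of the positional matching described in that help text, the sketch below compares a plain matrix, where the internal `[` ignores index names, with a data frame, where `[.data.frame` is an ordinary R function and matches i and j by name. The objects m and md are made up for the example, and suppressWarnings() is there on the assumption of a recent R version, which warns about named indices on data frames.

```r
m  <- matrix(1:9, 3, 3)
md <- data.frame(m)

# internal `[` on a matrix: index names are ignored, matching is positional
m[j = 2, i = 1]      # same as m[2, 1], not m[1, 2]
m[, i = 3]           # same as m[, 3], i.e. the third column

# `[.data.frame` is a closure, so i and j are matched by name
r <- suppressWarnings(md[, i = 3])   # the third row, not the third column
dim(r)
```

So the names are ignored only when the internal default method does the indexing; once dispatch reaches `[.data.frame`, they matter.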

I've learnt my lesson here, but I wonder what could be done to make 
this discovery easier for others:


- add a note in the documentation of each .Primitive function (at 
least a link to ?.Primitive)


- add such an example in lapply (all examples are for named arguments)

- echo a warning if trying to pass named arguments to a .Primitive

- allow for named arguments as you suggest

I'm not sure the last two would be possible without some cost in 
efficiency.



Many thanks,

baptiste




On 26 Mar 2009, at 07:46, Romain Francois wrote:



Hi,

This is a bug I think. [.data.frame treats its arguments differently
depending on the number of arguments.


d <- data.frame(x = rnorm(5), y = rnorm(5), z = rnorm(5) )
d[, 1:2]

x   y
1   0.45141341  0.03943654
2  -0.87954548  1.83690210
3  -0.91083710  0.22758584
4   0.06924279  1.26799176
5  -0.20477052 -0.25873225

base:::`[.data.frame`( d, j=1:2)

x   y  z
1   0.45141341  0.03943654 -0.8971957
2  -0.87954548  1.83690210  0.9083281
3  -0.91083710  0.22758584 -0.3104906
4   0.06924279  1.26799176  1.2625699
5  -0.20477052 -0.25873225  0.5228342
but also:

d[ j=1:2]

   x   y  z
1  0.45141341  0.03943654 -0.8971957
2 -0.87954548  1.83690210  0.9083281
3 -0.91083710  0.22758584 -0.3104906
4  0.06924279  1.26799176  1.2625699
5 -0.20477052 -0.25873225  0.5228342

`[.data.frame` only is called with two arguments in the second case, so
the following condition is true:

if(Narg < 3L) {  # list-like indexing or matrix indexing

And then, the function assumes the argument it has been passed is i, and
eventually calls NextMethod("[") which I think calls
`[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
passed to `[.listof`, so you have something equivalent to as.list(d)[].
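To see what a function with the same formals as `[.data.frame` actually receives in each case, here is a toy probe. The name probe is hypothetical and the function does no indexing; it only reports nargs() and which of i and j are missing.

```r
# toy stand-in with the same formals as `[.data.frame` (minus drop)
probe <- function(x, i, j) {
  list(narg = nargs(), has.i = !missing(i), has.j = !missing(j))
}

d <- data.frame(x = 1:5, y = 1:5, z = 1:5)

str(probe(d, 1:2))       # two arguments, i supplied
str(probe(d, , 1:2))     # three arguments (the blank counts), j supplied
str(probe(d, j = 1:2))   # two arguments, only j supplied
```

The last call is the problematic one: fewer than three arguments, so the list-like branch is taken, yet i is missing.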

I think we can replace the condition with this one:

if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing

or this:

if(Narg < 3L) {  # list-like indexing or matrix indexing
   if(has.j) i <- j


`[.data.frame`(d, j=1:2)

   x   y
1  0.45141341  0.03943654
2 -0.87954548  1.83690210
3 -0.91083710  0.22758584
4  0.06924279  1.26799176
5 -0.20477052 -0.25873225

However, we would still have this, which is expected (same as d[1:2] ):


`[.data.frame`(d, i=1:2)

   x   y
1  0.45141341  0.03943654
2 -0.87954548  1.83690210
3 -0.91083710  0.22758584
4  0.06924279  1.26799176
5 -0.20477052 -0.25873225
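The second variant can be tried without touching base R. The wrapper below is only a sketch: the name pick is hypothetical, it is not the patched `[.data.frame`, and it leaves out matrix indexing and the drop argument.

```r
# hypothetical wrapper illustrating the "if(has.j) i <- j" idea
pick <- function(x, i, j) {
  if (nargs() < 3L) {            # list-like indexing
    if (!missing(j)) i <- j      # a lone j is treated as the column index
    if (missing(i)) return(x)
    return(x[i])
  }
  # matrix-like indexing, with explicit handling of a blank i or j
  if (missing(i)) x[, j, drop = FALSE]
  else if (missing(j)) x[i, , drop = FALSE]
  else x[i, j, drop = FALSE]
}

d <- data.frame(x = 1:5, y = 6:10, z = 11:15)
names(pick(d, j = 1:2))   # columns x and y
names(pick(d, i = 1:2))   # same as d[1:2]
nrow(pick(d, 1:2, ))      # two rows
```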

Romain

baptiste auguie wrote:

Dear all,


Trying to extract a few rows for each element of a list of
dat