Re: [Rd] [R] "[.data.frame" and lapply
Romain Francois wrote:
> Wacek Kusnierczyk wrote:
>> redirected to r-devel, because there are implementational details of
>> [.data.frame discussed here. spoiler: at the bottom there is a fairly
>> interesting performance result.
>>
>> Romain Francois wrote:
>>
>>> Hi,
>>>
>>> This is a bug I think. [.data.frame treats its arguments differently
>>> depending on the number of arguments.
>>>
>>
>> you might want to hesitate a bit before you say that something in r is a
>> bug, if only because it drives certain people mad. r is a carefully
>> tested software, and [.data.frame is such a basic function that if what
>> you talk about were a bug, it wouldn't have persisted until now.
>>
> I did hesitate, and would be prepared to look the other way if someone
> shows me proper evidence that this makes sense.
>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > d[ j=1 ]
>     x  y  z
> 1   1  1  1
> 2   2  2  2
> 3   3  3  3
> 4   4  4  4
> 5   5  5  5
> 6   6  6  6
> 7   7  7  7
> 8   8  8  8
> 9   9  9  9
> 10 10 10 10
>
> "If a single index is supplied, it is interpreted as indexing the list
> of columns". Clearly this does not happen here, and this is because
> NextMethod gets confused.

obviously. it seems that there is a bug here, and that it results from
the lack of a clear design specification.

>
> I have not looked at your implementation in detail, but it misses array
> indexing, as in:

yes; i didn't take it into consideration, but (still without detailed
analysis) i guess it should not be difficult to extend the code to
handle this.

>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > m <- cbind( 5:7, 1:3 )
> > m
>      [,1] [,2]
> [1,]    5    1
> [2,]    6    2
> [3,]    7    3
> > d[m]
> [1] 5 6 7
> > subdf( d, m )
> Error in subdf(d, m) : undefined columns selected

this should be easy to handle by checking if i is a matrix and then
indexing by its first column as i and the second as j.

>
> "Matrix indexing using '[' is not recommended, and barely
> supported. For extraction, 'x' is first coerced to a matrix. For
> replacement a logical matrix (only) can be used to select the
> elements to be replaced in the same way as for a matrix."

yes, here's how it's done (original comment):

    if(is.matrix(i))
        return(as.matrix(x)[i])  # desperate measures

and i can easily add this to my code, at virtually no additional
expense. it's probably not a good idea to convert x to a matrix: x would
often be much more data than the index matrix m, so it's presumably much
more efficient, on average, to fiddle with i instead.

there are some potentially confusing issues here:

    m = cbind(8:10, 1:3)
    d[m]     # 3-element vector, as you could expect
    d[t(m)]  # 6-element vector

t(m) has dimensionality inappropriate for matrix indexing (it has 3
columns), so it gets flattened into a vector; however, it does not work
like in the case of a single vector index where columns would be
selected:

    d[as.vector(t(m))]  # error: undefined columns selected

i think it would be more appropriate to raise an error in a case like
d[t(m)].

furthermore, if a matrix is used in a two-index form, the matrix is
flattened again and is used to select rows (not elements, as in
d[t(m)]). note also that the help page says that "for extraction, 'x' is
first coerced to a matrix". it fails to explain that if *two* indices
are used of which at least one is a matrix, no coercion is done. that
is, the matrix is again flattened into a vector, but here [.data.frame
forgets that it was a matrix (unlike in d[t(m)]):

    is(d[m])     # a character vector, matrix indexing
    is(d[t(m)])  # a character vector, vector indexing of elements, not columns
    is(d[m,])    # a data frame, row indexing

and finally, the fact that d[m] in fact converts x (i.e., d) to a matrix
before the indexing means that the types of values in some columns in d
may get coerced to another type:

    d[,2] = as.character(d[,2])
    is(d[,1])             # integer vector
    is(d[,2])             # character vector
    is(d[1:2, 1])         # integer vector
    is(d[cbind(1:2, 1)])  # character vector

for what it's worth, i think matrix indexing of data frames should be
dropped:

    d[m]  # error: ...

and if one needs it, it's as simple as

    as.matrix(d)[m]

where the conversion of d to a matrix is explicit.

on the side, [.data.frame is able to index matrices:

    '[.data.frame'(as.matrix(d), m)  # same as as.matrix(d)[m]

which is, so to speak, nonsense, since '[.data.frame' is designed
specifically to handle data frames; i'd expect an error to be raised
here (or a warning, at the very least).

to summarize, the fact that subdf does not handle matrix indices is not
an issue. anyway, thanks for the comment!

best,
vQ

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
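The type coercion described in the message above can be sketched as follows. This is only an illustration, assuming a current R in which `[.data.frame` still handles a matrix index via as.matrix(x)[i], as quoted in the thread; the names by_rows_cols, by_matrix and explicit are hypothetical and do not appear in the original posts.

```r
# Illustrative sketch: d mirrors the data frame used in the thread.
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
d[, 2] <- as.character(d[, 2])   # make one column non-numeric

# Two-column index matrix: each row is a (row, column) pair.
m <- cbind(1:2, 1)

by_rows_cols <- d[1:2, 1]        # two-index form: the column keeps its own type
by_matrix    <- d[m]             # matrix index: d is coerced to a matrix first
explicit     <- as.matrix(d)[m]  # the same extraction with the coercion spelled out

class(by_rows_cols)              # type of the integer column is preserved
class(by_matrix)                 # coerced, because the matrix must hold one type
identical(by_matrix, explicit)   # both routes give the same result
```

Writing as.matrix(d)[m] at the call site makes the coercion visible to the reader, which is the alternative suggested in the message.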
Re: [Rd] [R] "[.data.frame" and lapply
Wacek Kusnierczyk wrote:
> redirected to r-devel, because there are implementational details of
> [.data.frame discussed here. spoiler: at the bottom there is a fairly
> interesting performance result.
>
> Romain Francois wrote:
>
>> Hi,
>>
>> This is a bug I think. [.data.frame treats its arguments differently
>> depending on the number of arguments.
>>
>
> you might want to hesitate a bit before you say that something in r is a
> bug, if only because it drives certain people mad. r is a carefully
> tested software, and [.data.frame is such a basic function that if what
> you talk about were a bug, it wouldn't have persisted until now.
>

I did hesitate, and would be prepared to look the other way if someone
shows me proper evidence that this makes sense.

> d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> d[ j=1 ]
    x  y  z
1   1  1  1
2   2  2  2
3   3  3  3
4   4  4  4
5   5  5  5
6   6  6  6
7   7  7  7
8   8  8  8
9   9  9  9
10 10 10 10

"If a single index is supplied, it is interpreted as indexing the list
of columns". Clearly this does not happen here, and this is because
NextMethod gets confused.

I have not looked at your implementation in detail, but it misses array
indexing, as in:

> d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> m <- cbind( 5:7, 1:3 )
> m
     [,1] [,2]
[1,]    5    1
[2,]    6    2
[3,]    7    3
> d[m]
[1] 5 6 7
> subdf( d, m )
Error in subdf(d, m) : undefined columns selected

"Matrix indexing using '[' is not recommended, and barely
supported. For extraction, 'x' is first coerced to a matrix. For
replacement a logical matrix (only) can be used to select the
elements to be replaced in the same way as for a matrix."

You might also want to look at `[<-.data.frame`.

> d[j=2] <- 1:10
Error in `[<-.data.frame`(`*tmp*`, j = 2, value = 1:10) :
  element 1 is empty;
   the part of the args list of 'is.logical' being evaluated was:
   (i)
> d[2] <- 10:1
> d
    x  y  z
1   1 10  1
2   2  9  2
3   3  8  3
4   4  7  4
5   5  6  5
6   6  5  6
7   7  4  7
8   8  3  8
9   9  2  9
10 10  1 10

This is probably less of an issue, because there is very little chance
that people will use this construct. The first one, however, even if not
used directly, still has a good chance of being used within some
fooapply call, as in the original post, although it might have been
preferable to use subset as the applied function.

Romain

> treating the arguments differently depending on their number is actually
> (if clearly...) documented: if there is one index (the 'i'), it selects
> columns. if there are two, 'i' selects rows.
>
> however, not all seems fine, there might be a design flaw:
>
>     # dummy data frame
>     d = structure(names=paste('col', 1:3, sep='.'),
>        data.frame(row.names=paste('row', 1:3, sep='.'),
>           matrix(1:9, 3, 3)))
>
>     d[1:2]
>     # correctly selects two first columns
>     # 1:2 passed to [.data.frame as i, no j given
>
>     d[,1:2]
>     # correctly selects two first columns
>     # 1:2 passed to [.data.frame as j, i given the missing argument
>     # value (note the comma)
>
>     d[,i=1:2]
>     # correctly selects two first rows
>     # 1:2 passed to [.data.frame as i, j given the missing argument
>     # value (note the comma)
>
>     d[j=1:2,]
>     # correctly selects two first columns
>     # 1:2 passed to [.data.frame as j, i given the missing argument
>     # value (note the comma)
>
>     d[i=1:2]
>     # correctly (arguably) selects the first two columns
>     # 1:2 passed to [.data.frame as i, no j given
>
>     d[j=1:2]
>     # wrong: returns the whole data frame
>     # does not recognize the index as i because it is explicitly named 'j'
>     # does not recognize the index as j because there is only one index
>
> i say this *might* be a design flaw because it's hard to judge what the
> design really is. the r language definition (!) [1, sec. 3.4.3 p. 18] says:
>
> "
> The most important example of a class method for [ is that used for data
> frames. It is not be described in detail here (see the help page for
> [.data.frame, but in broad terms, if two indices are supplied (even if
> one is empty) it creates matrix-like indexing for a structure that is
> basically a list of vectors of the same length. If a single index is
> supplied, it is interpreted as indexing the list of columns—in that case
> the drop argument is ignored, with a warning."
>
> it does not say what happens when only one *named* index argument is
> given. from the above, it would indeed seem that there is a *bug*
> here: in the last example above only one index is given, and yet
> columns are not selected, even though the *language definition* says
> they should. (so it's not a documented feature, it's a
> contra-definitional misfeature -- a bug?)
>
> somewhat on the side, the 'matrix-like indexing' above is fairly
> misleading; just try the same patterns of indexing -- one index, two
> indices, named indices -- on a data frame and a matrix of the same shape:
>
>     m = matrix(1:9, 3, 3)
>     md = data.frame(m)
>
>     md[1]
>     # the first column
>     m[1]
>     # the first element (i.e., m[
Re: [Rd] [R] "[.data.frame" and lapply
redirected to r-devel, because there are implementational details of
[.data.frame discussed here. spoiler: at the bottom there is a fairly
interesting performance result.

Romain Francois wrote:
>
> Hi,
>
> This is a bug I think. [.data.frame treats its arguments differently
> depending on the number of arguments.

you might want to hesitate a bit before you say that something in r is a
bug, if only because it drives certain people mad. r is a carefully
tested software, and [.data.frame is such a basic function that if what
you talk about were a bug, it wouldn't have persisted until now.

treating the arguments differently depending on their number is actually
(if clearly...) documented: if there is one index (the 'i'), it selects
columns. if there are two, 'i' selects rows.

however, not all seems fine, there might be a design flaw:

    # dummy data frame
    d = structure(names=paste('col', 1:3, sep='.'),
       data.frame(row.names=paste('row', 1:3, sep='.'),
          matrix(1:9, 3, 3)))

    d[1:2]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as i, no j given

    d[,1:2]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as j, i given the missing argument
    # value (note the comma)

    d[,i=1:2]
    # correctly selects two first rows
    # 1:2 passed to [.data.frame as i, j given the missing argument
    # value (note the comma)

    d[j=1:2,]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as j, i given the missing argument
    # value (note the comma)

    d[i=1:2]
    # correctly (arguably) selects the first two columns
    # 1:2 passed to [.data.frame as i, no j given

    d[j=1:2]
    # wrong: returns the whole data frame
    # does not recognize the index as i because it is explicitly named 'j'
    # does not recognize the index as j because there is only one index

i say this *might* be a design flaw because it's hard to judge what the
design really is. the r language definition (!) [1, sec. 3.4.3 p. 18] says:

"
The most important example of a class method for [ is that used for data
frames. It is not be described in detail here (see the help page for
[.data.frame, but in broad terms, if two indices are supplied (even if
one is empty) it creates matrix-like indexing for a structure that is
basically a list of vectors of the same length. If a single index is
supplied, it is interpreted as indexing the list of columns—in that case
the drop argument is ignored, with a warning."

it does not say what happens when only one *named* index argument is
given. from the above, it would indeed seem that there is a *bug*
here: in the last example above only one index is given, and yet
columns are not selected, even though the *language definition* says
they should. (so it's not a documented feature, it's a
contra-definitional misfeature -- a bug?)

somewhat on the side, the 'matrix-like indexing' above is fairly
misleading; just try the same patterns of indexing -- one index, two
indices, named indices -- on a data frame and a matrix of the same shape:

    m = matrix(1:9, 3, 3)
    md = data.frame(m)

    md[1]
    # the first column
    m[1]
    # the first element (i.e., m[1,1])

    md[,i=3]
    # third row
    m[,i=3]
    # third column

the quote above refers to ?'[.data.frame' for details.
unfortunately, the help page is a lump of explanations for various
'['-like operators, and it is *not* a definition of any sort. it does
not provide much more detail on '[.data.frame' -- it is hardly a
design specification. in particular, it does not explain the issue of
named arguments to '[.data.frame' at all.

> `[.data.frame` only is called with two arguments in the second case, so
> the following condition is true:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>
> And then, the function assumes the argument it has been passed is i, and
> eventually calls NextMethod("[") which I think calls
> `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
> passed to `[.listof`, so you have something equivalent to as.list(d)[].
>
> I think we can replace the condition with this one:
>
> if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing
>
> or this:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>    if(has.j) i <- j
>

indeed, for a moment i thought a trivial fix somewhere there would
suffice. unfortunately, the code for [.data.frame [2, lines 500-641] is
so clean and readable that i had to give up reading it, forget fixing.
instead, i wrote a new version of '[.data.frame' from scratch. it
fixes (or at least seems to fix, as far as my quick assessment goes) the
problem. the function subdf (see the attached dataframe.r) is the new
version of '[.data.frame':

    # dummy data frame
    d = structure(names=paste('col', 1:3, sep='.'),
       data.frame(row.names=paste('row', 1:3, sep='.'),
          matrix(1:9, 3, 3)))

    d[j=1:2]
    # incorrect:
Re: [Rd] [R] "[.data.frame" and lapply
[moving this from R-help to R-devel]

Hi,

Right, so when you call `[`, the dispatch is made internally:

> d <- data.frame( x = 1:5, y = rnorm(5), z = rnorm(5) )
> trace( `[.data.frame` )
> d[ , 1:2] # ensuring the 1:2 is passed to j and the i is passed as missing
Tracing `[.data.frame`(d, , 1:2) on entry
  x           y
1 1  0.98946922
2 2  0.05323895
3 3 -0.21803664
4 4 -0.47607043
5 5  1.23366151
> d[ 1:2] # only one argument, so it goes in i
Tracing `[.data.frame`(d, 1:2) on entry
  x           y
1 1  0.98946922
2 2  0.05323895
3 3 -0.21803664
4 4 -0.47607043
5 5  1.23366151

But that does not explain why this is happening:

> d[ i = 1:2]
Tracing `[.data.frame`(d, i = 1:2) on entry
  x           y
1 1  0.98946922
2 2  0.05323895
3 3 -0.21803664
4 4 -0.47607043
5 5  1.23366151
> d[ j = 1:2]
Tracing `[.data.frame`(d, j = 1:2) on entry
  x           y          z
1 1  0.98946922 -0.5233134
2 2  0.05323895  1.3646683
3 3 -0.21803664 -0.4998344
4 4 -0.47607043 -1.8849618
5 5  1.23366151  0.6723562

Arguments are dispatched to `[.data.frame` with their names, and
`[.data.frame` gets confused.

I'm not suggesting allowing named arguments, because that already works;
what does not work is how `[.data.frame` treats them. That needs to be
changed: this is a bug.

Romain

> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status         Under development (unstable)
major          2
minor          9.0
year           2009
month          03
day            09
svn rev        48093
language       R
version.string R version 2.9.0 Under development (unstable) (2009-03-09 r48093)

baptiste auguie wrote:
> Hi,
>
> I got an off-line clarification from Martin Morgan which makes me
> believe it's not a bug (admittedly, I was close to suggesting it
> before).
>
> Basically, "[" is a .Primitive, for which the help page says,
>
>   The advantage of .Primitive over .Internal functions is the
>   potential efficiency of argument passing. However, this is done by
>   ignoring argument names and using positional matching of arguments
>   (unless arranged differently for specific primitives such as rep),
>   so this is discouraged for functions of more than one argument.
>
> This explains why in my tests the argument names i and j were
> completely ignored and only the number and order of arguments changed
> the result.
>
> I've learnt my lesson here, but I wonder what could be done to make
> this discovery easier for others:
>
> - add a note in the documentation of each .Primitive function (at
>   least a link to ?.Primitive)
>
> - add such an example in lapply (all examples are for named arguments)
>
> - echo a warning if trying to pass named arguments to a .Primitive
>
> - allow for named arguments as you suggest
>
> I'm not sure the last two would be possible without some cost in
> efficiency.
>
> Many thanks,
>
> baptiste
>
> On 26 Mar 2009, at 07:46, Romain Francois wrote:
>
>> Hi,
>>
>> This is a bug I think. [.data.frame treats its arguments differently
>> depending on the number of arguments.
>>
>> d <- data.frame(x = rnorm(5), y = rnorm(5), z = rnorm(5) )
>> d[, 1:2]
>>             x           y
>> 1  0.45141341  0.03943654
>> 2 -0.87954548  1.83690210
>> 3 -0.91083710  0.22758584
>> 4  0.06924279  1.26799176
>> 5 -0.20477052 -0.25873225
>> base:::`[.data.frame`( d, j=1:2)
>>             x           y          z
>> 1  0.45141341  0.03943654 -0.8971957
>> 2 -0.87954548  1.83690210  0.9083281
>> 3 -0.91083710  0.22758584 -0.3104906
>> 4  0.06924279  1.26799176  1.2625699
>> 5 -0.20477052 -0.25873225  0.5228342
>>
>> but also:
>>
>> d[ j=1:2]
>>             x           y          z
>> 1  0.45141341  0.03943654 -0.8971957
>> 2 -0.87954548  1.83690210  0.9083281
>> 3 -0.91083710  0.22758584 -0.3104906
>> 4  0.06924279  1.26799176  1.2625699
>> 5 -0.20477052 -0.25873225  0.5228342
>>
>> `[.data.frame` only is called with two arguments in the second case, so
>> the following condition is true:
>>
>> if(Narg < 3L) {  # list-like indexing or matrix indexing
>>
>> And then, the function assumes the argument it has been passed is i, and
>> eventually calls NextMethod("[") which I think calls
>> `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
>> passed to `[.listof`, so you have something equivalent to as.list(d)[].
>>
>> I think we can replace the condition with this one:
>>
>> if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing
>>
>> or this:
>>
>> if(Narg < 3L) {  # list-like indexing or matrix indexing
>>    if(has.j) i <- j
>>
>> `[.data.frame`(d, j=1:2)
>>             x           y
>> 1  0.45141341  0.03943654
>> 2 -0.87954548  1.83690210
>> 3 -0.91083710  0.22758584
>> 4  0.06924279  1.26799176
>> 5 -0.20477052 -0.25873225
>>
>> However, we would still have this, which is expected (same as d[1:2] ):
>>
>> `[.data.frame`(d, i=1:2)
>>             x           y
>> 1  0.45141341  0.03943654
>> 2 -0.87954548  1.83690210
>> 3 -0.91083710  0.22758584
>> 4  0.06924279  1.26799176
>> 5 -0.20477052 -0.25873225
>>
>> Romain
>>
>> baptiste auguie wrote:
>>> Dear all,
>>>
>>> Trying to extract a few rows for each element of a list of dat