Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
Dear Colleagues,
I am grateful to all of you for helping me with my question: how to write R
code that identifies the first row of each ID within a data frame and creates
a variable first that is 1 for the first row and 0 for all repeats of the ID.
WOW!!! I just saw Boris Steipe's answer to my question:
olddata$first <- as.numeric(! duplicated(olddata$ID))
The solution is elegant, short, easy to understand, and it uses base R! All
important characteristics of a good solution, at least for me. While I want to
learn solutions using packages that extend base R, I believe that a good
programmer first learns how to do something using the base language and, once
that is learned, explores ways to solve a programming problem using advanced
packages.
Each and every one of you (I hope I did not miss anyone in my list of email
addresses) took the time to read my emails and respond to me. Your collective
help is invaluable, and I am in your collective debt.
Many, many thanks,
John
John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical
Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of
Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;
Division of Gerontology and Palliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
John,
Thanks for enlightening us so we better understand.
I won't argue with your wish to learn to do things in base R first. I started
that way myself, and found that many of the commands did not fit easily into a
single worldview. Many functions I read about were promptly forgotten,
especially those without great documentation or enough examples of real-world
usage. This is why some packages that came later are important: they generally
try to offer a reasonably consistent set of tools that are often also faster
and more flexible.
Packages are usually created to meet real needs, and some of those needs are
subtle. Original R was often inconsistent in the order of command arguments,
while dplyr and the other tidyverse commands try as much as possible to make
the first argument the one normally passed through a pipeline. R fairly
recently added a native pipe operator that may be faster than the magrittr
pipe but in some ways makes some functionality harder. The rest of R has not
really been changed to make using commands in pipelines easy.
You seem to have also looked at data.table, and given that you may have large
amounts of data, its design might also be beneficial.
But as I do not want to relearn lots of R functions I never use, I will bow
out from further discussion, as what I would offer these days would probably
not be what you want. My personal opinion is that proper use of R can actually
be far easier and more flexible than what you had with the proprietary
software, which may largely consist of canned, frequently used reports.
I do want to point out a few things to consider.
When you go grouping, you may want to consider grouping (as well as sorting)
by multiple variables. You mention a variable with about 500 possibilities and
then another variable with an ID number, but did not say the ID number was
unique across them all.
You may also want to look into testing the sanity of your data. That is a wide
area too. Things like duplicates, for example.
I do not know how many steps you can handle, but there are sometimes designs
that make an algorithm work differently. Consider your request to find the
first row in each grouping and add a column with a 1, and 0 for all others. If
that is what you need, fine. But what if instead you just added a row number
within each group? Some rows would have a 1, and some may have a 2, 3, or 4.
When you wanted to do something to just the rows with a 1, you could filter
out a subset of the data easily enough or apply a command only to those rows.
And if you want to test whether any entry has more than 4 rows, this would
allow you to detect an error. Other ideas might be possible if that is how the
data was saved.
And if it really is a 0/1 choice, fine, but consider the advantages and
disadvantages of what you save in the new column. Storing a numeric or an int
can take up space when a Boolean TRUE/FALSE is all you need. R gives you lots
of flexibility which perhaps you did not have to think about before.
All I know is that so much of what you want to do is easily enough done with a
pipeline or two in dplyr, which specializes in group analysis, generating
reports and so on. It may not be how you think. But this is your task and you
choose what makes sense.
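As a rough illustration of the row-number idea above, here is a minimal base R sketch. It reuses the ID vector from John's example; the column names rownum and is_first and the threshold of 4 are assumptions chosen only for illustration, and base R's ave() is used here in place of the dplyr pipeline mentioned in the message.

```r
# Sample IDs from the original question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
olddata <- data.frame(ID = ID)

# Row number within each ID group (1, 2, 3, ...) instead of a 0/1 flag
olddata$rownum <- ave(seq_along(olddata$ID), olddata$ID, FUN = seq_along)

# The first-row flag can be derived from the row number and kept as a logical
olddata$is_first <- olddata$rownum == 1L

# Sanity check: which IDs have more than 4 rows? (4 is an arbitrary threshold)
too_many <- unique(olddata$ID[olddata$rownum > 4L])
print(too_many)

# Work on the first rows only, without a dedicated 0/1 column
first_rows <- olddata[olddata$is_first, ]
print(first_rows)
```

On storage: in R a logical vector uses the same space per element as an integer vector, so the saving from a logical is relative to the double vector that as.numeric() produces.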
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
> R. This is the part of my analysis that I asked the R community to help me
> write. I know how this can be easily done in SAS:
>
> * Arrange data by date and time within each geographic region (i.e. lat_lon
>   and daytime);
> proc sort data=mydata;
>   by lat_lon daytime;
> run;
>
> data mydata;
>   /* For each geographic area, each time a new record is read, keep the
>      preceding value of daynum */
>   retain daynum;
>   set mydata;
>   by lat_lon daytime;
>   /* initialize daynum to 0 for first record from a given geographic
>      location */
>   if _n_ eq 1 then daynum=0;
>   /* Determine start of each day */
>   mytime = timepart(daytime); /* Extract time from date-time constant */
>   if mytime eq '00:00:00't then daynum=daynum+1; /* Increment daynum for
>                                                     each new day */
> run;
>
> 4) Get the average value for a pollutant, pm25, by day across all 500
> geographic areas. This is easily done in SAS using proc sort and proc means.
> proc sort data=mydata;
>   by daynum;
> run;
>
> * Get mean pm 2.5 by day across all 500 geographic regions.;
> proc means data=mydata;
>   by daynum;
>   var pm25;
> run;
>
> If I can get step (3) above accomplished in R, I know how to accomplish step
> 4) in R using the by function:
> by(mydata[,"pm25"], mydata[,"daynum"], mean)
>
> I am trying to write the analysis described, and written in SAS, for 3) above
> in R. Please understand that I am fluent in SAS, and (except for
> straightforward analyses that require little or no data manipulation, where
> I am an intermediate programmer) I am an R tyro.
>
> Thank you for your help. My apologies for the long description of what I am
> trying to do. I sent this because you asked what I was trying to do and why I
> was doing it from the perspective of a SAS programmer rather than a
> matrix-based R programmer.
>
> John David Sorkin M.D., Ph.D.
>
> From: Bert Gunter
> Sent: Saturday, November 30, 2024 11:33 PM
> To: Sorkin, John
> Cc: [email protected] ([email protected])
> Subject: Re: [R] Identify first row of each ID within a data frame, create a
> variable first =1 for the first row and first=0 of all other rows
>
> May I ask *why* you want to do this?
>
> It sounds to me like you're using SAS-like strategies for your
> data analysis rather than R-like.
>
> -- Bert
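For readers following the SAS description, here is one possible base R sketch of steps 3) and 4). It is not taken from any reply in the thread. The column names lat_lon, daytime, pm25 and daynum come from the message; the toy values, the UTC time zone, and the assumption that every region has a record stamped exactly at midnight are made up for illustration. The day counter here restarts within each region, which is what the SAS comments describe.

```r
# Hypothetical toy data: two regions, readings at midnight and noon on two days
mydata <- data.frame(
  lat_lon = rep(c("A", "B"), each = 4),
  daytime = as.POSIXct(rep(c("2024-01-01 00:00:00", "2024-01-01 12:00:00",
                             "2024-01-02 00:00:00", "2024-01-02 12:00:00"), 2),
                       tz = "UTC"),
  pm25 = c(10, 12, 20, 22, 30, 32, 40, 42)
)

# Step 3a: arrange by region, then by date-time (the proc sort step)
mydata <- mydata[order(mydata$lat_lon, mydata$daytime), ]

# Step 3b: a record at 00:00:00 starts a new day; a running count of those
# records within each region gives the day number
is_midnight <- format(mydata$daytime, "%H:%M:%S", tz = "UTC") == "00:00:00"
mydata$daynum <- ave(as.integer(is_midnight), mydata$lat_lon, FUN = cumsum)

# Step 4: mean pm25 by day across all regions
daily_mean <- tapply(mydata$pm25, mydata$daynum, mean)
print(daily_mean)
```

The vectorised cumsum() replaces the SAS retain statement: no explicit look-back at the previous row is needed.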
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
Rui:
"Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest."
But the explicit version of diff is still considerably faster:
> D <- c(rep(1,10),rep(2,6),rep(3,2))
> microbenchmark(c(1L, diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000
> microbenchmark(as.integer(!duplicated(D)), times = 1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000
> microbenchmark(D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000
Cheers,
Bert
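One caveat worth illustrating about the difference-based versions benchmarked above: they return 0/1 only when the IDs are sorted and each new ID is exactly 1 larger than the previous one, as in the example data. The sketch below uses a made-up vector D2 with gaps between IDs to show the difference, and a comparison with the previous element as one possible alternative for sorted data. It is offered as an illustration, not as something proposed in the thread.

```r
# Hypothetical sorted IDs that do not increase by exactly 1
D2 <- c(1, 1, 5, 5, 5, 9)

# The difference-based version returns the size of each jump, not a 0/1 flag
jumps <- c(1L, diff(D2))
print(jumps)

# Comparing each element with its predecessor flags the start of every run
first <- as.integer(c(TRUE, D2[-1L] != D2[-length(D2)]))
print(first)

# duplicated() gives the same flags and does not need sorted input
print(as.integer(!duplicated(D2)))
```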
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
olddata$first <- as.numeric(! duplicated(olddata$ID))
:-)
> On Nov 30, 2024, at 22:27, Sorkin, John wrote:
>
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>           rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
> class(olddata)
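To see why the one-liner works, the following sketch runs it step by step on the data from the question. The intermediate name dup is introduced here only for illustration, and as.integer() is used in place of as.numeric() to keep the flag as an integer.

```r
# Data from the original question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# duplicated() is FALSE the first time a value appears and TRUE afterwards
dup <- duplicated(olddata$ID)

# Negating and converting turns "not seen before" into 1 and repeats into 0
olddata$first <- as.integer(!dup)

# Rows where a new ID starts
print(which(olddata$first == 1L))
print(olddata)
```

Because duplicated() tracks values rather than positions, the flag marks the first occurrence of each ID even if the rows are not sorted by ID.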
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
At 02:27 on 01/12/2024, Sorkin, John wrote:
Dear R help folks,
First, my apologies for sending several related questions to the list server. I
am trying to learn how to manipulate data in R . . . and am having difficulty
getting my program to work. I greatly appreciate the help and support list
members give!
I am trying to write a program that will run through a data frame organized by
ID and for the first line of each new group of data lines that has the same ID
create a new variable first that will be 1 for the first line of the group and
0 for all other lines.
e.g. if my original data is
olddata
ID date
1 1
1 1
1 2
1 2
1 3
1 3
1 4
1 4
1 5
1 5
2 5
2 5
2 5
2 6
2 6
2 6
3 10
3 10
the new data will be
newdata
ID date first
1 1 1
1 1 0
1 2 0
1 2 0
1 3 0
1 3 0
1 4 0
1 4 0
1 5 0
1 5 0
2 5 1
2 5 0
2 5 0
2 6 0
2 6 0
2 6 0
3 10 1
3 10 0
When I run the program below, I receive the following error:
Error in df[, "ID"] : incorrect number of dimensions
My code:
# Create data.frame
ID <- c(rep(1,10),rep(2,6),rep(3,2))
date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
rep(5,3),rep(6,3),rep(10,2))
olddata <- data.frame(ID=ID,date=date)
class(olddata)
cat("This is the original data frame","\n")
print(olddata)
# This function is supposed to identify the first row
# within each level of ID and, for the first row, set
# the variable first to 1, and for all rows other than
# the first row set first to 0.
mydoit <- function(df){
value <- ifelse (first(df[,"ID"]),1,0)
cat("value=",value,"\n")
df[,"first"] <- value
}
newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
Thank you,
John
John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical
Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of
Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;
Division of Gerontology and Paliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382
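A note on the error reported above, offered as a sketch rather than as any poster's solution. aggregate() applies its function to each column of each group separately, so mydoit receives a plain vector instead of a data frame, and df[,"ID"] then fails with "incorrect number of dimensions". In addition, first() is not a base R function, and mydoit does not return the modified data frame. A split/apply version that does pass whole data frames might look like this; the helper name flag_first is hypothetical.

```r
# Sample data from the question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# Hypothetical helper: receives one group's data frame and returns it
# with the new column added (the data frame must be returned explicitly)
flag_first <- function(df) {
  df$first <- as.integer(seq_len(nrow(df)) == 1L)
  df
}

# split() hands whole data frames to the helper;
# unsplit() puts the pieces back in the original row order
pieces <- lapply(split(olddata, olddata$ID), flag_first)
newdata <- unsplit(pieces, olddata$ID)
print(newdata)
```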
Hello,
And here are two other solutions.
olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))
olddata$first <- c(1L, diff(olddata$ID))
Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest. And it can be made a little faster if
as.integer substitutes for as.numeric.
And dplyr::mutate now has a .by argument, which avoids the explicit call
to group_by, with a performance gain.
library(dplyr)  # needed for %>%, group_by, mutate and row_number
library(microbenchmark)
mb <- microbenchmark(
  ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
  dup_num = as.numeric(! duplicated(olddata$ID)),
  dup_int = as.integer(! duplicated(olddata$ID)),
  diff = c(1L, diff(olddata$ID)),
  dplyr_grp = olddata %>% group_by(ID) %>%
    mutate(first = as.integer(row_number() == 1)),
  dplyr = olddata %>% mutate(first = as.integer(row_number() == 1), .by = ID)
)
print(mb, order = "median")
However, note that dplyr operates on entire data frames and is therefore
expected to be slower when tested against instructions that process one
column only.
Hope this helps,
Rui Barradas
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
I was wondering along similar lines, Bert.
One way to get help is to ask how to do some single step of a larger strategy.
That can lead to answers that may not be as applicable to the scenario.
Another way would be to include a synopsis of what they are trying to do. But,
as John says he is trying to learn and improve his abilities, perhaps he is
getting what he wants.
After watching some of the exchanges in multiple questions, many seem to
revolve around a wish to deal with sorted, grouped data. He seems to have
looked at some base R methods as well as packages like dplyr using tibbles, as
well as another package and format.
What interests me from a dplyr perspective is how many little embedded
functions it makes available, and some have been mentioned here. If you want
to add a column that contains the same value for each group, such as the
minimum, mean, first and many other things, it is very easily doable. The
latest request seems to be a bit different, as it wants a column with a 1
(presumably for TRUE) only for the first entry in the group. Again, this is
fairly easy using one of several hooks, such as the row number being 1 versus
not.
There are many variations on the answer supplied depending on style and need,
such as making a column that contains the row number and, in a later step,
setting those that are not a one to zero.
But sometimes you want to ask what the overall algorithm is. Do you need extra
columns to then use for some purpose, or could that purpose have been achieved
another way, such as doing some calculation only when the row number is one?
As noted, R makes some operations fairly natural, in ways that differ from the
"natural" way another program/environment does it. Sometimes a translation is
not worth doing as compared to a reworked algorithm that makes good use of
whichever package and related functionality you want to use.
Assuming all these questions relate to the same project, I am not clear if and
where the lookback at the previous row/value fits.
Of course, John may not be free to share more in public.
Anyone want to suggest a book or two on data processing of this sort using R
that might illustrate with examples galore on how various problems are solved
and then perhaps some will be similar enough ...
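The per-group columns described above (minimum, mean, first value) can be sketched as follows. The message discusses dplyr; this illustration swaps in base R's ave() so that it runs without extra packages, and the new column names are assumptions made for the example.

```r
# Data from the original question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# Columns holding one value per group, repeated on every row of the group
olddata$min_date   <- ave(olddata$date, olddata$ID, FUN = min)
olddata$mean_date  <- ave(olddata$date, olddata$ID, FUN = mean)
olddata$first_date <- ave(olddata$date, olddata$ID, FUN = function(x) x[1L])

# Alternative to an extra flag column: select the first rows directly
first_rows <- olddata[!duplicated(olddata$ID), ]
print(first_rows)
```

Selecting the first rows directly, as in the last step, is one way to do "some calculation only when the row number is one" without storing a flag.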
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
May I ask *why* you want to do this?
It sounds to me like you're using SAS-like strategies for your
data analysis rather than R-like.
-- Bert
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
Sorry, for completeness:
library(dplyr)
olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1))
--Chris Ryan
Christopher W. Ryan wrote:
> Personally, I'd do this in the tidyverse with dplyr and its row_number()
> function.
>
> olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1))
>
> --Chris Ryan
>
> Sorkin, John wrote:
>> ID <- c(rep(1,10),rep(2,6),rep(3,2))
>> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>>           rep(5,3),rep(6,3),rep(10,2))
>> olddata <- data.frame(ID=ID,date=date)
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
Personally, I'd do this in the tidyverse with dplyr and its row_number()
function.
olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1))
--Chris Ryan
Sorkin, John wrote:
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>           rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
I think as.numeric(! duplicated(group)) might do this for you ...
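As a sketch of how this suggestion extends, duplicated() can also flag the last row of each group through its fromLast argument, and it accepts a data frame when the group is defined by more than one column. The column names last and first_id_date below are assumptions chosen for illustration.

```r
# Data from the original question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# Last row of each ID: scan for duplicates from the end instead of the start
olddata$last <- as.integer(!duplicated(olddata$ID, fromLast = TRUE))

# First row of each ID/date combination: pass both columns as a data frame
olddata$first_id_date <- as.integer(!duplicated(olddata[c("ID", "date")]))

print(olddata)
```

For a SAS programmer, these play roughly the roles of last.ID and of first.date under a two-variable by statement.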
__
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

