Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-12-02 Thread Sorkin, John
Dear Colleagues,

I am grateful to all of you for helping me with my question, how to write R 
code that will identify the first row of each ID within a data frame, create a 
variable first=1 for the first row and first=0 for all repeats of the ID.

WOW!!!
I just saw Boris Steipe's answer to my question:
olddata$first <- as.numeric(! duplicated(olddata$ID))
The solution is elegant, short, easy to understand, and it uses base R! All 
important characteristics of a good solution, at least for me. While I want to 
learn solutions using packages that extend base R, I believe that a good 
programmer learns how to do something using the base language and once that is 
learned, explores ways to solve a programming problem using advanced packages.

Each and every one of you (I hope I did not miss anyone in my list of email 
addresses) took the time to read my emails and respond to me. Your collective 
help is invaluable, and I am in your collective debt.

Many, many thanks,
John

John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical 
Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of 
Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;

Division of Gerontology and Paliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382





From: Bert Gunter 
Sent: Sunday, December 1, 2024 11:30 AM
To: Rui Barradas
Cc: Sorkin, John; [email protected] ([email protected])
Subject: Re: [R] Identify first row of each ID within a data frame, create a 
variable first =1 for the first row and first=0 of all other rows

Rui:
"Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:

> D <- c(rep(1,10),rep(2,6),rep(3,2))

> microbenchmark(c(1L,diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000

> microbenchmark( as.integer(!duplicated(D)), times =1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000

> microbenchmark( D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds  ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000

Cheers,
Bert

On Sat, Nov 30, 2024 at 11:05 PM Rui Barradas  wrote:
>
> At 02:27 on 01/12/2024, Sorkin, John wrote:
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list 
> > server. I am trying to learn how to manipulate data in R . . . and am 
> > having difficulty getting my program to work. I greatly appreciate the help 
> > and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized 
> > by ID and for the first line of each new group of data lines that has the 
> > same ID create a new variable first that will be 1 for the first line of 
> > the group and 0 for all other lines.
> >
> > e.g. if my original data is
> >   olddata
> > ID date
> >  1 1
> >  1 1
> >  1 2
> >  1 2
> >  1 3
> >  1 3
> >  1 4
> >  1 4
> >  1 5
> >  1 5
> >  2 5
> >  2 5
> >  2 5
> >  2 6
> >  2 6
> >  2 6
> >  3   10
> >  3   10
> >
> > the new data will be
> > newdata
> > ID date  first
> >  1 1   1
> >  1 1   0
> >  1 2   0
> >  1 2   0
> >  1 3   0
> >  1 3   0
> >  1 4   0
> >  1 4   0
> >  1 5   0
> >  1 5   0
> >  2 5   1
> >  2 5   0
> >  2 5   0
> >  2 6   0
> >  2 6   0
> >  2 6   0
> >  3   10   1
> >  3   10   0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> > # Create data.frame
> > ID <- c(rep(1,10),rep(2,6),rep(3,2))
> > date &l

Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-12-02 Thread avi.e.gross
John,

Thanks for enlightening us so we better understand.

I won't argue with your wish to learn to do things in base R first. I started 
that way, myself, and found lots of the commands not particularly easy to fit 
into a single worldview. Many functions I read about were promptly forgotten, 
especially those without great documentation and not enough examples of real 
world usage.

This is why some packages that came later are important as they generally try 
to come up with a somewhat consistent set of tools that often are also faster 
and more flexible. There is often a set of reasons various packages are created 
in the first place to meet real needs. And, I note that some may be subtle. 
Original R was often inconsistent in the order of command arguments while the 
dplyr and other tidyverse commands try as much as possible to make the first 
argument be the one normally passed through a pipeline. R fairly recently added 
a native pipe operator that may be faster than the magrittr pipe but in some 
ways makes some functionality harder. The rest of R has not really been changed 
to make using commands in pipelines easy.

You seem to have also looked at data.table and given you may have large amounts 
of data, it may be designed in ways that might also be beneficial.

But as I do not want to relearn lots of R functions I never use, I will bow out 
from further discussion as what I would offer these days would probably not be 
what you want.

My personal opinion is that proper use of R can actually be far easier and more 
flexible than you had with the proprietary software that may largely consist of 
canned reports often used.

I do want to point out a few things to consider.

When you go grouping, you may want to consider grouping (as well as sorting) by 
multiple variables. You mention a variable with about 500 possibilities and 
then another variable with an ID number but did not say the ID number was 
unique across them all. 
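A minimal base R sketch of that point, assuming a hypothetical second grouping column called `site` (not in the original data) under which the same ID number is reused. `duplicated()` also accepts a data frame, so it can flag the first row of each combination:

```r
# Hypothetical data: ID numbers 1 and 2 are reused at two sites
dat <- data.frame(
  site = c("A", "A", "A", "B", "B", "B"),
  ID   = c(1, 1, 2, 1, 2, 2)
)

# Grouping on ID alone treats site B's ID 1 as a repeat of site A's ID 1
dat$first_by_id <- as.integer(!duplicated(dat$ID))

# Grouping on the (site, ID) pair flags the first row of each combination
dat$first_by_both <- as.integer(!duplicated(dat[c("site", "ID")]))

print(dat)
```

If the ID really is unique across all values of the other variable, the two columns come out identical, which is itself a cheap sanity check.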

And, I want to note you may want to also look into testing the sanity of your 
data. That is a wide area too. Things like duplicates, for example.

I do not know how many steps you can handle but there are sometimes designs 
that make an algorithm work differently.

Consider your request to find  the first row in each grouping and add a column 
with a 1, and 0 for all others. If that is what you need, fine.

But, what if instead you just added a row number. Some rows would have a 1, and 
some may have a 2, 3, or 4.

When you want to do something to just the rows with a 1, you can filter out 
a subset of the data easily enough or apply a command only to those rows. But 
if you want to test if any entry has more than 4 rows, this could allow you to 
detect an error. Other ideas might be possible if that is how the data was 
saved.
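The row-number idea can be sketched in base R roughly as follows. This is one possible way to do it, using `ave()` with `seq_along`; the column name `rownum` and the threshold of 4 are only illustrative:

```r
# The ID column from the thread
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
olddata <- data.frame(ID = ID)

# Row number within each ID group: 1, 2, 3, ... restarting at each new ID
olddata$rownum <- ave(seq_along(olddata$ID), olddata$ID, FUN = seq_along)

# The 0/1 flag can still be derived whenever it is needed
olddata$first <- as.integer(olddata$rownum == 1L)

# Sanity check: which IDs have more than 4 rows?
group_sizes <- tapply(olddata$rownum, olddata$ID, max)
too_big <- names(group_sizes)[group_sizes > 4]
print(too_big)
```

The row number carries strictly more information than the flag, so keeping it and deriving the flag on demand loses nothing.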

And, if it really is a 0/1 choice, fine, but consider the advantages or 
disadvantages of what you save in the new column. Storing a numeric or an int 
can take up space when storing a Boolean or TRUE/FALSE is what you need. R 
gives you lots of flexibility which perhaps you did not have to think about 
before.
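A quick way to look at the storage question, offered as a rough check rather than a rule. As far as I know, R stores a logical vector with the same number of bytes per element as an integer vector, while a numeric (double) vector takes about twice as much; the vector length of one million is arbitrary:

```r
n <- 1e6
flag_lgl <- rep(c(TRUE, FALSE), length.out = n)  # TRUE/FALSE flag
flag_int <- as.integer(flag_lgl)                 # 1L/0L flag
flag_num <- as.numeric(flag_lgl)                 # 1/0 flag stored as double

# Size in bytes of each representation
sizes <- c(
  logical = as.numeric(object.size(flag_lgl)),
  integer = as.numeric(object.size(flag_int)),
  numeric = as.numeric(object.size(flag_num))
)
print(sizes)

# A logical flag still sums and subsets conveniently
n_first <- sum(flag_lgl)
```

So the saving comes mainly from avoiding `as.numeric()`; choosing between logical and integer is more about how the column will be used later.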

All I know is that so much of what you want to do is easily enough done with a 
pipeline or two in dplyr. But this is your task and you choose what makes 
sense. It specializes in group analysis and generates reports and so on. It may 
not be how you think. 


-Original Message-
From: Sorkin, John  
Sent: Sunday, December 1, 2024 11:19 PM
To: Bert Gunter ; Rui Barradas ; 
[email protected]; [email protected]; Bert Gunter ; 
[email protected]; [email protected]; [email protected]; 
[email protected]; [email protected]; [email protected]; 
[email protected]; [email protected]
Cc: [email protected] ([email protected]) ; 
[email protected]
Subject: Re: [R] Identify first row of each ID within a data frame, create a 
variable first =1 for the first row and first=0 of all other rows

Dear Colleagues,

I am grateful to all of you for helping me with my question, how to write R 
code that will identify the first row of each ID within a data frame, create a 
variable first=1 for the first row and first=0 for all repeats of the ID.

WOW!!!
I just saw Boris Steipe's answer to my question:
olddata$first <- as.numeric(! duplicated(olddata$ID))
The solution is elegant, short, easy to understand, and it uses base R! All 
important characteristics of a good solution, at least for me. While I want to 
learn solutions using packages that extend base R, I believe that a good 
programmer learns how to do something using the base language and once that is 
learned, explores ways to solve a programming problem using advanced packages.

Each and every one of you (I hope I did not miss anyone in my list of email 
addresses) took the time to read my emails and respond to me. Your collective 
help is invaluable, and I am in your collective debt.

Many, many thanks,
John

John David Sorkin M.D., Ph.D.
Professo

Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-12-01 Thread Bert Gunter
 R. This is the part of my 
> analysis that I asked the R community to help me write. I know how this 
> can be easily done in SAS:
> * Arrange data date and time within each geographic region (i.e. lat_lon and 
> daytime);
> proc sort data=mydata;
>   by lat_lon daytime;
> run;
>
> data mydata;
>   /* For each geographic area, each time a new record is read, keep the 
> preceding value of daynum */
>   retain daynum;
>   set mydata;
>   by lat_lon daytime;
>   /* initialize daynum to 0 for first record from a given geographic 
> location */
>   if _n_ eq 1 then daynum=0;
>   /* Determine start of each day */
>   mytime = timepart(daytime);  /* Extract time from date-time constant */
>   if mytime eq '00:00:00't then daynum=daynum+1; /* Increment daynum for each 
> new day */
> run;
>
> 4) Get average value for a pollutant, pm25, by day across all 500 geographic 
> areas. This is easily done in SAS using proc sort and proc means.
> proc sort data=mydata;
>   by daynum;
> run;
>
> * Get mean pm 2.5 by day across all 500 geographic regions.;
> proc means data=mydata;
>   by daynum;
>   var pm25;
> run;
>
> If I can get step (3) above accomplished in R, I know how to accomplish step 
> 4) in R using the by function:
> by(mydata[,"pm25"], mydata[,"daynum"],mean)
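One way step 3) might look in base R, as a sketch under stated assumptions rather than a translation of the SAS program. The toy data, the UTC time zone, and the choice to restart the counter within each `lat_lon` (following the SAS comment about the first record of a geographic location) are all assumptions; only the column names `lat_lon`, `daytime`, `pm25` and `daynum` come from the message.

```r
# Hypothetical data: two regions, readings every 12 hours over two days
times <- as.POSIXct(c("2024-01-01 00:00:00", "2024-01-01 12:00:00",
                      "2024-01-02 00:00:00", "2024-01-02 12:00:00"),
                    tz = "UTC")
mydata <- data.frame(
  lat_lon = rep(c("A", "B"), each = 4),
  daytime = c(times, times),
  pm25    = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Equivalent of proc sort: order by region, then by date-time
mydata <- mydata[order(mydata$lat_lon, mydata$daytime), ]

# TRUE where the time part is exactly midnight
is_midnight <- format(mydata$daytime, "%H:%M:%S", tz = "UTC") == "00:00:00"

# Running count of midnights within each region plays the role of RETAIN
mydata$daynum <- ave(as.integer(is_midnight), mydata$lat_lon, FUN = cumsum)

# Step 4: mean pm25 by day across all regions
daily_mean <- tapply(mydata$pm25, mydata$daynum, mean)
print(daily_mean)
```

The cumulative sum replaces the row-by-row loop: no "first row" flag is needed, because `ave()` restarts the count for each region on its own.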
>
> I am trying to write the analysis described, and written in SAS, for 3) above 
> in R. Please understand that I am fluent in SAS, and (except for straight 
> forward analyses that require little or no data manipulation, where I am an 
> intermediate programmer) I am an R tyro.
>
> Thank you for your help. My apologies for the long description of what I am 
> trying to do. I sent this because you asked what I was trying to do and why I 
> was doing it from the perspective of a SAS programmer rather than a 
> matrix-based R programmer.
>
> John David Sorkin M.D., Ph.D.
> Professor of Medicine, University of Maryland School of Medicine;
> Associate Director for Biostatistics and Informatics, Baltimore VA Medical 
> Center Geriatrics Research, Education, and Clinical Center;
> PI Biostatistics and Informatics Core, University of Maryland School of 
> Medicine Claude D. Pepper Older Americans Independence Center;
> Senior Statistician University of Maryland Center for Vascular Research;
>
> Division of Gerontology and Paliative Care,
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> Cell phone 443-418-5382
>
>
>
>
> 
> From: Bert Gunter 
> Sent: Saturday, November 30, 2024 11:33 PM
> To: Sorkin, John
> Cc: [email protected] ([email protected])
> Subject: Re: [R] Identify first row of each ID within a data frame, create a 
> variable first =1 for the first row and first=0 of all other rows
>
> May I ask *why* you want to do this?
>
> It sounds to me like you're using SAS-like strategies for your
> data analysis rather than R-like.
>
> -- Bert
>
> -- Bert
>
> On Sat, Nov 30, 2024 at 6:27 PM Sorkin, John  
> wrote:
> >
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list 
> > server. I am trying to learn how to manipulate data in R . . . and am 
> > having difficulty getting my program to work. I greatly appreciate the help 
> > and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized 
> > by ID and for the first line of each new group of data lines that has the 
> > same ID create a new variable first that will be 1 for the first line of 
> > the group and 0 for all other lines.
> >
> > e.g. if my original data is
> >  olddata
> >ID date
> > 1 1
> > 1 1
> > 1 2
> > 1 2
> > 1 3
> > 1 3
> > 1 4
> > 1 4
> > 1 5
> > 1 5
> > 2 5
> > 2 5
> > 2 5
> > 2 6
> > 2 6
> > 2 6
> > 3   10
> > 3   10
> >
> > the new data will be
> > newdata
> >ID date  first
> > 1 1   1
> > 1 1   0
> > 1 2   0
> > 1 2   0
> > 1 3   0
> > 1 3   0
> > 1 4   0
> > 1 4   0
> > 1 5   0
> > 1 5   0
> > 2 5   1
> > 2 5   0
> > 2 5   0
> > 2 6   0
> > 2 6   0
> > 2 6   0
> > 3 

Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-12-01 Thread Bert Gunter
Rui:
"Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:

> D <- c(rep(1,10),rep(2,6),rep(3,2))

> microbenchmark(c(1L,diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000

> microbenchmark( as.integer(!duplicated(D)), times =1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000

> microbenchmark( D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds  ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000

Cheers,
Bert
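A small check, not part of the post above, of when the three benchmarked expressions give the same answer. It assumes the data are sorted by ID. The second vector `E` is hypothetical and is there only to show that the diff-based forms rely on each new ID being exactly 1 larger than the previous one, and that the subtraction form also relies on the first ID being 1.

```r
# IDs as in the thread: sorted, each new ID exactly 1 larger than the last
D <- c(rep(1, 10), rep(2, 6), rep(3, 2))

via_dup  <- as.integer(!duplicated(D))
via_diff <- c(1L, diff(D))
via_lag  <- D - c(0L, D[-length(D)])

same_on_thread_data <- all(via_dup == via_diff) && all(via_dup == via_lag)

# Hypothetical IDs that jump by more than 1
E <- c(1, 1, 5, 5, 5, 9)
dup_E  <- as.integer(!duplicated(E))   # flags first occurrences
diff_E <- c(1L, diff(E))               # returns the size of each jump, not a flag

# Comparing the difference with zero turns it back into a 0/1 flag
flag_E <- as.integer(c(TRUE, diff(E) != 0))

print(rbind(dup_E, diff_E, flag_E))
```

The speed ranking in the benchmark is unaffected; the point is only that the fastest expression is also the one with the narrowest assumptions about the ID values.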

On Sat, Nov 30, 2024 at 11:05 PM Rui Barradas  wrote:
>
> At 02:27 on 01/12/2024, Sorkin, John wrote:
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list 
> > server. I am trying to learn how to manipulate data in R . . . and am 
> > having difficulty getting my program to work. I greatly appreciate the help 
> > and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized 
> > by ID and for the first line of each new group of data lines that has the 
> > same ID create a new variable first that will be 1 for the first line of 
> > the group and 0 for all other lines.
> >
> > e.g. if my original data is
> >   olddata
> > ID date
> >  1 1
> >  1 1
> >  1 2
> >  1 2
> >  1 3
> >  1 3
> >  1 4
> >  1 4
> >  1 5
> >  1 5
> >  2 5
> >  2 5
> >  2 5
> >  2 6
> >  2 6
> >  2 6
> >  3   10
> >  3   10
> >
> > the new data will be
> > newdata
> > ID date  first
> >  1 1   1
> >  1 1   0
> >  1 2   0
> >  1 2   0
> >  1 3   0
> >  1 3   0
> >  1 4   0
> >  1 4   0
> >  1 5   0
> >  1 5   0
> >  2 5   1
> >  2 5   0
> >  2 5   0
> >  2 6   0
> >  2 6   0
> >  2 6   0
> >  3   10   1
> >  3   10   0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> > # Create data.frame
> > ID <- c(rep(1,10),rep(2,6),rep(3,2))
> > date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >rep(5,3),rep(6,3),rep(10,2))
> > olddata <- data.frame(ID=ID,date=date)
> > class(olddata)
> > cat("This is the original data frame","\n")
> > print(olddata)
> >
> > # This function is supposed to identify the first row
> > # within each level of ID and, for the first row, set
> > # the variable first to 1, and for all rows other than
> > # the first row set first to 0.
> > mydoit <- function(df){
> >value <- ifelse (first(df[,"ID"]),1,0)
> >cat("value=",value,"\n")
> >df[,"first"] <- value
> > }
> > newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
> >
> > Thank you,
> > John
> >
> >
> > John David Sorkin M.D., Ph.D.
> > Professor of Medicine, University of Maryland School of Medicine;
> > Associate Director for Biostatistics and Informatics, Baltimore VA Medical 
> > Center Geriatrics Research, Education, and Clinical Center;
> > PI Biostatistics and Informatics Core, University of Maryland School of 
> > Medicine Claude D. Pepper Older Americans Independence Center;
> > Senior Statistician University of Maryland Center for Vascular Research;
> >
> > Division of Gerontology and Paliative Care,
> > 10 North Greene Street
> > GRECC (BT/18/GR)
> > Baltimore, MD 21201-1524
> > Cell phone 443-418-5382
> >
> >
> >
> > __
> > [email protected] mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide 
> > https://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> Hello,
>
> And here are two other solutions.
>
>
> olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x ==
> x[1L]))
>
> olddata$first <- c(1L, diff(olddata$ID))
>
>
> Of these two, diff is faster. But of all the solutions posted so far,
> Ben Bolker's is the fastest. And it can be made a little faster if
> as.integer substitutes for as.numeric.
> And dplyr::mutate now has a .by argument, which avoids the explicit call
> to group_by, with a performance gain.
>
>
> library(microbenchmark)
>
> mb <- microbenchmark(
>ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
>dup_num = as.numeric(! duplicated(olddata$ID)),
>dup_int = as.in

Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-12-01 Thread Boris Steipe


olddata$first <- as.numeric(! duplicated(olddata$ID))


:-)




> On Nov 30, 2024, at 22:27, Sorkin, John  wrote:
> 
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>  rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
> class(olddata)



Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-11-30 Thread Rui Barradas

At 02:27 on 01/12/2024, Sorkin, John wrote:

Dear R help folks,

First, my apologies for sending several related questions to the list server. I 
am trying to learn how to manipulate data in R . . . and am having difficulty 
getting my program to work. I greatly appreciate the help and support list 
members give!

I am trying to write a program that will run through a data frame organized by 
ID and for the first line of each new group of data lines that has the same ID 
create a new variable first that will be 1 for the first line of the group and 
0 for all other lines.

e.g. if my original data is
  olddata
ID date
 1 1
 1 1
 1 2
 1 2
 1 3
 1 3
 1 4
 1 4
 1 5
 1 5
 2 5
 2 5
 2 5
 2 6
 2 6
 2 6
 3   10
 3   10

the new data will be
newdata
ID date  first
 1 1   1
 1 1   0
 1 2   0
 1 2   0
 1 3   0
 1 3   0
 1 4   0
 1 4   0
 1 5   0
 1 5   0
 2 5   1
 2 5   0
 2 5   0
 2 6   0
 2 6   0
 2 6   0
 3   10   1
 3   10   0

When I run the program below, I receive the following error:
Error in df[, "ID"] : incorrect number of dimensions

My code:
# Create data.frame
ID <- c(rep(1,10),rep(2,6),rep(3,2))
date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
   rep(5,3),rep(6,3),rep(10,2))
olddata <- data.frame(ID=ID,date=date)
class(olddata)
cat("This is the original data frame","\n")
print(olddata)
  
# This function is supposed to identify the first row

# within each level of ID and, for the first row, set
# the variable first to 1, and for all rows other than
# the first row set first to 0.
mydoit <- function(df){
   value <- ifelse (first(df[,"ID"]),1,0)
   cat("value=",value,"\n")
   df[,"first"] <- value
}
newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
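A note on the reported error, with a sketch of one base R way around it. For a data frame, aggregate() calls FUN on one column of one group at a time, so `df` inside `mydoit` is a plain vector and `df[,"ID"]` has no second dimension to index. The split/lapply/rbind alternative below is only one option and is not taken from any reply in this thread.

```r
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# Ask aggregate() what FUN actually receives: never a data frame
seen <- aggregate(olddata, list(group = olddata$ID),
                  function(x) is.data.frame(x))

# Two-subscript indexing of a plain vector reproduces the error message
msg <- tryCatch(ID[, "ID"], error = function(e) conditionMessage(e))
print(msg)

# split() does hand over one data frame per ID, so a per-group function works
pieces <- lapply(split(olddata, olddata$ID), function(d) {
  d$first <- as.integer(seq_len(nrow(d)) == 1L)
  d
})
newdata <- do.call(rbind, pieces)
rownames(newdata) <- NULL
print(newdata)
```

Note also that the last expression in a function body is its return value, so a per-group function has to end with the modified data frame (`d` here) rather than with the assignment.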

Thank you,
John


John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical 
Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of 
Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;

Division of Gerontology and Paliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382




Hello,

And here are two other solutions.


olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == 
x[1L]))


olddata$first <- c(1L, diff(olddata$ID))


Of these two, diff is faster. But of all the solutions posted so far, 
Ben Bolker's is the fastest. And it can be made a little faster if 
as.integer substitutes for as.numeric.
And dplyr::mutate now has a .by argument, which avoids the explicit call 
to group_by, with a performance gain.



library(microbenchmark)
library(dplyr)  # provides %>%, group_by(), mutate() and row_number()

mb <- microbenchmark(
  ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
  dup_num = as.numeric(! duplicated(olddata$ID)),
  dup_int = as.integer(! duplicated(olddata$ID)),
  diff = c(1L, diff(olddata$ID)),
  dplyr_grp = olddata %>% group_by(ID) %>%
    mutate(first = as.integer(row_number() == 1)),
  dplyr = olddata %>%
    mutate(first = as.integer(row_number() == 1), .by = ID)
)
print(mb, order = "median")



However, note that dplyr operates on entire data frames and is therefore 
expected to be slower when tested against instructions that process one 
column only.



Hope this helps,

Rui Barradas





Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-11-30 Thread avi.e.gross
I was wondering along similar lines, Bert.

One way to get help is to ask how to do some single step of a larger strategy. 
That can lead to answers that may not be as applicable to the scenario.

Another way would be to include a synopsis of what they are trying to do.

But, as John says he is trying to learn and improve his abilities, perhaps he is 
getting what he wants.
After watching some of the exchanges in multiple questions, many seem to 
revolve around a wish to deal with sorted grouped data. He seems to have looked 
at some base R methods as well as packages like dplyr using tibbles as well as 
another package and format.

What interests me from a dplyr perspective is how many little embedded 
functions it makes available and some have been mentioned here. If you want to  
add a column that contains the same value for each group, such as the minimum, 
mean, first and many other things, it is very easily doable.

The latest request seems to be a bit different as it wants a column with a 1 
(presumably for TRUE) only for the first entry in  the group. Again, fairly 
easy using one of several hooks such as the rownumber being "1" versus not. 
There are many variations on the answer supplied depending on style and need, 
such as making a column that contains the row number, and in a later step, set 
those to zero that are not a one. 

But sometimes you want to ask what the overall algorithm is. Do you need extra 
columns to then use for some purpose, or could that purpose have been done 
another way such as doing some calculation only when rownumber is one.
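As a sketch of that alternative in base R: if the flag column is only ever used to pick out the first row of each ID, the rows can be selected directly without adding a column. The object names `first_rows` and `first_dates` are illustrative only.

```r
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# Keep only the first row of each ID; no helper column is created
first_rows <- olddata[!duplicated(olddata$ID), ]
print(first_rows)

# Or compute something from the first row of each group in one step
first_dates <- tapply(olddata$date, olddata$ID, function(x) x[1])
print(first_dates)
```

Whether this is preferable depends on what the later steps need, which is why the overall algorithm matters.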

As noted, R makes some operations fairly natural, in ways that differ from the 
"natural" way another program/environment does it. Sometimes a translation is 
not worth doing as compared to a reworked algorithm that makes good use of 
whichever package and related functionality you want to use. 

Assuming all these questions relate to the same project, I am not clear if and 
where the lookback at previous row/value fits.

Of course, John may not be free to share more in public.

Anyone want to suggest a book or two on data processing of this sort using R 
that might illustrate with examples galore on how various problems are solved 
and then perhaps some will be similar enough ...

-Original Message-
From: R-help  On Behalf Of Bert Gunter
Sent: Saturday, November 30, 2024 11:34 PM
To: Sorkin, John 
Cc: [email protected] ([email protected]) 
Subject: Re: [R] Identify first row of each ID within a data frame, create a 
variable first =1 for the first row and first=0 of all other rows

May I ask *why* you want to do this?

It sounds to me like you're using SAS-like strategies for your
data analysis rather than R-like.

-- Bert

-- Bert

On Sat, Nov 30, 2024 at 6:27 PM Sorkin, John  wrote:
>
> Dear R help folks,
>
> First, my apologies for sending several related questions to the list server. 
> I am trying to learn how to manipulate data in R . . . and am having 
> difficulty getting my program to work. I greatly appreciate the help and 
> support list members give!
>
> I am trying to write a program that will run through a data frame organized 
> by ID and for the first line of each new group of data lines that has the 
> same ID create a new variable first that will be 1 for the first line of the 
> group and 0 for all other lines.
>
> e.g. if my original data is
>  olddata
>ID date
> 1 1
> 1 1
> 1 2
> 1 2
> 1 3
> 1 3
> 1 4
> 1 4
> 1 5
> 1 5
> 2 5
> 2 5
> 2 5
> 2 6
> 2 6
> 2 6
> 3   10
> 3   10
>
> the new data will be
> newdata
>ID date  first
> 1 1   1
> 1 1   0
> 1 2   0
> 1 2   0
> 1 3   0
> 1 3   0
> 1 4   0
> 1 4   0
> 1 5   0
> 1 5   0
> 2 5   1
> 2 5   0
> 2 5   0
> 2 6   0
> 2 6   0
> 2 6   0
> 3   10   1
> 3   10   0
>
> When I run the program below, I receive the following error:
> Error in df[, "ID"] : incorrect number of dimensions
>
> My code:
> # Create data.frame
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>   rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
> class(olddata)
> cat("This is the original data frame","\n")
> print(olddata)
>
> # This function is supposed to identify the first row
> # within each level of ID and, for the first row, set
> # the variable first to 1, and for all rows other 

Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-11-30 Thread Bert Gunter
May I ask *why* you want to do this?

It sounds to me like you're using SAS-like strategies for your
data analysis rather than R-like.

-- Bert

-- Bert

On Sat, Nov 30, 2024 at 6:27 PM Sorkin, John  wrote:
>
> Dear R help folks,
>
> First, my apologies for sending several related questions to the list server. 
> I am trying to learn how to manipulate data in R . . . and am having 
> difficulty getting my program to work. I greatly appreciate the help and 
> support list members give!
>
> I am trying to write a program that will run through a data frame organized 
> by ID and for the first line of each new group of data lines that has the 
> same ID create a new variable first that will be 1 for the first line of the 
> group and 0 for all other lines.
>
> e.g. if my original data is
>  olddata
>ID date
> 1 1
> 1 1
> 1 2
> 1 2
> 1 3
> 1 3
> 1 4
> 1 4
> 1 5
> 1 5
> 2 5
> 2 5
> 2 5
> 2 6
> 2 6
> 2 6
> 3   10
> 3   10
>
> the new data will be
> newdata
>ID date  first
> 1 1   1
> 1 1   0
> 1 2   0
> 1 2   0
> 1 3   0
> 1 3   0
> 1 4   0
> 1 4   0
> 1 5   0
> 1 5   0
> 2 5   1
> 2 5   0
> 2 5   0
> 2 6   0
> 2 6   0
> 2 6   0
> 3   10   1
> 3   10   0
>
> When I run the program below, I receive the following error:
> Error in df[, "ID"] : incorrect number of dimensions
>
> My code:
> # Create data.frame
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>   rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
> class(olddata)
> cat("This is the original data frame","\n")
> print(olddata)
>
> # This function is supposed to identify the first row
> # within each level of ID and, for the first row, set
> # the variable first to 1, and for all rows other than
> # the first row set first to 0.
> mydoit <- function(df){
>   value <- ifelse (first(df[,"ID"]),1,0)
>   cat("value=",value,"\n")
>   df[,"first"] <- value
> }
> newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
>
> Thank you,
> John
>
>
> John David Sorkin M.D., Ph.D.
> Professor of Medicine, University of Maryland School of Medicine;
> Associate Director for Biostatistics and Informatics, Baltimore VA Medical 
> Center Geriatrics Research, Education, and Clinical Center;
> PI Biostatistics and Informatics Core, University of Maryland School of 
> Medicine Claude D. Pepper Older Americans Independence Center;
> Senior Statistician University of Maryland Center for Vascular Research;
>
> Division of Gerontology and Paliative Care,
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> Cell phone 443-418-5382
>
>
>
> __
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-11-30 Thread Christopher W. Ryan via R-help
Sorry, for completeness:

library(dplyr)
olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1))

--Chris Ryan


Christopher W. Ryan wrote:
> Personally, I'd do this in the tidyverse with dplyr and its row_number()
> function.
> 
> olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1))
> 
> --Chris Ryan
> 
> Sorkin, John wrote:
>> ID <- c(rep(1,10),rep(2,6),rep(3,2))
>> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>>   rep(5,3),rep(6,3),rep(10,2))
>> olddata <- data.frame(ID=ID,date=date)



Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-11-30 Thread Christopher W. Ryan via R-help
Personally, I'd do this in the tidyverse with dplyr and its row_number()
function.

olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1))

--Chris Ryan

Sorkin, John wrote:
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>   rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)



Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

2024-11-30 Thread Ben Bolker
I think as.numeric(! duplicated(group)) might do this for you ...

On Sat, Nov 30, 2024, 9:27 PM Sorkin, John 
wrote:

> Dear R help folks,
>
> First, my apologies for sending several related questions to the list
> server. I am trying to learn how to manipulate data in R . . . and am
> having difficulty getting my program to work. I greatly appreciate the help
> and support list members give!
>
> I am trying to write a program that will run through a data frame
> organized by ID and for the first line of each new group of data lines that
> has the same ID create a new variable first that will be 1 for the first
> line of the group and 0 for all other lines.
>
> e.g. if my original data is
>  olddata
>ID date
> 1 1
> 1 1
> 1 2
> 1 2
> 1 3
> 1 3
> 1 4
> 1 4
> 1 5
> 1 5
> 2 5
> 2 5
> 2 5
> 2 6
> 2 6
> 2 6
> 3   10
> 3   10
>
> the new data will be
> newdata
>ID date  first
> 1 1   1
> 1 1   0
> 1 2   0
> 1 2   0
> 1 3   0
> 1 3   0
> 1 4   0
> 1 4   0
> 1 5   0
> 1 5   0
> 2 5   1
> 2 5   0
> 2 5   0
> 2 6   0
> 2 6   0
> 2 6   0
> 3   10   1
> 3   10   0
>
> When I run the program below, I receive the following error:
> Error in df[, "ID"] : incorrect number of dimensions
>
> My code:
> # Create data.frame
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>   rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
> class(olddata)
> cat("This is the original data frame","\n")
> print(olddata)
>
> # This function is supposed to identify the first row
> # within each level of ID and, for the first row, set
> # the variable first to 1, and for all rows other than
> # the first row set first to 0.
> mydoit <- function(df){
>   value <- ifelse (first(df[,"ID"]),1,0)
>   cat("value=",value,"\n")
>   df[,"first"] <- value
> }
> newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
>
> Thank you,
> John
>
>
> John David Sorkin M.D., Ph.D.
> Professor of Medicine, University of Maryland School of Medicine;
> Associate Director for Biostatistics and Informatics, Baltimore VA Medical
> Center Geriatrics Research, Education, and Clinical Center;
> PI Biostatistics and Informatics Core, University of Maryland School of
> Medicine Claude D. Pepper Older Americans Independence Center;
> Senior Statistician University of Maryland Center for Vascular Research;
>
> Division of Gerontology and Paliative Care,
> 10 North Greene Street
> 
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> Cell phone 443-418-5382
>
>
>
> __
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

