Re: [R] create groups from data with duplicates, such that each group has a duplicate represented once
Dear Petr,

Thank you for the guidance. A colleague managed to solve it. I'll definitely use dput() for future postings.

Regards
------
Kevin Wamae

On 17/01/2019, 03:57, "PIKAL Petr" wrote:

Hi

Instead of an attachment, which is usually removed, you should use dput. Something like the output from

dput(head(yourdata,30))

To remove duplicate values, see ?unique or ?duplicated.

Cheers
Petr

> -----Original Message-----
> From: R-help On Behalf Of Kevin Wamae
> Sent: Thursday, January 17, 2019 1:29 AM
> To: r-help@r-project.org
> Subject: [R] create groups from data with duplicates, such that each group has
> a duplicate represented once
>
> Hi, I have a sequencing run with ~3000 samples (attached dataset). The
> samples were initially tagged and amplified by PCR in duplicate. The tags used
> range from MID01 to MID26.
>
> MID01-MID13 were used for pair 1 while MID14-MID26 were used for pair 2.
> The tags are re-used to allow samples to be pooled.
>
> The pooling process will involve mixing samples with MID01-26 into the first
> group, the next group samples with MID01-26 into the second group and so on.
>
> I'm hoping to get an R script that can create these groups such that for each
> group, any of the Tags appears only once. An example is shown below.
>
> ID   TagA   TagB   group
> 180  MID03  MID10  group1
> 181  MID04  MID06  group1
> 182  MID05  MID07  group1
> 183  MID03  MID09  group2
> 184  MID04  MID10  group2
> 185  MID05  MID06  group2
> 186  MID01  MID06  group3
> 187  MID02  MID07  group3
> 188  MID03  MID08  group3
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
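For readers following Petr's advice, here is a minimal sketch of sharing data with dput() and of checking for repeated values. The data frame `samples` and its values are made up for illustration (loosely modelled on the ID/TagA/TagB layout in the question); they are not the poster's data.

```r
# Hypothetical three-row sample sheet (illustrative values only)
samples <- data.frame(
  ID   = 180:182,
  TagA = c("MID03", "MID04", "MID03"),
  TagB = c("MID10", "MID06", "MID07"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; that text goes into the e-mail
dput(head(samples, 30))

# A reader can rebuild the object from the posted text
posted  <- paste(capture.output(dput(samples)), collapse = "\n")
rebuilt <- eval(parse(text = posted))

# duplicated() flags repeats after the first occurrence; unique() drops them
dup_tagA    <- duplicated(samples$TagA)
unique_tagA <- unique(samples$TagA)
```

Unlike an attachment, the dput() text survives the list software and can be pasted straight into an R session.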
[R] create groups from data with duplicates, such that each group has a duplicate represented once
Hi,

I have a sequencing run with ~3000 samples (attached dataset). The samples were initially tagged and amplified by PCR in duplicate. The tags used range from MID01 to MID26.

MID01-MID13 were used for pair 1 while MID14-MID26 were used for pair 2. The tags are re-used to allow samples to be pooled.

The pooling process will involve mixing samples with MID01-26 into the first group, the next set of samples with MID01-26 into the second group, and so on.

I'm hoping to get an R script that can create these groups such that, within each group, any of the tags appears only once. An example is shown below.

ID   TagA   TagB   group
180  MID03  MID10  group1
181  MID04  MID06  group1
182  MID05  MID07  group1
183  MID03  MID09  group2
184  MID04  MID10  group2
185  MID05  MID06  group2
186  MID01  MID06  group3
187  MID02  MID07  group3
188  MID03  MID08  group3
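The thread does not record the colleague's solution. As one possible reading of the request, the sketch below walks the rows in order and starts a new group whenever a tag from the current row has already been used in the current group. The function name `assign_groups` is hypothetical, and the approach assumes the samples are already sorted in the order in which they should be pooled.

```r
# Greedy sketch: open a new group as soon as a tag would repeat within the current one.
# 'assign_groups' is an illustrative name, not a function from any package.
assign_groups <- function(tagA, tagB) {
  group <- integer(length(tagA))
  seen  <- character(0)   # tags already used in the current group
  g     <- 1L
  for (i in seq_along(tagA)) {
    tags <- c(tagA[i], tagB[i])
    if (any(tags %in% seen)) {
      g    <- g + 1L      # a tag repeats, so start the next group
      seen <- character(0)
    }
    seen     <- c(seen, tags)
    group[i] <- g
  }
  paste0("group", group)
}

# The nine example rows from the post
runs <- data.frame(
  ID   = 180:188,
  TagA = c("MID03", "MID04", "MID05", "MID03", "MID04", "MID05", "MID01", "MID02", "MID03"),
  TagB = c("MID10", "MID06", "MID07", "MID09", "MID10", "MID06", "MID06", "MID07", "MID08"),
  stringsAsFactors = FALSE
)
runs$group <- assign_groups(runs$TagA, runs$TagB)
print(runs)
```

On the example rows this reproduces the group1/group2/group3 labels shown in the post. A greedy pass like this does not try to minimise the number of groups; it only guarantees that no tag repeats within a group.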
Re: [R] alternative for multiple if_else statements
Dear Ellison,

thank you for the feedback, we replaced dplyr::if_else with dplyr::case_when and it seems to do the trick. Still, we have to write several statements to match all the respective years but it's working. Let me see how we can implement your suggestion.

Regards
--
Kevin Wamae

On 26/02/2018, 14:57, "S Ellison" <s.elli...@lgcgroup.com> wrote:

That many ifelse statements is obviously rather a pain. Would you not have got what you want with

paste("survey", year, sep="_")

?

If that is not what you're looking for (eg because 'year' is the observation year and not the study start year), perhaps something that picks the minimum year for a subject or other relevant group might work? For example

paste("survey", ave(year, studyno, FUN=min), sep="_")

S Ellison

> -----Original Message-----
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Kevin Wamae
> Sent: 21 February 2018 20:34
> To: R-help@r-project.org
> Subject: [R] alternative for multiple if_else statements
>
> Hi, I am having trouble trying to figure out why if_else is behaving the way it is,
> it may be my code or the way the data is structured.
>
> Below is a snapshot of a database am working on and it represents a
> longitudinal survey of study participants in a trial with weekly follow up.
>
> The variable "survey_start" represents the start of the study-defined one year
> follow up (which we called "survey_year").
>
> I am trying to populate all subsequent entries for each participant, per survey
> year, with the entry "survey" followed by an underscore and the respective
> year, eg. survey_2014.
>
> There are missing entries such as the participant represented here, wasn't
> available at the start of the 2015 survey. Also, some participants don’t have
> complete one-year follow ups but I still need to include them.
>
> I have written two codes, first one fails while the second works, the only
> difference being I have reversed the order in which the entries are populated in
> the second code (from 2007-2016 to 2016-2007) and removed the if_else
> statement for 2015. Also noticed, that for the second code, which spans the
> years 2007-2016 (less 2015), if a participants entries start from 2010-2016, the
> code fails.
>
> Kindly assist in figuring this out...or better yet, an alternative.
>
> trialData <- structure(list(study = c("site_1", "site_1", "site_1", "site_1", [...]
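To make S Ellison's two suggestions concrete, here is a small sketch on a made-up data frame. The object `toy` and its values are assumptions for illustration; only the column names `studyno` and `year` come from the thread.

```r
# Hypothetical data: two participants followed over two calendar years each
toy <- data.frame(
  studyno = c("child_1", "child_1", "child_2", "child_2"),
  year    = c(2014L, 2015L, 2010L, 2011L),
  stringsAsFactors = FALSE
)

# Suggestion 1: label each row by its own observation year
toy$survey_by_row <- paste("survey", toy$year, sep = "_")

# Suggestion 2: label each row by the earliest year seen for that participant;
# ave() applies FUN within each studyno group and returns a vector aligned with the rows
toy$survey_by_subject <- paste("survey", ave(toy$year, toy$studyno, FUN = min), sep = "_")

print(toy)
```

Whether either label matches the study's definition of a survey year depends on how the follow-up periods are defined, which the thread leaves to the poster.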
Re: [R] alternative for multiple if_else statements
Dear Eric,

thank you for that observation. I realised that some of the participants have duplicated “survey_start” dates and when I corrected this, the code works.

Regards
--
Kevin Wamae

From: Eric Berger <ericjber...@gmail.com>
Date: Thursday, 22 February 2018 at 15:16
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: "R-help@r-project.org" <R-help@r-project.org>
Subject: Re: [R] alternative for multiple if_else statements

Hi Kevin,
I ran the code on the full data set and was able to reproduce the problem that you are facing. My guess is that you have an error in your intuition and/or logic, and that this relates to the use of the subscript [1]. Specifically, on the full dataset, the condition

trialData$date[trialData$survey_start == "Y" & trialData$year == 2013 & trialData$site == "site_1"]

yields 412 matches, of which there are 9 unique ones, specifically April 2,3,4,5,8,10,11,16,17.

In the full data set the first element that appears, i.e. subscript [1], is "2013-04-04". In the filtered data set the first element that appears is "2013-04-05".

I hope that is enough information for you to make further progress from here.

Best,
Eric

On Thu, Feb 22, 2018 at 1:28 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:

Dear Eric, wow, this seems to do the trick. But I have encountered a problem. I have tested it on the larger dataset and it seems to work on a filtered dataset but not on the whole dataset (attached). See below script..

#load packages
library(dplyr)
library(data.table)  # provides fread()

#load data
trialData <- fread("trialData.txt") %>%
  mutate(date = as.Date(date, "%d/%m/%Y"))

#create blank variable
trialData$survey_year <- rep(NA_character_, nrow(trialData))

#attempt 1 fails: code for survey
trialData$survey_year[trialData$date >= trialData$date[trialData$survey_start == "Y" & trialData$year == 2013 & trialData$site == "site_1"][1] &
                      trialData$date < trialData$date[trialData$month == 4 & trialData$year == 2014 & trialData$site == "site_1"][1]] <- "survey_2013"

#filter trialData
trialData <- trialData %>% filter(id == "id_786/3")

#attempt 2 works: code for survey
trialData$survey_year[trialData$date >= trialData$date[trialData$survey_start == "Y" & trialData$year == 2013 & trialData$site == "site_1"][1] &
                      trialData$date < trialData$date[trialData$month == 4 & trialData$year == 2014 & trialData$site == "site_1"][1]] <- "survey_2013"

From: Eric Berger <ericjber...@gmail.com>
Date: Thursday, 22 February 2018 at 13:05
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: "R-help@r-project.org" <R-help@r-project.org>
Subject: Re: [R] alternative for multiple if_else statements

Hi,

1. I think the reason that the different ordering leads to different results is because of the following: date[ some condition is true ][1] will give you an NA if there are no rows where 'some condition holds'. In the code that 'works' you don't have such a situation, but in the code that 'does not work' you presumably hit an NA before you get to the result that you really want.

2. I am not a big fan of your "nested if" layout. I think you could rewrite it more clearly - and without nesting - with something like

> trialData$survey_year <- rep(NA_character_, nrow(trialData))
> trialData$survey_year[ condition for survey_2007 ] <- "survey_2007"
> trialData$survey_year[ condition for survey_2008 ] <- "survey_2008"
> etc

HTH,
Eric

On Wed, Feb 21, 2018 at 10:33 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:

Hi, I am having trouble trying to figure out why if_else is behaving the way it is, it may be my code or the way the data is structured. [...]
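The two behaviours of the [1] subscript that Eric describes can be seen on a tiny made-up vector. The objects `dates` and `start` below are hypothetical; they stand in for `trialData$date` and `trialData$survey_start`.

```r
# Hypothetical stand-ins for trialData$date and trialData$survey_start
dates <- as.Date(c("2013-04-04", "2013-04-02", "2013-04-05"))
start <- c("Y", "Y", "N")

# [1] returns the first match in row order, which need not be the earliest date
first_in_row_order <- dates[start == "Y"][1]
earliest           <- min(dates[start == "Y"])

# When no row satisfies the condition, [1] of the empty result is NA
no_match <- dates[start == "X"][1]

# Non-nested assignment in the style Eric proposes, using the earliest start date
survey_year <- rep(NA_character_, length(dates))
survey_year[dates >= earliest & dates < as.Date("2014-04-01")] <- "survey_2013"

print(first_in_row_order)
print(earliest)
print(is.na(no_match))
```

Taking min() of the matching dates, or de-duplicating the start markers as Kevin eventually did, removes the dependence on row order.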
Re: [R] alternative for multiple if_else statements
Dear Ista,

thank you. Let me see how best I can implement this.

Regards
--
Kevin Wamae

On 22/02/2018, 16:58, "Ista Zahn" <istaz...@gmail.com> wrote:

I don't fully understand the logic you are trying to implement, but something along the lines of

foo <- cut(trialData$date,
           breaks = as.Date(c("2007-01-01", "2008-05-01", "2009-04-01",
                              "2010-05-01", "2011-05-01", "2012-04-01",
                              "2013-04-01", "2014-04-01", "2015-04-01",
                              "2016-03-01", "2017-01-01")))

might work.

Best,
Ista

On Wed, Feb 21, 2018 at 3:33 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
> Hi, I am having trouble trying to figure out why if_else is behaving the way it is, it may be my code or the way the data is structured.
> [...]
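Ista's cut() idea can be extended with a `labels` argument so that the result carries the survey_YYYY names directly. The sketch below uses a shortened set of break dates taken from Ista's list and a made-up `dates` vector; pairing each interval with the year of its starting break is an assumption about how the survey years are meant to be named.

```r
# Hypothetical visit dates (illustrative only)
dates <- as.Date(c("2013-04-01", "2014-03-31", "2014-04-01", "2015-06-15"))

# A subset of the break dates from Ista's suggestion
breaks <- as.Date(c("2013-04-01", "2014-04-01", "2015-04-01", "2016-03-01"))

# One label per interval, i.e. one fewer than the number of breaks
labels <- paste0("survey_", c(2013, 2014, 2015))

# right = FALSE makes each interval closed on the left: [start, next start)
survey_year <- cut(dates, breaks = breaks, labels = labels, right = FALSE)

print(data.frame(date = dates, survey_year = survey_year))
```

Dates that fall outside the first and last break come back as NA, which is a convenient way to spot rows that belong to no survey period.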
[R] alternative for multiple if_else statements
Hi,

I am having trouble figuring out why if_else is behaving the way it is. It may be my code or the way the data is structured.

Below is a snapshot of a database I am working on. It represents a longitudinal survey of study participants in a trial with weekly follow up.

The variable "survey_start" represents the start of the study-defined one year follow up (which we called "survey_year").

I am trying to populate all subsequent entries for each participant, per survey year, with the entry "survey" followed by an underscore and the respective year, e.g. survey_2014.

There are missing entries; for example, the participant represented here wasn't available at the start of the 2015 survey. Also, some participants don’t have complete one-year follow ups but I still need to include them.

I have written two versions of the code. The first one fails while the second works, the only difference being that I have reversed the order in which the entries are populated in the second (from 2007-2016 to 2016-2007) and removed the if_else statement for 2015. I also noticed that the second version, which spans the years 2007-2016 (less 2015), fails if a participant's entries start from 2010-2016.

Kindly assist in figuring this out...or better yet, suggest an alternative.
trialData <- structure(list(study = c("site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1"), studyno = c("child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1", "child_1"), date = structure(c(16078, 16085, 16092, 16098, 16104, 16115, 16121, 16129, 16135, 16140, 16146, 16156, 16162, 16168, 16177, 16185, 16191, 16195, 16203, 16210, 16217, 16225, 16234, 16237, 16246, 16253, 16262, 16269, 16278, 16283, 16288, 16297, 16304, 16311, 16319, 16326, 16332, 16337, 16346, 16353, 16360, 16366, 16370, 16381, 16384, 16395, 16399, 16407, 16415, 16422, 16444, 16452, 16454, 16467, 16474, 16477, 16484, 16490, 16501, 16508, 16514, 16520, 16529, 16533, 16539, 16550, 16556, 16564, 16566, 16578, 16582, 16593, 16599, 16604, 16613, 16620, 16623, 16635, 16636, 16654, 16660, 1, 16673, 16681, 16688, 16693, 16702, 16706, 16714, 16721, 16728, 16734, 16745, 16749, 16757, 16764, 16769, 16778, 16785, 16792, 16805, 16812, 16819, 16830, 16832, 16839, 16846, 16856, 16862, 16867, 16877, 16884, 16890, 16898, 16904, 16912, 16917, 16923, 16936, 16938, 16953, 16960, 16966, 16973, 16980), class = "Date"), year = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
Re: [R] Populate one data frame with values from another dataframe for rows that match
Pardon me, here’s @Eric’s solution…

myDF1$studyno <- as.character(myDF1$studyno)
myDF2$studyno <- as.character(myDF2$studyno)
myDF3 <- merge(myDF1, myDF2, by="studyno", all.x=TRUE ) %>%
  dplyr::mutate( pf_mcl = ifelse( is.na(pf_mcl.y), pf_mcl.x, pf_mcl.y ) ) %>%
  dplyr::select( studyno, date, pf_mcl )

Regards
------
Kevin Wamae

From: Kevin Wamae <kwa...@kemri-wellcome.org>
Date: Sunday, 15 October 2017 at 14:03
To: William Dunlap <wdun...@tibco.com>
Cc: Bert Gunter <bgunter.4...@gmail.com>, Rui Barradas <ruipbarra...@sapo.pt>, Eric Berger <ericjber...@gmail.com>, R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for rows that match

Dear @William, thanks for the feedback. I have tested it on the larger dataset and noticed that it created two variables, pf_raw and pf_curated. The output we were looking for, was one that takes the variable pf_mcl in curated dataset and replaces pf_mcl in matching rows within the raw dataset.

@Eric’s solution was able to achieve that. Nonetheless, we do appreciate your solution.

Regards
--
Kevin Wamae

From: William Dunlap <wdun...@tibco.com>
Date: Saturday, 14 October 2017 at 20:21
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: Bert Gunter <bgunter.4...@gmail.com>, Rui Barradas <ruipbarra...@sapo.pt>, R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for rows that match

Your example used one distinct studyno in DF1 and one distinct pf_mcl in DF2. I think that makes it hard to see what is going on, but maybe I completely misunderstand the problem. In any case, let's redefine myDF1 and myDF2. Note that myDF1 contains a studyno not in myDF2 and vice versa. [...]
Re: [R] Populate one data frame with values from another dataframe for rows that match
Dear @William,

thanks for the feedback. I have tested it on the larger dataset and noticed that it created two variables, pf_raw and pf_curated. The output we were looking for was one that takes the variable pf_mcl in the curated dataset and replaces pf_mcl in matching rows within the raw dataset.

@Eric’s solution was able to achieve that. Nonetheless, we do appreciate your solution.

Regards
------
Kevin Wamae

From: William Dunlap <wdun...@tibco.com>
Date: Saturday, 14 October 2017 at 20:21
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: Bert Gunter <bgunter.4...@gmail.com>, Rui Barradas <ruipbarra...@sapo.pt>, R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for rows that match

Your example used one distinct studyno in DF1 and one distinct pf_mcl in DF2. I think that makes it hard to see what is going on, but maybe I completely misunderstand the problem. In any case, let's redefine myDF1 and myDF2. Note that myDF1 contains a studyno not in myDF2 and vice versa.

myDF1 <- structure(list(studyno = c("J1000/9", "J895/7", "J931/6", "J666/6",
    "J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
    17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
    2L, 3L, 4L, 5L, NA_integer_), year = c(2016, 2016, 2016, 2016, 2016, 2016)),
    .Names = c("studyno", "date", "pf_mcl", "year"),
    row.names = c(NA, 6L), class = "data.frame")
myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
    "J609/1", "J941/3"), pf_mcl = c(101L, 102L, 103L, 104L, 105L, 106L)),
    .Names = c("studyno", "pf_mcl"), row.names = c(NA, 6L), class = "data.frame")
m <- merge(myDF1, myDF2, by="studyno", all.x=TRUE, all.y=FALSE,
    suffixes=c(".raw", ".curated"))

The results are:

> myDF1
  studyno       date pf_mcl year
1 J1000/9 2016-11-18     NA 2016
2  J895/7 2016-11-22      2 2016
3  J931/6 2016-11-30      3 2016
4  J666/6 2016-12-09      4 2016
5 J1000/9 2016-12-13      5 2016
6 J1000/9 2016-12-20     NA 2016
> myDF2
  studyno pf_mcl
1  J740/4    101
2 J1000/9    102
3  J895/7    103
4  J931/6    104
5  J609/1    105
6  J941/3    106
> m
  studyno       date pf_mcl.raw year pf_mcl.curated
1 J1000/9 2016-11-18         NA 2016            102
2 J1000/9 2016-12-13          5 2016            102
3 J1000/9 2016-12-20         NA 2016            102
4  J666/6 2016-12-09          4 2016             NA
5  J895/7 2016-11-22          2 2016            103
6  J931/6 2016-11-30          3 2016            104

Now your problem is to combine the columns pf_mcl.raw and pf_mcl.curated in the way you want. ifelse() may be useful for that.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Oct 13, 2017 at 10:48 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:

Dear @Bert Gunter, I tried merge and I faced many challenges. @Rui Barradas' solution is working.

From: Bert Gunter <bgunter.4...@gmail.com>
Date: Friday, 13 October 2017 at 22:44
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for rows that match

?merge

Bert

On Oct 13, 2017 12:09 PM, "Kevin Wamae" <kwa...@kemri-wellcome.org> wrote:

I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, where rows match based on column "studyno" but the solutions I have found so far don't seem to be giving me the desired output. [...]
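William leaves the last step, combining pf_mcl.raw and pf_mcl.curated, to the reader. A minimal sketch of that step with ifelse() is shown below, using the data frames from his message. The rule applied here (take the curated value when one exists, otherwise keep the raw value) is an assumption based on Kevin's description of the goal.

```r
# Data frames as redefined in William Dunlap's message
myDF1 <- data.frame(
  studyno = c("J1000/9", "J895/7", "J931/6", "J666/6", "J1000/9", "J1000/9"),
  date    = structure(c(17123, 17127, 17135, 17144, 17148, 17155), class = "Date"),
  pf_mcl  = c(NA, 2L, 3L, 4L, 5L, NA),
  year    = 2016,
  stringsAsFactors = FALSE
)
myDF2 <- data.frame(
  studyno = c("J740/4", "J1000/9", "J895/7", "J931/6", "J609/1", "J941/3"),
  pf_mcl  = c(101L, 102L, 103L, 104L, 105L, 106L),
  stringsAsFactors = FALSE
)

# Left join: every row of myDF1 is kept, matched or not
m <- merge(myDF1, myDF2, by = "studyno", all.x = TRUE, all.y = FALSE,
           suffixes = c(".raw", ".curated"))

# Assumed rule: prefer the curated value, fall back to the raw one
m$pf_mcl <- ifelse(is.na(m$pf_mcl.curated), m$pf_mcl.raw, m$pf_mcl.curated)

print(m[, c("studyno", "date", "pf_mcl")])
```

Note that merge() may reorder rows by the key column, so the output is not in the original row order of myDF1.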
Re: [R] Populate one data frame with values from another dataframe for rows that match
Dear @Eric<mailto:ericjber...@gmail.com>, thank you so very much for noticing that. When I tested @Rui<mailto:ruipbarra...@sapo.pt>’s solution, it was on a smaller dataset that had purely matching rows. I had considered including non-matching rows to evaluate what the alternative would be. Also, I hadn’t even tested it on the larger dataset. I have now and noticed that it went further to omit rows that did not match, just like you said. Your proposed solution works well. Much appreciated…I’ll get in touch in case I encounter any problems From: Eric Berger <ericjber...@gmail.com> Date: Saturday, 14 October 2017 at 12:43 To: Kevin Wamae <kwa...@kemri-wellcome.org> Cc: Rui Barradas <ruipbarra...@sapo.pt>, "r-help@r-project.org" <r-help@r-project.org> Subject: Re: [R] Populate one data frame with values from another dataframe for rows that match Hi Kevin, I think there are issues with Rui's proposed solution. For example, if there are rows in myDF1 which have a studyno which does not match any row in myDF2, then you will lose those rows. In your original request you said that you wanted to keep those rows. To demonstrate my point I need to modify your sample data. Specifically, I changed some studyno settings in myDF1, and also the entries of pf_mcl in myDF1. 
myDF1 <- structure(list(studyno = c("J1000/8", "J1000/9", "J1000/9", "J1000/9", "J1000/5", "J1000/6"), date = structure(c(17123, 17127, 17135, 17144, 17148, 17155), class = "Date"), pf_mcl = c(1:6 ), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno", "date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame") myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6", "J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno", "pf_mcl"), row.names = c(NA, 6L), class = "data.frame") #Rui's proposal gives the following result # studyno date year pf_mcl # 1 J1000/9 2016-11-22 2016 0 # 2 J1000/9 2016-11-30 2016 0 # 3 J1000/9 2016-12-09 2016 0 My proposal library(dplyr) myDF1$studyno <- as.character(myDF1$studyno) myDF2$studyno <- as.character(myDF2$studyno) myDF3 <- merge(myDF1, myDF2, by="studyno", all.x=TRUE ) %>% dplyr::mutate( pf_mcl = ifelse( is.na<http://is.na>(pf_mcl.y), pf_mcl.x, pf_mcl.y ) ) %>% dplyr::select( studyno, date, pf_mcl ) # The results of this approach # studyno date pf_mcl # 1 J1000/5 2016-12-13 5 # 2 J1000/6 2016-12-20 6 # 3 J1000/8 2016-11-18 1 # 4 J1000/9 2016-11-22 0 # 5 J1000/9 2016-11-30 0 # 6 J1000/9 2016-12-09 0 Comparing the two results you see that no rows have been dropped in my approach. HTH, Eric On Sat, Oct 14, 2017 at 8:49 AM, Kevin Wamae <kwa...@kemri-wellcome.org<mailto:kwa...@kemri-wellcome.org>> wrote: Dear @Rui Barradas, thank you for the solution. It works perfectly. On 13/10/2017, 23:35, "Rui Barradas" <ruipbarra...@sapo.pt<mailto:ruipbarra...@sapo.pt>> wrote: Hello, Try the following. 

myDF1$studyno <- as.character(myDF1$studyno)
myDF2$studyno <- as.character(myDF2$studyno)
i1 <- which(names(myDF1) == "pf_mcl")
merge(myDF1[-i1], myDF2, by = "studyno")

Hope this helps,

Rui Barradas

On 13-10-2017 20:09, Kevin Wamae wrote:
> I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, where rows match based on column "studyno" but the solutions I have found so far don't seem to be giving me the desired output.
>
> Below is a snapshot of the data.frames.
>
> myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", "J1000/9",
> "J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
> 17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
> ), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
> "date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
> "J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno",
> "pf_mcl"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 is a well curated subset of myDF1. Some rows in the two datasets match based on &
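
The difference Eric describes comes from merge() defaulting to an inner join (all = FALSE). The following is a minimal base R sketch of that comparison, using Eric's modified sample data; the object names inner and left are made up for illustration and do not come from the thread.

```r
# Eric's modified sample data (only the columns needed here)
myDF1 <- data.frame(
  studyno = c("J1000/8", "J1000/9", "J1000/9", "J1000/9", "J1000/5", "J1000/6"),
  date    = as.Date(c(17123, 17127, 17135, 17144, 17148, 17155), origin = "1970-01-01"),
  pf_mcl  = 1:6,
  stringsAsFactors = FALSE
)
myDF2 <- data.frame(
  studyno = c("J740/4", "J1000/9", "J895/7", "J931/6", "J609/1", "J941/3"),
  pf_mcl  = rep(0L, 6),
  stringsAsFactors = FALSE
)

# Inner join (merge's default): rows of myDF1 without a match are dropped
i1    <- which(names(myDF1) == "pf_mcl")
inner <- merge(myDF1[-i1], myDF2, by = "studyno")

# Left join: every row of myDF1 is kept; unmatched rows get NA in pf_mcl.y
left <- merge(myDF1, myDF2, by = "studyno", all.x = TRUE)

# Take the value from myDF2 where there is one, otherwise keep the original
left$pf_mcl <- ifelse(is.na(left$pf_mcl.y), left$pf_mcl.x, left$pf_mcl.y)

nrow(inner)  # rows surviving the inner join
nrow(left)   # rows surviving the left join
```

Note that merge() sorts the result by the key column, so the row order of myDF1 is not preserved in either result.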
Re: [R] Populate one data frame with values from another dataframe for rows that match
Dear @Rui Barradas, thank you for the solution. It works perfectly.

On 13/10/2017, 23:35, "Rui Barradas" <ruipbarra...@sapo.pt> wrote:

Hello,

Try the following.

myDF1$studyno <- as.character(myDF1$studyno)
myDF2$studyno <- as.character(myDF2$studyno)
i1 <- which(names(myDF1) == "pf_mcl")
merge(myDF1[-i1], myDF2, by = "studyno")

Hope this helps,

Rui Barradas

On 13-10-2017 20:09, Kevin Wamae wrote:
> I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, where rows match based on column "studyno" but the solutions I have found so far don't seem to be giving me the desired output.
>
> Below is a snapshot of the data.frames.
>
> myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", "J1000/9",
> "J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
> 17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
> ), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
> "date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
> "J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno",
> "pf_mcl"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 is a well curated subset of myDF1. Some rows in the two datasets match based on "studyno", one may find that values are missing in myDF1$pf_mcl or the values are wrong.
>
> All I want to do is identify a matching row in myDF2 and populate myDF1$pf_mcl with the value in myDF2$pf_mcl. If a row does not match based on “studyno”, the value should remain the same.
>
> It's probably worth mentioning, the two data frames have other columns...I have selected a few for example purposes.

__
This e-mail contains information which is confidential. It is intended only for the use of the named recipient. If you have received this e-mail in error, please let us know by replying to the sender, and immediately delete it from your system. Please note, that in these circumstances, the use, disclosure, distribution or copying of this information is strictly prohibited. KEMRI-Wellcome Trust Programme cannot accept any responsibility for the accuracy or completeness of this message as it has been transmitted over a public network. Although the Programme has taken reasonable precautions to ensure no viruses are present in emails, it cannot accept responsibility for any loss or damage arising from the use of the email or attachments. Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of KEMRI-Wellcome Trust Programme.
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Populate one data frame with values from another dataframe for rows that match
Dear @Bert Gunter, I tried merge and I faced many challenges. @Rui Barradas's solution is working.

From: Bert Gunter <bgunter.4...@gmail.com>
Date: Friday, 13 October 2017 at 22:44
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for rows that match

?merge

Bert

On Oct 13, 2017 12:09 PM, "Kevin Wamae" <kwa...@kemri-wellcome.org> wrote:

I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, where rows match based on column "studyno" but the solutions I have found so far don't seem to be giving me the desired output.

Below is a snapshot of the data.frames.

myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", "J1000/9",
"J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
"date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")

myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
"J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno",
"pf_mcl"), row.names = c(NA, 6L), class = "data.frame")

myDF2 is a well curated subset of myDF1. Some rows in the two datasets match based on "studyno", one may find that values are missing in myDF1$pf_mcl or the values are wrong.

All I want to do is identify a matching row in myDF2 and populate myDF1$pf_mcl with the value in myDF2$pf_mcl. If a row does not match based on “studyno”, the value should remain the same.

It's probably worth mentioning, the two data frames have other columns...I have selected a few for example purposes.
[R] Populate one data frame with values from another dataframe for rows that match
I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, where rows match based on column "studyno", but the solutions I have found so far don't seem to be giving me the desired output.

Below is a snapshot of the data.frames.

myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", "J1000/9",
"J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
"date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")

myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
"J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno",
"pf_mcl"), row.names = c(NA, 6L), class = "data.frame")

myDF2 is a well-curated subset of myDF1. Some rows in the two datasets match based on "studyno"; for those rows, one may find that values are missing in myDF1$pf_mcl or that the values are wrong.

All I want to do is identify a matching row in myDF2 and populate myDF1$pf_mcl with the value in myDF2$pf_mcl. If a row does not match based on “studyno”, the value should remain the same.

It's probably worth mentioning that the two data frames have other columns. I have selected a few for example purposes.
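
One way to read the request above ("populate where studyno matches, otherwise leave the value alone") is as an in-place lookup rather than a join. The sketch below uses match() on a small made-up data set; it assumes each studyno appears at most once in myDF2, since match() only returns the first hit. The helper names idx and hit are hypothetical.

```r
# Small made-up data in the shape of the post (not the poster's real data)
myDF1 <- data.frame(
  studyno = c("J1000/8", "J1000/9", "J1000/9", "J1000/9", "J1000/5", "J1000/6"),
  pf_mcl  = 1:6,
  stringsAsFactors = FALSE
)
myDF2 <- data.frame(
  studyno = c("J740/4", "J1000/9", "J895/7"),
  pf_mcl  = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Position of each myDF1 studyno in myDF2 (NA where there is no match)
idx <- match(myDF1$studyno, myDF2$studyno)
hit <- !is.na(idx)

# Overwrite only the matched rows; unmatched rows keep their original value
myDF1$pf_mcl[hit] <- myDF2$pf_mcl[idx[hit]]

myDF1
```

Because nothing is merged, the row order and all other columns of myDF1 stay exactly as they were.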
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Bert,

The “status” at the end of the study does exist in the original dataset; what was missing was the time between events. There are also many events between the first and last day that are to be explored in this work.

The suggestion I received then was to compute the time between the initial date for each individual and all subsequent events, up to the last day of the study. The rationale is that once I have that column of differences in days, I can use it for any other calculations that arise.

Let me try your suggested script and see how that goes. Highly appreciated.

Regards
---
Kevin Wame

On 7/4/16, 9:32 AM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:

A Kaplan-Meier plot requires for each individual (in each treatment group, if there are more than one):

1. Survival time, which in your case appears to mean time without disease;

2. Status at end of time on study: whether the individual was censored (still without disease) or died (in your case, was diseased) on the last date they are seen in the study.

AFAICT, the 2nd piece of information is not present in your data; if this is so, then you cannot do the K-M plot or, indeed, any survival analysis. That is, you can quit the analysis right now.

If you have the status, where is it? If, for example, the last date for each individual is the date at which disease is first seen, then you can simply convert the date column to the Date class with ?as.Date (the year and month columns appear to be useless as they repeat info already available in the date columns), and then:

survtimes_byID <- with(datasetname, tapply(date, ID, function(x) diff(range(x))))

will give you a list of survival times (in days) by ID. See ?with, ?tapply for details.

If the status info is in some other form, then this advice should be ignored of course and you have to incorporate it into your data in some other way.
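
Bert's one-liner can be tried on a toy data set. The sketch below is only an illustration: the data frame visits and its values are made up, and as.numeric(..., units = "days") is added here so that the result is a plain numeric vector rather than relying on how tapply simplifies difftime values.

```r
# Hypothetical visit records: one row per visit, dates already of class Date
visits <- data.frame(
  ID   = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2007-05-11", "2007-05-16", "2008-02-04",
                   "2007-05-15", "2007-08-15")),
  stringsAsFactors = FALSE
)

# Days between the first and last visit of each individual
survtimes_byID <- with(
  visits,
  tapply(date, ID, function(x) as.numeric(diff(range(x)), units = "days"))
)

survtimes_byID
```

This only gives the time component; as Bert points out, a status column (event or censored) is still needed before any survival analysis.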

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sun, Jul 3, 2016 at 2:43 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> There are a great many hits when I search on the keywords "kaplan meier plot R"... so my first reaction is that you should be referring to some of the existing packages for doing this type of analysis. I do not do this type of analysis normally, so am probably not your best helper... perhaps someone else will chime in if you show that you have read some existing KM examples.
>
> My second reaction is that if you want to avoid losing records you should also avoid adding records. Your example extends from the first matching date to and including the next matching date, which conflicts with analysis of successive treatment periods. You may have a good reason for doing this, but in my experience this is usually a mistake.
>
> Finally, I think you should more closely study the use of the ave function that I already used if you want to work with the data in its original form. It should not be too difficult to generate your diff_days column using ave if you have the admin_period column that I showed you how to make.
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>>Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the risk of developing disease in the treated vs the non-treated individuals. I therefore figured it might be easier to compute dates first as any further analysis will be based on time, in this case days. I keep getting recommendations on how to tweak my analysis and keeps coming down to dates between the start of drug administration and the end of it.
>>
>>Can you suggest an “easier” way to go about this..
>>
>>Regards
>>---
>>Kevin Wame
>>
>>
>>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:
>>
>>I haven't followed this thread closely, but if it's not too late, I might suggest that you stop worrying about how you want your data frame to look and start worrying about how you want to display/analyze your data. As Jeff suggested, you and your supervisor are probably being driven by paradigms from Excel, SPSS, or whatever that are simply unnecessary for R. My guess would be that if you explained the sort of analyses/plots you wish to do, you will find it can be done fairly directly from your existing data. At the very least it would give Jeff and
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Jeff, thanks and I will explore your suggestions too.

Regards
---
Kevin Wame

On 7/4/16, 12:43 AM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

There are a great many hits when I search on the keywords "kaplan meier plot R"... so my first reaction is that you should be referring to some of the existing packages for doing this type of analysis. I do not do this type of analysis normally, so am probably not your best helper... perhaps someone else will chime in if you show that you have read some existing KM examples.

My second reaction is that if you want to avoid losing records you should also avoid adding records. Your example extends from the first matching date to and including the next matching date, which conflicts with analysis of successive treatment periods. You may have a good reason for doing this, but in my experience this is usually a mistake.

Finally, I think you should more closely study the use of the ave function that I already used if you want to work with the data in its original form. It should not be too difficult to generate your diff_days column using ave if you have the admin_period column that I showed you how to make.
--
Sent from my phone. Please excuse my brevity.

On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the risk of developing disease in the treated vs the non-treated individuals. I therefore figured it might be easier to compute dates first as any further analysis will be based on time, in this case days. I keep getting recommendations on how to tweak my analysis and keeps coming down to dates between the start of drug administration and the end of it.
>
>Can you suggest an “easier” way to go about this..
>
>Regards
>---
>Kevin Wame
>
>
>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:
>
>I haven't followed this thread closely, but if it's not too late, I might suggest that you stop worrying about how you want your data frame to look and start worrying about how you want to display/analyze your data. As Jeff suggested, you and your supervisor are probably being driven by paradigms from Excel, SPSS, or whatever that are simply unnecessary for R. My guess would be that if you explained the sort of analyses/plots you wish to do, you will find it can be done fairly directly from your existing data. At the very least it would give Jeff and other helpeRs a better idea of what you might need rather than what you and your supervisor think you need.
>
>
>Cheers,
>Bert
>
>
>Bert Gunter
>
>"The trouble with having an open mind is that people keep coming along
>and sticking things into it."
>-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>> Hi Jeff, It works well on a dataset with 10 rows and I figure it will work well with the “real” dataset. You’ve been of great help and I am starting to make headway.
>>
>> It creates a new dataframe (result), as shown below that doesn’t quite have the result as I would want it.
>>
>> ID      admin_period  start     end       ddays
>> J1/3    1             5/11/07   8/13/07   94
>> J1/3    2             8/13/07   11/12/07  91
>> J1/3    3             11/12/07  2/4/08    84
>> J1/3    4             2/4/08    5/5/08    91
>> J1/3    5             5/5/08    5/4/09    364
>> J1/3    6             5/4/09    5/17/10   378
>> J1/3    7             5/17/10   5/16/11   364
>> J10/1   1             5/11/07   8/13/07   94
>> J10/1   2             8/13/07   11/12/07  91
>> J10/1   3             11/12/07  2/4/08    84
>> J10/1   4             2/4/08    5/5/08    91
>> J10/1   5             5/5/08    5/8/09    368
>> J10/1   6             5/8/09    5/17/10   374
>> J10/1   7             5/17/10   5/16/11   364
>> J102/1  1             5/15/07   8/15/07   92
>> J102/1  2             8/15/07   11/13/07  90
>> J102/1  3             11/13/07  2/5/08    84
>> J102/1  4             2/5/08    5/6/08    91
>> J102/1  5             5/6/08    5/5/09    364
>> J102/1  6             5/5/09    5/19/10   379
>>
>> My supervisor doesn’t want me to create a new dataset, she’s afraid I might lose some data…I cannot fight that.
>>
>> Like you mentioned earlier, I might be mixing up things which I think is what you alluded t
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the risk of developing disease in the treated vs the non-treated individuals. I therefore figured it might be easier to compute dates first, as any further analysis will be based on time, in this case days. I keep getting recommendations on how to tweak my analysis and it keeps coming down to dates between the start of drug administration and the end of it.

Can you suggest an “easier” way to go about this?

Regards
---
Kevin Wame

On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:

I haven't followed this thread closely, but if it's not too late, I might suggest that you stop worrying about how you want your data frame to look and start worrying about how you want to display/analyze your data. As Jeff suggested, you and your supervisor are probably being driven by paradigms from Excel, SPSS, or whatever that are simply unnecessary for R. My guess would be that if you explained the sort of analyses/plots you wish to do, you will find it can be done fairly directly from your existing data. At the very least it would give Jeff and other helpeRs a better idea of what you might need rather than what you and your supervisor think you need.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
> Hi Jeff, It works well on a dataset with 10 rows and I figure it will work well with the “real” dataset. You’ve been of great help and I am starting to make headway.
>
> It creates a new dataframe (result), as shown below that doesn’t quite have the result as I would want it.
>
> ID      admin_period  start     end       ddays
> J1/3    1             5/11/07   8/13/07   94
> J1/3    2             8/13/07   11/12/07  91
> J1/3    3             11/12/07  2/4/08    84
> J1/3    4             2/4/08    5/5/08    91
> J1/3    5             5/5/08    5/4/09    364
> J1/3    6             5/4/09    5/17/10   378
> J1/3    7             5/17/10   5/16/11   364
> J10/1   1             5/11/07   8/13/07   94
> J10/1   2             8/13/07   11/12/07  91
> J10/1   3             11/12/07  2/4/08    84
> J10/1   4             2/4/08    5/5/08    91
> J10/1   5             5/5/08    5/8/09    368
> J10/1   6             5/8/09    5/17/10   374
> J10/1   7             5/17/10   5/16/11   364
> J102/1  1             5/15/07   8/15/07   92
> J102/1  2             8/15/07   11/13/07  90
> J102/1  3             11/13/07  2/5/08    84
> J102/1  4             2/5/08    5/6/08    91
> J102/1  5             5/6/08    5/5/09    364
> J102/1  6             5/5/09    5/19/10   379
>
> My supervisor doesn’t want me to create a new dataset, she’s afraid I might lose some data…I cannot fight that.
>
> Like you mentioned earlier, I might be mixing up things which I think is what you alluded to earlier.
>
> After consultation with my supervisor, this is what we’ve agreed. For every individual, given the start and end date, create a new column (say, diff_days) and for every row that falls within the range of start and end_date, get the difference between the date in that row and start date and add it to the diff_days column. Below is an example of the result. As it can be seen 5/11/2007 is the start while 2/4/2008 is the end. The diff_days has been populated excluding the end date and that is because that is the start of the study in 2008 that will continue into 2009 and thus from 2/4/2008, I should compute diff_days till 2009 and so on (I hope this makes sense).
>
> ID    date       drug_admin  year  month  diff_days
> R1/3  5/11/2007  Y           2007  5      0
> R1/3  5/16/2007              2007  5      6
> R1/3  5/22/2007              2007  5      11
> R1/3  5/28/2007              2007  5      17
> R1/3  1/14/2008              2008  1      248
> R1/3  1/21/2008              2008  1      255
> R1/3  1/28/2008              2008  1      263
> R1/3  2/4/2008   Y           2008  2
>
>
> Regards
> ---
> Kevin Wame
>
>
> On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:
>
> Typo on the second line
>
> result <- ( result0
> %>% select( -admin_period1 )
> %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
>
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Jeff, It works well on a dataset with 10 rows and I figure it will work well with the “real” dataset. You’ve been of great help and I am starting to make headway.

It creates a new dataframe (result), as shown below, that doesn’t quite have the result as I would want it.

ID      admin_period  start     end       ddays
J1/3    1             5/11/07   8/13/07   94
J1/3    2             8/13/07   11/12/07  91
J1/3    3             11/12/07  2/4/08    84
J1/3    4             2/4/08    5/5/08    91
J1/3    5             5/5/08    5/4/09    364
J1/3    6             5/4/09    5/17/10   378
J1/3    7             5/17/10   5/16/11   364
J10/1   1             5/11/07   8/13/07   94
J10/1   2             8/13/07   11/12/07  91
J10/1   3             11/12/07  2/4/08    84
J10/1   4             2/4/08    5/5/08    91
J10/1   5             5/5/08    5/8/09    368
J10/1   6             5/8/09    5/17/10   374
J10/1   7             5/17/10   5/16/11   364
J102/1  1             5/15/07   8/15/07   92
J102/1  2             8/15/07   11/13/07  90
J102/1  3             11/13/07  2/5/08    84
J102/1  4             2/5/08    5/6/08    91
J102/1  5             5/6/08    5/5/09    364
J102/1  6             5/5/09    5/19/10   379

My supervisor doesn’t want me to create a new dataset; she’s afraid I might lose some data. I cannot fight that.

I might be mixing things up, which I think is what you alluded to earlier.

After consultation with my supervisor, this is what we’ve agreed. For every individual, given the start and end date, create a new column (say, diff_days), and for every row that falls within the range of start and end_date, get the difference between the date in that row and the start date and add it to the diff_days column. Below is an example of the result. As can be seen, 5/11/2007 is the start while 2/4/2008 is the end. diff_days has been populated excluding the end date, because that date is the start of the study in 2008 that will continue into 2009; thus from 2/4/2008 I should compute diff_days till 2009, and so on (I hope this makes sense).

ID    date       drug_admin  year  month  diff_days
R1/3  5/11/2007  Y           2007  5      0
R1/3  5/16/2007              2007  5      6
R1/3  5/22/2007              2007  5      11
R1/3  5/28/2007              2007  5      17
R1/3  1/14/2008              2008  1      248
R1/3  1/21/2008              2008  1      255
R1/3  1/28/2008              2008  1      263
R1/3  2/4/2008   Y           2008  2

Regards
---
Kevin Wame

On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

Typo on the second line

result <- ( result0
          %>% select( -admin_period1 )
          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
                        , by = c( ID="ID", admin_period ="admin_period1" ) )
          %>% mutate( ddays = end - start )
          )
--
Sent from my phone. Please excuse my brevity.

On July 3, 2016 11:55:14 AM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>Hi Jeff, “likes its Excel”, I don’t follow. Pardon me for any mix up.
>
>Thanks for the code. After running it, this is the error I get.
>
>Error: cannot join on columns 'admin_period' x 'admin_period1': index out of bounds
>
>Regards
>---
>Kevin Wame | Ph.D. Student (IDeAL)
>KEMRI-Wellcome Trust Collaborative Research Programme
>Centre for Geographic Medicine Research
>P.O. Box 230-80108, Kilifi, Kenya
>
>
>On 7/3/16, 9:34 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:
>
>I still get the impression from your mixing of information types that you are thinking like this is Excel.
>
>Perhaps something like
>
>drug_study$admin_period <- ave( "Y" == drug_study$drug_admin, drug_study$ID, FUN=cumsum )
>library(dplyr)
>result0 <- ( drug_study
>           %>% filter( 0 != admin_period )
>           %>% group_by( ID, admin_period )
>           %>% summarise( start = min( date ) )
>           %>% mutate( admin_period1 = admin_period -1 )
>           )
>result <- ( result0
>          %>% select( -admin_period )
>          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
>                        , by = c( ID="ID", admin_period ="admin_period1" ) )
>          %>% mutate( ddays = end - start )
>          )
>--
>Sent from my phone. Please excuse my brevity.
>
>On July 3, 2016 10:24:51 AM PDT, Kevin Wamae
><kwa...@kemr
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Thanks Jeff, let me try it on the larger dataset.

Regards
---
Kevin Wame

On 7/3/16, 10:09 PM, "Jeff Newmiller" wrote:

result <- ( result0
          %>% select( -admin_period1 )
          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
                        , by = c( ID="ID", admin_period ="admin_period1" ) )
          %>% mutate( ddays = end - start )
          )
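
The idea behind Jeff's self-join is that the end of one administration period is the start of the next one for the same ID. The following is a base R sketch of that idea on made-up data, for readers without dplyr; it is not Jeff's code. It assumes rows are already sorted by date within each ID and that every "Y" row opens a new period. The names starts, nxt and the sample values are hypothetical.

```r
# Hypothetical sample in the shape described in the thread
drug_study <- data.frame(
  ID         = rep(c("R1/3", "J1/3"), each = 4),
  date       = as.Date(c("2007-05-11", "2007-05-16", "2008-02-04", "2008-02-11",
                         "2007-05-11", "2007-08-13", "2007-08-20", "2007-11-12")),
  drug_admin = c("Y", "", "Y", "", "Y", "Y", "", "Y"),
  stringsAsFactors = FALSE
)

# Jeff's first step: running count of "Y" rows within each ID
drug_study$admin_period <- ave("Y" == drug_study$drug_admin, drug_study$ID, FUN = cumsum)

# One row per period: the "Y" row is taken as the period start (assumption)
starts <- drug_study[drug_study$drug_admin == "Y", c("ID", "admin_period", "date")]
names(starts)[3] <- "start"
starts <- starts[order(starts$ID, starts$admin_period), ]

# Look up, for each period, the row holding the next period of the same ID
nxt <- match(paste(starts$ID, starts$admin_period + 1),
             paste(starts$ID, starts$admin_period))
starts$end <- starts$start[nxt]

# Keep periods that have a following period, as the inner join does
result <- starts[!is.na(nxt), ]
result$ddays <- as.numeric(result$end - result$start, units = "days")

result
```

As with the inner join, the last period of each individual is dropped because it has no following start date to serve as its end.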
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
HI Jeff, it’s been an uphill task working with the dataset and I am not the first to complain. Nonetheless, data-cleaning is ongoing and since I cannot wait for that to get done, I decided to make the most of what the dataset looks like at this time. It appears the process may take a while. Thanks for the script. From the output, I noticed that “result” contains the first and last date for each of the individuals and not taking into account the variable “drug-admin”. ID start end J1/31/5/09 12/25/10 R1/31/4/07 12/15/08 R10/1 1/4/07 3/5/12 My aim is to pick the date, for example in 2007, where drug-admin == “Y” as my start and the date in the subsequent year (2008 in this case) where drug-admin == “Y” as my end. Then, I should populate the variable “study_id” with “start” up to the entry just above the one whose date matches “end”, as the output below shows (I hope its structure is maintained as I have copied it from R-Studio). The goal for now is to then get difference in days between “date” and “study_id” and still get to keep that column for “study_id” as I might use it later. From the output, it can be seen that for this individual, the dates run from 2007 to 2008. However, for some individuals, the dates run from 2008-2009, 2009-2010 and so on. 
Therefore, I need to make the script deal with all the years as the dates range from 2001-2016 ID datedrug_admin yearmonth study_id R1/35/11/07 Y 20075 5/11/07 R1/35/16/07 20075 5/11/07 R1/35/22/07 20075 5/11/07 R1/35/28/07 20075 5/11/07 R1/36/5/07 20076 5/11/07 R1/36/11/07 20076 5/11/07 R1/36/18/07 20076 5/11/07 R1/36/25/07 20076 5/11/07 R1/37/2/07 20077 5/11/07 R1/37/16/07 20077 5/11/07 R1/37/29/07 20077 5/11/07 R1/38/2/07 20078 5/11/07 R1/38/7/07 20078 5/11/07 R1/38/13/07 20078 5/11/07 R1/39/18/07 20079 5/11/07 R1/39/24/07 20079 5/11/07 R1/310/6/07 200710 5/11/07 R1/310/8/07 200710 5/11/07 R1/310/15/07200710 5/11/07 R1/310/22/07200710 5/11/07 R1/310/29/07200710 5/11/07 R1/311/8/07 200711 5/11/07 R1/311/12/07200711 5/11/07 R1/311/19/07200711 5/11/07 R1/311/29/07200711 5/11/07 R1/312/6/07 200712 5/11/07 R1/312/10/07200712 5/11/07 R1/312/21/07200712 5/11/07 R1/31/7/08 20081 5/11/07 R1/31/14/08 20081 5/11/07 R1/31/21/08 20081 5/11/07 R1/31/28/08 20081 5/11/07 R1/32/4/08 Y 20082 Regards --- Kevin Wame ### ### On 7/3/16, 7:05 PM, "Jeff Newmiller"wrote: result <- setNames( data.frame( aggregate( date~ID, data=drug_study, FUN=min ), aggregate( date~ID, data=drug_study, FUN=max )[2] ), c( "ID", "start", "end" ) ) __ This e-mail contains information which is confidential. It is intended only for the use of the named recipient. If you have received this e-mail in error, please let us know by replying to the sender, and immediately delete it from your system. Please note, that in these circumstances, the use, disclosure, distribution or copying of this information is strictly prohibited. KEMRI-Wellcome Trust Programme cannot accept any responsibility for the accuracy or completeness of this message as it has been transmitted over a public network. Although the Programme has taken reasonable precautions to ensure no viruses are present in emails, it cannot accept responsibility for any loss or damage arising from the use of the email or attachments. 
Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of KEMRI-Wellcome Trust Programme. __ __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented,
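The request in the message above (use only the dates where drug-admin is "Y" as start and end) can probably be met by subsetting before aggregating. The following is only a sketch built on the aggregate() call quoted in the thread. The toy data frame is made up, and it assumes a single treatment period per individual, which the real data (2001-2016) may not satisfy.

```r
# Toy data in the shape described in the thread (all values are invented).
drug_study <- data.frame(
  ID = c("R1/3", "R1/3", "R1/3", "R1/3", "J1/3", "J1/3"),
  date = as.Date(c("2007-05-11", "2007-05-16", "2008-01-28", "2008-02-04",
                   "2009-01-05", "2009-01-12")),
  drug_admin = c("Y", "", "", "Y", "", ""),
  stringsAsFactors = FALSE
)

# Keep only the rows where a drug was administered, then take the earliest
# and latest of those dates per individual.
# Assumption: one start and one end per individual.
admin_rows <- drug_study[drug_study$drug_admin == "Y", ]
result <- setNames(
  data.frame(
    aggregate(date ~ ID, data = admin_rows, FUN = min),
    aggregate(date ~ ID, data = admin_rows, FUN = max)[2]
  ),
  c("ID", "start", "end")
)
print(result)
```

Individuals with no "Y" rows (J1/3 in the toy data) simply do not appear in result, which makes them easy to list and inspect separately.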
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Jeff, pardon me, I was surely not making it easy. I hope this time I will ☺

Attached is a snippet of the dataset in csv format, and below is the R script I have managed so far.

---
drug_study <- read.csv("drug_study.csv", header = T); head(drug_study)
drug_study$date <- as.Date(drug_study$date, "%m/%d/%Y")
drug_study$study_id <- "" #create new column
individual <- unique (drug_study$ID) #vector of individuals
datalength <- dim(drug_study)[1] #number of rows in dataframe

for (i in 1:length(individual)) {
  for (j in 1:datalength) {
    start_admin <- drug_study[c(drug_study$ID == individual[i] & drug_study$year == 2007 & drug_study$drug_admin == "Y" & drug_study$month == 5),2] #capture date of start
    end_admin <- drug_study[(drug_study$ID == individual[i] & drug_study$year == 2008 & drug_study$drug_admin == "Y" & drug_study$month == 2),2] #capture date of end
    if(drug_study[j,1] == individual[i] & drug_study[j,2] >= start_admin & drug_study[j,2] < end_admin) {
      drug_study[j,6] <- paste(start_admin) #populate respective row if condition is met
    }
  }
}
---

In this dataset there are three individuals: J1/3, R1/3 and R10/1. The script works for the last two individuals but not for J1/3, where it stops with the error below:

~ ~
Error in if (drug_study[j, 1] == individual[i] & drug_study[j, 2] >= start_admin & : argument is of length zero
~ ~

I figured it’s because this individual’s start_admin and end_admin dates aren’t captured because the if statement fails.

There’s my first problem: there are thousands of individuals with varying start_admin and end_admin dates, and I need a script that captures these for every individual.

Secondly, the above script takes almost an hour to run on the entire dataset, and that is just for the individuals whose start_admin and end_admin dates can be captured by the if statement. I need help in coming up with a script that tackles the problem, takes into account the different start_admin and end_admin dates, and is economical with regard to time.
Regards
---
Kevin Kariuki

On 7/3/16, 8:42 AM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

You are making this hard on yourself by not paying attention to the Posting Guide listed in the footer of every email on this list. You would probably also find [1] helpful.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
--
Sent from my phone. Please excuse my brevity.

On July 2, 2016 3:41:07 PM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>Hi Jeff, sorry for referring to you as Jennifer earlier, accept my
>apologies.
>
>I attached a sample dataset in the question, am afraid it must have
>failed to attach.
>
>I have attached it again..
>
>Regards
>---
>Kevin Kariuki
>
>On 7/2/16, 7:37 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:
>
>I can understand you not wanting to supply your actual data online, but
>only you know what your data looks like so only you can create a
>simulated data set that we could show you how to work with.
>--
>Sent from my phone. Please excuse my brevity.
>
>On July 2, 2016 2:57:39 AM PDT, Kevin Wamae <kwa...@kemri-wellcome.org>
>wrote:
>>I have a drug-trial study dataset (attached image).
>>
>>Since its a large and complex dataset (at least to me) and I hope to be
>>as clear as possible with my question.
>>The dataset is from a study where individuals are given drugs and
>>followed up over a period spanning two
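On the "argument is of length zero" error reported in this message: `if` needs a condition of length one, and when an individual has no row matching the hard-coded year, month and "Y" filter, start_admin is an empty vector, so the comparison is empty too. A possible vectorised alternative is sketched below. It is not taken from the thread: the toy data, the helper names (first_or_na, last_or_na) and the days_since_start column are illustrative assumptions, and it again assumes one treatment period per individual.

```r
# Toy data in the shape described in the thread (all values are invented).
drug_study <- data.frame(
  ID = c("R1/3", "R1/3", "R1/3", "R1/3", "J1/3", "J1/3"),
  date = as.Date(c("2007-05-11", "2007-05-16", "2008-01-28", "2008-02-04",
                   "2009-01-05", "2009-01-12")),
  drug_admin = c("Y", "", "", "Y", "", ""),
  stringsAsFactors = FALSE
)

# Administration dates as numbers (days since 1970-01-01), NA on other rows.
is_admin  <- drug_study$drug_admin == "Y"
admin_num <- ifelse(is_admin, as.numeric(drug_study$date), NA_real_)

# Hypothetical helpers: return NA instead of failing when an individual
# has no administration date at all.
first_or_na <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
last_or_na  <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

# ave() returns one value per row, aligned with the data frame, so no
# loop over individuals or rows is needed.
start_num <- ave(admin_num, drug_study$ID, FUN = first_or_na)
end_num   <- ave(admin_num, drug_study$ID, FUN = last_or_na)

# Rows from the start date up to, but excluding, the end date.
date_num  <- as.numeric(drug_study$date)
in_window <- !is.na(start_num) & start_num < end_num &
  date_num >= start_num & date_num < end_num

drug_study$study_id <- as.Date(NA)
drug_study$study_id[in_window] <-
  as.Date(start_num[in_window], origin = "1970-01-01")

# Days elapsed since the start of administration (NA outside the window).
drug_study$days_since_start <-
  as.numeric(drug_study$date - drug_study$study_id)
print(drug_study)
```

Because every step works on whole columns, the run time should grow roughly with the number of rows rather than with rows times individuals, which is where the nested loops spend their hour.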
Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Jeff, sorry for referring to you as Jennifer earlier; please accept my apologies.

I attached a sample dataset to the question, but I am afraid it must have failed to attach. I have attached it again.

Regards
---
Kevin Kariuki

On 7/2/16, 7:37 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

I can understand you not wanting to supply your actual data online, but only you know what your data looks like so only you can create a simulated data set that we could show you how to work with.
--
Sent from my phone. Please excuse my brevity.

On July 2, 2016 2:57:39 AM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>I have a drug-trial study dataset (attached image).
>
>Since its a large and complex dataset (at least to me) and I hope to be
>as clear as possible with my question.
>The dataset is from a study where individuals are given drugs and
>followed up over a period spanning two consecutive years. Individuals
>do not start treatment on the same day and once they start, the
>variable "drug-admin" is marked "x" as well as the time they stop
>treatment in the following year.
>There exists another variable, "study_id", that I hope to populate as
>can be seen in the dataset, with the following conditions:
>
>For every individual
>• if the individual has entries that show they received drugs both
>on the start and end date (marked with the "x")
>• if the start of drug administration falls in month == 2 | 3 and
>end of administration falls in month == 2 | 4
>• then, using the date that marks the start of drug administration,
>populate the variable "study_id" in all the rows that fall within the
>timeframe that the individual was given drugs but excluding the end of
>drug administration.
>I have tried my level best and while I have explored several examples
>online, I haven't managed to solve this. The dataset contains close to
>6000 individuals spanning 10 years and my best bet was to use a loop
>which keeps crushing R after running for close to 30min. I have also
>read that dplyr may do the job but my attempts have been in vain.
>
>sample code
>---
>individual <- unique (df$ID) #vector of individuals
>datalength <- dim(df)[1] #number of rows in dataframe
>
>for (i in 1:length(individual)) {
>  for (j in 1:datalength) {
>    start_admin <- df[(df$year == 2007] & df$drug_admin == "x" & c(df$month == 2 | df$month == 3),1] #capture date of start
>    end_admin <- df[(df$year == 2008] & df$drug_admin == "x" & c(df$month == 2 | df$month == 4),1] #capture date of end
>
>    if(df[datalength,1] == individual(i) & df[datalength,2] >= start_admin & df[datalength,2] < end_admin) {
>      df[datalength,6] <- start_admin #populate respective row if condition is met
>    }
>  }
>}
>
>---
>
>Above is the code that keeps failing..
>
>Any help is highly appreciated
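The advice quoted above, to build a simulated data set when the real data cannot be shared, might look like the following in practice. This is a minimal sketch: the column names follow the thread, while the IDs and dates are invented for illustration.

```r
# A small simulated data set with the same columns as described in the
# thread. IDs and dates are made up; no real study data is involved.
sim <- data.frame(
  ID = rep(c("A1", "B2"), each = 4),
  date = as.Date("2007-02-01") + c(0, 100, 200, 370, 30, 130, 230, 420),
  drug_admin = c("x", "", "", "x", "x", "", "", "x"),
  stringsAsFactors = FALSE
)
sim$year  <- as.integer(format(sim$date, "%Y"))
sim$month <- as.integer(format(sim$date, "%m"))

# dput() prints R code that recreates the object exactly, so the output
# can be pasted into a plain-text email instead of sending an attachment.
dput(sim)
```

Readers of the list can then assign the pasted `structure(...)` text to a variable and get the same data frame, including the Date column.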
[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
I have a drug-trial study dataset (attached image).

It is a large and complex dataset (at least to me), and I hope to be as clear as possible with my question.

The dataset is from a study where individuals are given drugs and followed up over a period spanning two consecutive years. Individuals do not start treatment on the same day. Once they start, the variable "drug-admin" is marked "x", and it is marked again at the time they stop treatment in the following year.

There is another variable, "study_id", that I hope to populate as can be seen in the dataset, with the following conditions:

For every individual
• if the individual has entries that show they received drugs on both the start and end date (marked with the "x")
• if the start of drug administration falls in month == 2 | 3 and the end of administration falls in month == 2 | 4
• then, using the date that marks the start of drug administration, populate the variable "study_id" in all the rows that fall within the timeframe that the individual was given drugs, excluding the end of drug administration.

I have tried my level best, and while I have explored several examples online, I haven't managed to solve this. The dataset contains close to 6000 individuals spanning 10 years, and my best bet was to use a loop, which keeps crashing R after running for close to 30 min. I have also read that dplyr may do the job, but my attempts have been in vain.

sample code
---
individual <- unique (df$ID) #vector of individuals
datalength <- dim(df)[1] #number of rows in dataframe

for (i in 1:length(individual)) {
  for (j in 1:datalength) {
    start_admin <- df[(df$year == 2007] & df$drug_admin == "x" & c(df$month == 2 | df$month == 3),1] #capture date of start
    end_admin <- df[(df$year == 2008] & df$drug_admin == "x" & c(df$month == 2 | df$month == 4),1] #capture date of end

    if(df[datalength,1] == individual(i) & df[datalength,2] >= start_admin & df[datalength,2] < end_admin) {
      df[datalength,6] <- start_admin #populate respective row if condition is met
    }
  }
}
---

Above is the code that keeps failing.

Any help is highly appreciated.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
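As posted, the sample code cannot run: `df[(df$year == 2007] & ...` has mismatched brackets, and `individual(i)` calls a function instead of indexing with `individual[i]`. Apart from that, the month conditions in the question (start in month 2 or 3, end in month 2 or 4 of the following year) can be written with `%in%` and matched across any pair of consecutive years by a merge. The sketch below uses made-up data, and the names starts, ends and pairs are illustrative only.

```r
# Toy data (invented): A1 meets both month conditions, B2 does not,
# because its end date falls in June.
df <- data.frame(
  ID = c("A1", "A1", "A1", "B2", "B2", "B2"),
  date = as.Date(c("2007-02-10", "2007-08-01", "2008-04-02",
                   "2009-03-05", "2009-09-09", "2010-06-01")),
  drug_admin = c("x", "", "x", "x", "", "x"),
  stringsAsFactors = FALSE
)
df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))

# Candidate start and end rows among the marked ("x") entries.
marked <- df[df$drug_admin == "x", ]
starts <- marked[marked$month %in% c(2, 3), c("ID", "year", "date")]
ends   <- marked[marked$month %in% c(2, 4), c("ID", "year", "date")]
names(starts) <- c("ID", "start_year", "start")
names(ends)   <- c("ID", "end_year", "end")

# Pair every start with every end of the same individual, then keep only
# the pairs where the end falls in the following calendar year. No year
# is hard-coded, so 2007-2008, 2008-2009 and so on are all handled.
pairs <- merge(starts, ends, by = "ID")
pairs <- pairs[pairs$end_year == pairs$start_year + 1L, ]
print(pairs)
```

The resulting table has one row per qualifying treatment period and can be merged back onto the full data by ID to fill "study_id" for the rows between start and end.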