Re: [R] create groups from data with duplicates, such that each group has a duplicate represented once

2019-01-17 Thread Kevin Wamae
Dear Petr, thank you for the guidance.

A colleague managed to solve it

I'll definitely use "dput" for future postings.

Regards
------
Kevin Wamae

On 17/01/2019, 03:57, "PIKAL Petr"  wrote:

Hi

Instead of an attachment, which is usually removed, you should use dput.

Something like the output from
dput(head(yourdata,30))

To remove duplicate values see

unique or duplicated
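
As a rough illustration of the advice above, the sketch below uses a small made-up data frame (the name `samples` and its values are assumptions, not the poster's data) to show what dput() prints and how duplicated() and unique() behave on whole rows.

```r
# Hypothetical stand-in for the poster's data; the third row repeats the second.
samples <- data.frame(
  ID   = c(180, 181, 181, 182),
  TagA = c("MID03", "MID04", "MID04", "MID05"),
  stringsAsFactors = FALSE
)

# Prints a structure(...) call that can be pasted straight into an e-mail.
dput(head(samples, 30))

# TRUE for every row that repeats an earlier row.
dup_rows <- duplicated(samples)

# Keeps only the first occurrence of each distinct row.
unique_samples <- unique(samples)
```

Readers can paste the dput() output into their own session to rebuild the object exactly, which is why it is preferred over attachments on this list.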

Cheers
Petr


[R] create groups from data with duplicates, such that each group has a duplicate represented once

2019-01-16 Thread Kevin Wamae
Hi, I have a sequencing run with ~3000 samples (attached dataset). The samples 
were initially tagged and amplified by PCR in duplicate. The tags used range 
from MID01 to MID26.

MID01-MID13 were used for pair 1 while MID14-MID26 were used for pair 2. The 
tags are re-used to allow samples to be pooled.

The pooling process will involve mixing samples with MID01-26 into the first 
group, the next group samples with MID01-26 into the second group and so on.

I'm hoping to get an R script that can create these groups such that for each 
group, any of the Tags appears only once. An example is shown below.

ID    TagA    TagB    group
180   MID03   MID10   group1
181   MID04   MID06   group1
182   MID05   MID07   group1
183   MID03   MID09   group2
184   MID04   MID10   group2
185   MID05   MID06   group2
186   MID01   MID06   group3
187   MID02   MID07   group3
188   MID03   MID08   group3
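
One possible reading of the example is a sequential scan that opens a new group as soon as a row's TagA or TagB has already been used in the current group. The sketch below is only that reading, not the solution the poster's colleague arrived at. The helper name `assign_groups` is hypothetical, and tracking TagA and TagB separately is an assumption.

```r
# Hypothetical helper: walk the rows in order and start a new group whenever
# the current row's TagA or TagB was already seen in the current group.
assign_groups <- function(df) {
  group   <- integer(nrow(df))
  current <- 1L
  seen_a  <- character(0)
  seen_b  <- character(0)
  for (i in seq_len(nrow(df))) {
    if (df$TagA[i] %in% seen_a || df$TagB[i] %in% seen_b) {
      current <- current + 1L      # tag clash: open the next group
      seen_a  <- character(0)
      seen_b  <- character(0)
    }
    seen_a   <- c(seen_a, df$TagA[i])
    seen_b   <- c(seen_b, df$TagB[i])
    group[i] <- current
  }
  df$group <- paste0("group", group)
  df
}

# The nine example rows from the post.
samples <- data.frame(
  ID   = 180:188,
  TagA = c("MID03", "MID04", "MID05", "MID03", "MID04", "MID05",
           "MID01", "MID02", "MID03"),
  TagB = c("MID10", "MID06", "MID07", "MID09", "MID10", "MID06",
           "MID06", "MID07", "MID08"),
  stringsAsFactors = FALSE
)
result <- assign_groups(samples)
```

On these nine rows the scan reproduces the group column shown in the example. Whether it suits the full ~3000-sample run depends on the samples already being in pooling order, which the post does not state.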



__

This e-mail contains information which is confidential. It is intended only for 
the use of the named recipient. If you have received this e-mail in error, 
please let us know by replying to the sender, and immediately delete it from 
your system.  Please note, that in these circumstances, the use, disclosure, 
distribution or copying of this information is strictly prohibited. 
KEMRI-Wellcome Trust Programme cannot accept any responsibility for the  
accuracy or completeness of this message as it has been transmitted over a 
public network. Although the Programme has taken reasonable precautions to 
ensure no viruses are present in emails, it cannot accept responsibility for 
any loss or damage arising from the use of the email or attachments. Any views 
expressed in this message are those of the individual sender, except where the 
sender specifically states them to be the views of KEMRI-Wellcome Trust 
Programme.
__
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] alternative for multiple if_else statements

2018-02-26 Thread Kevin Wamae
Dear Ellison, thank you for the feedback, we replaced dplyr::if_else with 
dplyr::case_when and it seems to do the trick.

Still, we have to write several statements to match all the respective years 
but it's working.

Let me see how we can implement your suggestion.

Regards
--
Kevin Wamae
On 26/02/2018, 14:57, "S Ellison" <s.elli...@lgcgroup.com> wrote:

That many ifelse statements is obviously rather a pain.

Would you not have got what you want with 

... paste("survey", year, sep="_") 
?

If that is not what you're looking for (eg because 'year' is the 
observation year and not the study start year), perhaps something that picks 
the minimum year for a subject or other relevant group might work? For example
paste("survey", ave(year, studyno, FUN=min), sep="_")
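
A minimal sketch of the two suggestions above, on made-up data (the `visits` data frame and its values are assumptions; only the column names `year` and `studyno` come from the thread):

```r
# Hypothetical visits for two participants.
visits <- data.frame(
  studyno = c("child_1", "child_1", "child_1", "child_2", "child_2"),
  year    = c(2014L, 2014L, 2015L, 2015L, 2016L),
  stringsAsFactors = FALSE
)

# Label each row by its own observation year.
visits$by_obs_year <- paste("survey", visits$year, sep = "_")

# Label each row by the earliest year recorded for that participant.
visits$by_first_year <- paste(
  "survey",
  ave(visits$year, visits$studyno, FUN = min),
  sep = "_"
)
```

ave() returns a vector as long as its input, so the per-participant minimum is repeated on every row of that participant and can be pasted directly.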


S Ellison


Re: [R] alternative for multiple if_else statements

2018-02-22 Thread Kevin Wamae
Dear Eric, thank you for that observation.

I realised that some of the participants have duplicated “survey_start” dates 
and when I corrected this, the code works.

Regards
--
Kevin Wamae
From: Eric Berger <ericjber...@gmail.com>
Date: Thursday, 22 February 2018 at 15:16
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: "R-help@r-project.org" <R-help@r-project.org>
Subject: Re: [R] alternative for multiple if_else statements

Hi Kevin,
I ran the code on the full data set and was able to reproduce the problem that 
you are facing.
My guess is that you have an error in your intuition and/or logic, and that 
this relates to the use of the subscript [1].
Specifically, on the full dataset, the condition
trialData$date[trialData$survey_start == "Y" & trialData$year == 2013 & 
trialData$site == "site_1"]

yields 412 matches, of which there are 9 unique ones, specifically

April 2,3,4,5,8,10,11,16,17

In the full data set the first element that appears, i.e. subscript[1], is 
"2013-04-04".

In the filtered data set the first element that appears is "2013-04-05".

I hope that is enough information for you to make further progress from here.
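
The effect described above can be reproduced on a toy example. The data frame `visits` and its values are made up for illustration; only the column names follow the thread.

```r
# Hypothetical data: two participants whose surveys start on different days.
visits <- data.frame(
  id           = c("a", "a", "b", "b"),
  date         = as.Date(c("2013-04-04", "2013-04-11",
                           "2013-04-05", "2013-04-12")),
  survey_start = c("Y", "N", "Y", "N"),
  stringsAsFactors = FALSE
)

# On the whole data set, [1] picks whichever matching row happens to come first.
starts_all <- visits$date[visits$survey_start == "Y"]
first_all  <- starts_all[1]

# After filtering to one participant, [1] picks that participant's start date.
one     <- visits[visits$id == "b", ]
first_b <- one$date[one$survey_start == "Y"][1]

# More than one distinct start date means [1] alone is ambiguous.
n_distinct_starts <- length(unique(starts_all))
```

Checking `unique()` on the matched dates before taking `[1]` is a cheap way to notice when a condition matches more rows than intended.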

Best,
Eric



On Thu, Feb 22, 2018 at 1:28 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
Dear Eric, wow, this seems to do the trick. But I have encountered a problem.

I have tested it on the larger dataset and it seems to work on a filtered 
dataset but not on the whole dataset (attached). See below script..

#load packages
library(dplyr)
library(data.table)  # provides fread()

#load data
trialData <- fread("trialData.txt") %>%
  mutate(date = as.Date(date, "%d/%m/%Y"))

#create blank variable
trialData$survey_year <- rep(NA_character_, nrow(trialData))

#attempt 1 fails: code for survey
trialData$survey_year[
  trialData$date >= trialData$date[trialData$survey_start == "Y" &
    trialData$year == 2013 & trialData$site == "site_1"][1] &
  trialData$date < trialData$date[trialData$month == 4 &
    trialData$year == 2014 & trialData$site == "site_1"][1]
] <- "survey_2013"

#filter trialData
trialData <- trialData %>% filter(id == "id_786/3")

#attempt 2 works: code for survey
trialData$survey_year[
  trialData$date >= trialData$date[trialData$survey_start == "Y" &
    trialData$year == 2013 & trialData$site == "site_1"][1] &
  trialData$date < trialData$date[trialData$month == 4 &
    trialData$year == 2014 & trialData$site == "site_1"][1]
] <- "survey_2013"



From: Eric Berger <ericjber...@gmail.com>
Date: Thursday, 22 February 2018 at 13:05
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: "R-help@r-project.org" <R-help@r-project.org>
Subject: Re: [R] alternative for multiple if_else statements

Hi,
1. I think the reason that the different ordering leads to different results is 
because of the following:
date[ some condition is true ][1]
will give you an NA if there are no rows where 'some condition holds'.
In the code that 'works' you don't have such a situation, but in the code 
that 'does not work' you presumably hit an NA before you get to the result that 
you really want.
2. I am not a big fan of your "nested if" layout. I think you could rewrite it 
more clearly - and without nesting - with something like

 > trialData$survey_year <- rep(NA_character_, nrow(trialData))
 > trialData$survey_year[ condition for survey_2007 ] <- "survey_2007"
 > trialData$survey_year[ condition for survey_2008 ] <- "survey_2008"
 > etc
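
Both points can be seen on a toy example. This is a sketch on made-up vectors (`visit_date` and `survey_start` are assumptions standing in for the real columns), not the poster's code.

```r
# Toy data (hypothetical): no visit is flagged as a survey start.
visit_date   <- as.Date(c("2014-01-08", "2014-01-15", "2014-01-22"))
survey_start <- c("N", "N", "N")

# The subset has length zero, so [1] yields NA rather than an error.
first_start <- visit_date[survey_start == "Y"][1]

# Every comparison against NA is itself NA, so no row satisfies the condition.
in_window <- visit_date >= first_start

# Flat, non-nested assignment that skips a year whose boundary is missing.
survey_year <- rep(NA_character_, length(visit_date))
if (!is.na(first_start)) {
  survey_year[visit_date >= first_start] <- "survey_2014"
}
```

Guarding each boundary with `is.na()` makes a missing survey start explicit instead of letting the NA propagate silently through later conditions.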

HTH,
Eric


Re: [R] alternative for multiple if_else statements

2018-02-22 Thread Kevin Wamae
Dear Ista, thank you. Let me see how best I can implement this.

Regards
--
Kevin Wamae

On 22/02/2018, 16:58, "Ista Zahn" <istaz...@gmail.com> wrote:

I don't fully understand the logic you are trying to implement, but
something along the lines of

foo <- cut(trialData$date,
           breaks = as.Date(c("2007-01-01",
                              "2008-05-01",
                              "2009-04-01",
                              "2010-05-01",
                              "2011-05-01",
                              "2012-04-01",
                              "2013-04-01",
                              "2014-04-01",
                              "2015-04-01",
                              "2016-03-01",
                              "2017-01-01")))

might work.
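
A minimal sketch of the same idea with labels attached, on made-up dates and only two survey windows (the break dates and label names here are illustrative assumptions):

```r
# Hypothetical visit dates around one survey boundary.
visit_date <- as.Date(c("2013-04-05", "2014-03-31", "2014-04-01"))

# Survey windows are [break_i, break_i+1); cut() on dates is left-closed by default.
breaks <- as.Date(c("2013-04-01", "2014-04-01", "2015-04-01"))

survey_year <- cut(visit_date,
                   breaks = breaks,
                   labels = c("survey_2013", "survey_2014"))

# cut() returns a factor; convert if a character column is wanted.
survey_year_chr <- as.character(survey_year)
```

A date that falls exactly on a break belongs to the window that starts there, and dates outside all breaks come back as NA, so the first and last breaks need to bracket the whole study period.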

Best,
Ista


[R] alternative for multiple if_else statements

2018-02-21 Thread Kevin Wamae
Hi, I am having trouble trying to figure out why if_else is behaving the way it 
is, it may be my code or the way the data is structured.

Below is a snapshot of a database am working on and it represents a 
longitudinal survey of study participants in a trial with weekly follow up.

The variable "survey_start" represents the start of the study-defined one year 
follow up (which we called "survey_year").

I am trying to populate all subsequent entries for each participant, per survey 
year, with the entry "survey" followed by an underscore and the respective 
year, eg. survey_2014.

There are missing entries such as the participant represented here, wasn't 
available at the start of the 2015 survey. Also, some participants don’t have 
complete one-year follow ups but I still need to include them.

I have written two codes, first one fails while the second works, the only 
difference being I have reversed the order in which the entries are populated 
in the second code (from 2007-2016 to 2016-2007) and removed the if_else 
statement for 2015. Also noticed, that for the second code, which spans the 
years 2007-2016 (less 2015), if a participants entries start from 2010-2016, 
the code fails.

Kindly assist in figuring this out...or better yet, an alternative.

trialData <- structure(list(study = c("site_1", "site_1", "site_1", 
"site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1",
"site_1", "site_1"), studyno = c("child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1",
"child_1", "child_1"), date = structure(c(16078, 16085, 16092,
16098, 16104, 16115, 16121, 16129, 16135, 16140, 16146, 16156,
16162, 16168, 16177, 16185, 16191, 16195, 16203, 16210, 16217,
16225, 16234, 16237, 16246, 16253, 16262, 16269, 16278, 16283,
16288, 16297, 16304, 16311, 16319, 16326, 16332, 16337, 16346,
16353, 16360, 16366, 16370, 16381, 16384, 16395, 16399, 16407,
16415, 16422, 16444, 16452, 16454, 16467, 16474, 16477, 16484,
16490, 16501, 16508, 16514, 16520, 16529, 16533, 16539, 16550,
16556, 16564, 16566, 16578, 16582, 16593, 16599, 16604, 16613,
16620, 16623, 16635, 16636, 16654, 16660, 1, 16673, 16681,
16688, 16693, 16702, 16706, 16714, 16721, 16728, 16734, 16745,
16749, 16757, 16764, 16769, 16778, 16785, 16792, 16805, 16812,
16819, 16830, 16832, 16839, 16846, 16856, 16862, 16867, 16877,
16884, 16890, 16898, 16904, 16912, 16917, 16923, 16936, 16938,
16953, 16960, 16966, 16973, 16980), class = "Date"), year = c(2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 

Re: [R] Populate one data frame with values from another dataframe for rows that match

2017-10-15 Thread Kevin Wamae
Pardon me, here’s @Eric’s solution…

myDF1$studyno <- as.character(myDF1$studyno)
myDF2$studyno <- as.character(myDF2$studyno)
myDF3 <- merge(myDF1, myDF2, by="studyno", all.x=TRUE ) %>%
  dplyr::mutate( pf_mcl = ifelse( is.na(pf_mcl.y), pf_mcl.x, pf_mcl.y ) ) %>%
  dplyr::select( studyno, date, pf_mcl )

Regards
------
Kevin Wamae


Re: [R] Populate one data frame with values from another dataframe for rows that match

2017-10-15 Thread Kevin Wamae
Dear @William, thanks for the feedback. I have tested 
it on the larger dataset and noticed that it created two variables, pf_raw and 
pf_curated.

The output we were looking for, was one that takes the variable pf_mcl in 
curated dataset and replaces pf_mcl in matching rows within the raw dataset.

@Eric’s solution was able to achieve that. 
Nonetheless, we do appreciate your solution.

Regards
------
Kevin Wamae

From: William Dunlap <wdun...@tibco.com>
Date: Saturday, 14 October 2017 at 20:21
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: Bert Gunter <bgunter.4...@gmail.com>, Rui Barradas <ruipbarra...@sapo.pt>, 
R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for 
rows that match

Your example used one distinct studyno in DF1 and one distinct pf_mcl in DF2.  
I think that makes it hard to see what is going on, but maybe I completely 
misunderstand the problem.  In any case, let's redefine myDF1 and myDF2.  Note 
that myDF1 contains a studyno not in myDF2 and vice versa.

myDF1 <- structure(list(studyno = c("J1000/9", "J895/7", "J931/6", "J666/6",
"J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
2L, 3L, 4L, 5L, NA_integer_
), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
"date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")

myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
"J609/1", "J941/3"), pf_mcl = c(101L, 102L, 103L, 104L, 105L, 106L)),
.Names = c("studyno", "pf_mcl"), row.names = c(NA, 6L), class = "data.frame")

m <- merge(myDF1, myDF2, by="studyno", all.x=TRUE, all.y=FALSE, 
suffixes=c(".raw", ".curated"))


The results are:

> myDF1
  studyno       date pf_mcl year
1 J1000/9 2016-11-18     NA 2016
2  J895/7 2016-11-22      2 2016
3  J931/6 2016-11-30      3 2016
4  J666/6 2016-12-09      4 2016
5 J1000/9 2016-12-13      5 2016
6 J1000/9 2016-12-20     NA 2016
> myDF2
  studyno pf_mcl
1  J740/4    101
2 J1000/9    102
3  J895/7    103
4  J931/6    104
5  J609/1    105
6  J941/3    106
> m
  studyno       date pf_mcl.raw year pf_mcl.curated
1 J1000/9 2016-11-18         NA 2016            102
2 J1000/9 2016-12-13          5 2016            102
3 J1000/9 2016-12-20         NA 2016            102
4  J666/6 2016-12-09          4 2016             NA
5  J895/7 2016-11-22          2 2016            103
6  J931/6 2016-11-30          3 2016            104


Now your problem is to combine the columns pf_mcl.raw and pf_mcl.curated in the 
way you want.  ifelse() may be useful for that.
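
One way the combining step could look in base R is sketched below, on a cut-down version of the data (the `raw` and `curated` data frames are assumptions). It also assumes the curated value should win wherever one exists.

```r
# Hypothetical cut-down versions of the raw and curated data frames.
raw <- data.frame(
  studyno = c("J1000/9", "J895/7", "J666/6"),
  pf_mcl  = c(NA, 2L, 4L),
  stringsAsFactors = FALSE
)
curated <- data.frame(
  studyno = c("J1000/9", "J895/7", "J740/4"),
  pf_mcl  = c(102L, 103L, 101L),
  stringsAsFactors = FALSE
)

# Left join: keep every raw row, attach the curated value where one exists.
m <- merge(raw, curated, by = "studyno", all.x = TRUE,
           suffixes = c(".raw", ".curated"))

# Prefer the curated value; fall back to the raw one when there is no match.
m$pf_mcl <- ifelse(is.na(m$pf_mcl.curated), m$pf_mcl.raw, m$pf_mcl.curated)
```

Note that merge() sorts the result by the key column by default, so the row order of `m` differs from that of `raw`.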


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Oct 13, 2017 at 10:48 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
Dear @Bert Gunter, I tried merge and I faced many challenges. @Rui Barradas's 
solution is working.

From: Bert Gunter <bgunter.4...@gmail.com>
Date: Friday, 13 October 2017 at 22:44
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for 
rows that match

?merge

Bert

On Oct 13, 2017 12:09 PM, "Kevin Wamae" <kwa...@kemri-wellcome.org> wrote:
I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, 
where rows match based on column "studyno" but the solutions I have found so 
far don't seem to be giving me the desired output.

Below is a snapshot of the data.frames.

myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", "J1000/9",
"J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
"date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")

myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
"J609/1&q

Re: [R] Populate one data frame with values from another dataframe for rows that match

2017-10-14 Thread Kevin Wamae
Dear @Eric, thank you so very much for noticing 
that. When I tested @Rui’s solution, it was on a 
smaller dataset that had purely matching rows. I had considered including 
non-matching rows to evaluate what the alternative would be.

Also, I hadn’t even tested it on the larger dataset. I have now and noticed 
that it went further to omit rows that did not match, just like you said.

Your proposed solution works well.

Much appreciated…I’ll get in touch in case I encounter any problems


From: Eric Berger <ericjber...@gmail.com>
Date: Saturday, 14 October 2017 at 12:43
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: Rui Barradas <ruipbarra...@sapo.pt>, "r-help@r-project.org" 
<r-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for 
rows that match

Hi Kevin,
I think there are issues with Rui's proposed solution. For example, if there 
are rows in myDF1 which have a studyno
which does not match any row in myDF2, then you will lose those rows. In your 
original request you said that you wanted to keep those rows.

To demonstrate my point I need to modify your sample data. Specifically, I 
changed some studyno settings in myDF1, and also the entries of pf_mcl in myDF1.

myDF1 <- structure(list(studyno = c("J1000/8", "J1000/9", "J1000/9", "J1000/9",
"J1000/5", "J1000/6"), date = structure(c(17123, 17127, 17135,
17144, 17148, 17155), class = "Date"), pf_mcl = c(1:6
), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
"date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")

myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
"J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno",
"pf_mcl"), row.names = c(NA, 6L), class = "data.frame")

# Rui's proposal gives the following result
#   studyno       date year pf_mcl
# 1 J1000/9 2016-11-22 2016      0
# 2 J1000/9 2016-11-30 2016      0
# 3 J1000/9 2016-12-09 2016      0

My proposal

library(dplyr)

myDF1$studyno <- as.character(myDF1$studyno)
myDF2$studyno <- as.character(myDF2$studyno)
myDF3 <- merge(myDF1, myDF2, by="studyno", all.x=TRUE ) %>%
  dplyr::mutate( pf_mcl = ifelse( is.na(pf_mcl.y), pf_mcl.x, pf_mcl.y ) ) %>%
  dplyr::select( studyno, date, pf_mcl )

# The results of this approach
#   studyno       date pf_mcl
# 1 J1000/5 2016-12-13      5
# 2 J1000/6 2016-12-20      6
# 3 J1000/8 2016-11-18      1
# 4 J1000/9 2016-11-22      0
# 5 J1000/9 2016-11-30      0
# 6 J1000/9 2016-12-09      0

Comparing the two results you see that no rows have been dropped in my approach.
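
If keeping myDF1's row order and its other columns matters, a base R variant 
with match() may also be worth a look. This is only a sketch: it assumes 
studyno is unique in myDF2, and the small data frames below are made up for 
illustration rather than taken from the thread.

```r
# Hypothetical stand-ins for myDF1 and myDF2 (not the data from the thread)
myDF1 <- data.frame(
  studyno = c("J1000/8", "J1000/9", "J1000/9", "J1000/5"),
  pf_mcl  = c(1L, 2L, NA, 4L),
  year    = 2016,
  stringsAsFactors = FALSE
)
myDF2 <- data.frame(
  studyno = c("J740/4", "J1000/9"),
  pf_mcl  = c(7L, 0L),
  stringsAsFactors = FALSE
)

# Position of each myDF1 studyno in myDF2 (NA where there is no match)
idx <- match(myDF1$studyno, myDF2$studyno)
hit <- !is.na(idx)

# Overwrite only the matched rows; unmatched rows keep their old value
myDF1$pf_mcl[hit] <- myDF2$pf_mcl[idx[hit]]
```

Because myDF1 is modified in place, no rows are added, dropped or reordered, 
and columns such as year are left untouched.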

HTH,

Eric





On Sat, Oct 14, 2017 at 8:49 AM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
Dear @Rui Barradas, thank you for the solution. It works perfectly.


On 13/10/2017, 23:35, "Rui Barradas" <ruipbarra...@sapo.pt> wrote:

Hello,

Try the following.


myDF1$studyno <- as.character(myDF1$studyno)
myDF2$studyno <- as.character(myDF2$studyno)
i1 <- which(names(myDF1) == "pf_mcl")

merge(myDF1[-i1], myDF2, by = "studyno")


Hope this helps,

Rui Barradas

Em 13-10-2017 20:09, Kevin Wamae escreveu:
> I'm trying to populate the column “pf_mcl” in myDF1 with values from 
myDF2, where rows match based on column "studyno" but the solutions I have 
found so far don't seem to be giving me the desired output.
>
> Below is a snapshot of the data.frames.
>
> myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", 
"J1000/9",
> "J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
> 17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
> ), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
> "date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", 
"J931/6",
> "J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = 
c("studyno",
> "pf_mcl"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 is a well curated subset of myDF1. Some rows in the two datasets 
match based on &

Re: [R] Populate one data frame with values from another dataframe for rows that match

2017-10-13 Thread Kevin Wamae
Dear @Rui Barradas, thank you for the solution. It works perfectly.


On 13/10/2017, 23:35, "Rui Barradas" <ruipbarra...@sapo.pt> wrote:

Hello,

Try the following.


myDF1$studyno <- as.character(myDF1$studyno)
myDF2$studyno <- as.character(myDF2$studyno)
i1 <- which(names(myDF1) == "pf_mcl")

merge(myDF1[-i1], myDF2, by = "studyno")
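
One caveat that may be worth checking: merge() defaults to an inner join, so 
any row of the first data frame whose studyno is absent from the second is 
dropped. A minimal sketch with made-up data (df1 and df2 are hypothetical, not 
the thread's data frames) comparing the default with all.x = TRUE:

```r
# Hypothetical data: one studyno in df1 has no counterpart in df2
df1 <- data.frame(studyno = c("J1/1", "J1/1", "J2/2"),
                  visit   = 1:3,
                  stringsAsFactors = FALSE)
df2 <- data.frame(studyno = "J1/1", pf_mcl = 0L,
                  stringsAsFactors = FALSE)

inner <- merge(df1, df2, by = "studyno")                # default: all = FALSE
left  <- merge(df1, df2, by = "studyno", all.x = TRUE)  # keep every df1 row

nrow(inner)  # only the rows that found a match
nrow(left)   # every row of df1; pf_mcl is NA where nothing matched
```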


Hope this helps,

    Rui Barradas

Em 13-10-2017 20:09, Kevin Wamae escreveu:
> I'm trying to populate the column “pf_mcl” in myDF1 with values from 
myDF2, where rows match based on column "studyno" but the solutions I have 
found so far don't seem to be giving me the desired output.
>
> Below is a snapshot of the data.frames.
>
> myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", 
"J1000/9",
> "J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
> 17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
> ), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
> "date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", 
"J931/6",
> "J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = 
c("studyno",
> "pf_mcl"), row.names = c(NA, 6L), class = "data.frame")
>
> myDF2 is a well curated subset of myDF1. Some rows in the two datasets 
match based on "studyno", one may find that values are missing in myDF1$pf_mcl 
or the values are wrong.
>
> All I want to do is identify a matching row in myDF2 and populate 
myDF1$pf_mcl with the value in myDF2$pf_mcl. If a row does not match based on 
“studyno”, the value should remain the same.
>
> It's probably worth mentioning, the two data frames have other 
columns...I have selected a few for example purposes.
>
>
>
> __
>
> This e-mail contains information which is confidential. It is intended 
only for the use of the named recipient. If you have received this e-mail in 
error, please let us know by replying to the sender, and immediately delete it 
from your system.  Please note, that in these circumstances, the use, 
disclosure, distribution or copying of this information is strictly prohibited. 
KEMRI-Wellcome Trust Programme cannot accept any responsibility for the  
accuracy or completeness of this message as it has been transmitted over a 
public network. Although the Programme has taken reasonable precautions to 
ensure no viruses are present in emails, it cannot accept responsibility for 
any loss or damage arising from the use of the email or attachments. Any views 
expressed in this message are those of the individual sender, except where the 
sender specifically states them to be the views of KEMRI-Wellcome Trust 
Programme.
> __
>
>   [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>




Re: [R] Populate one data frame with values from another dataframe for rows that match

2017-10-13 Thread Kevin Wamae
Dear @Bert Gunter, I tried merge and I faced many challenges. 
@Rui Barradas's solution is working.

From: Bert Gunter <bgunter.4...@gmail.com>
Date: Friday, 13 October 2017 at 22:44
To: Kevin Wamae <kwa...@kemri-wellcome.org>
Cc: R-help <R-help@r-project.org>
Subject: Re: [R] Populate one data frame with values from another dataframe for 
rows that match

?merge

Bert

On Oct 13, 2017 12:09 PM, "Kevin Wamae" <kwa...@kemri-wellcome.org> wrote:
I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, 
where rows match based on column "studyno" but the solutions I have found so 
far don't seem to be giving me the desired output.

Below is a snapshot of the data.frames.

myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", "J1000/9",
"J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
"date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")

myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
"J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno",
"pf_mcl"), row.names = c(NA, 6L), class = "data.frame")

myDF2 is a well curated subset of myDF1. Some rows in the two datasets match 
based on "studyno", one may find that values are missing in myDF1$pf_mcl or the 
values are wrong.

All I want to do is identify a matching row in myDF2 and populate myDF1$pf_mcl 
with the value in myDF2$pf_mcl. If a row does not match based on “studyno”, the 
value should remain the same.

It's probably worth mentioning, the two data frames have other columns...I have 
selected a few for example purposes.




[R] Populate one data frame with values from another dataframe for rows that match

2017-10-13 Thread Kevin Wamae
I'm trying to populate the column “pf_mcl” in myDF1 with values from myDF2, 
where rows match based on column "studyno" but the solutions I have found so 
far don't seem to be giving me the desired output.

Below is a snapshot of the data.frames.

myDF1 <- structure(list(studyno = c("J1000/9", "J1000/9", "J1000/9", "J1000/9",
"J1000/9", "J1000/9"), date = structure(c(17123, 17127, 17135,
17144, 17148, 17155), class = "Date"), pf_mcl = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), year = c(2016, 2016, 2016, 2016, 2016, 2016)), .Names = c("studyno",
"date", "pf_mcl", "year"), row.names = c(NA, 6L), class = "data.frame")

myDF2 <- structure(list(studyno = c("J740/4", "J1000/9", "J895/7", "J931/6",
"J609/1", "J941/3"), pf_mcl = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("studyno",
"pf_mcl"), row.names = c(NA, 6L), class = "data.frame")

myDF2 is a well curated subset of myDF1. Some rows in the two datasets match 
based on "studyno", one may find that values are missing in myDF1$pf_mcl or the 
values are wrong.

All I want to do is identify a matching row in myDF2 and populate myDF1$pf_mcl 
with the value in myDF2$pf_mcl. If a row does not match based on “studyno”, the 
value should remain the same.

It's probably worth mentioning, the two data frames have other columns...I have 
selected a few for example purposes.




Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-04 Thread Kevin Wamae
Hi Bert, the “status” at the end of the study does exist in the original 
dataset; what was missing was the time between events. There are also many 
events between the first and last day that are to be explored in this work.

The suggestion I received then was to compute the time between the initial date 
for each individual and all subsequent events, up to the last day of the 
study. The rationale is that once I have that column of differences in days, I 
can use it for any other calculations that arise.

Let me try your suggested script and see how that goes. Highly appreciated.

Regards
---
Kevin Wame 
 

On 7/4/16, 9:32 AM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:

A kaplan-meier plot requires for each individual (in each treatment
group, if there are more than one):

1. Survival time, which in your case appears to mean time without disease;
2. Status at end of time on study: whether the individual was censored
(still without disease) or died (in your case, was diseased) on the
last date they are seen in the study.

AFAICT, the 2nd piece of information is not present in your data; if
this is so, then you cannot do the K-M plot or, indeed, any survival
analysis. That is, you can quit the analysis right now.
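
To make the two requirements concrete, here is a minimal sketch using the
survival package (a recommended package shipped with most R installations).
The data frame km_data and its columns are invented for illustration; status
is coded 1 for diseased and 0 for censored.

```r
library(survival)

# Hypothetical data: one row per individual
km_data <- data.frame(
  time    = c(94, 91, 364, 120, 200, 378),   # days without disease
  status  = c(1, 0, 1, 1, 0, 0),             # 1 = diseased, 0 = censored
  treated = c("yes", "yes", "yes", "no", "no", "no")
)

# Kaplan-Meier estimate for each treatment group
fit <- survfit(Surv(time, status) ~ treated, data = km_data)

# One step curve per group
plot(fit, lty = 1:2, xlab = "Days", ylab = "Proportion without disease")
legend("bottomleft", legend = names(fit$strata), lty = 1:2)
```

Without the status column, Surv() has nothing to tell events from censoring,
which is why both pieces of information are needed.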

If you have the status, where is it?

If, for example, the last date for each individual is the date at
which disease is first seen, then you can simply convert the date
column to the Date class with ?as.Date (the year and month columns
appear to be useless as they repeat info already available in the date
columns), and then:

survtimes_byID <- with(datasetname, tapply(date, ID, function(x) diff(range(x))))

will give you a list of survival times (in days) by ID. See ?with,
?tapply for details.
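
A small self-contained illustration of that call, with an invented data frame
visits standing in for datasetname. The as.numeric() wrapper is an addition
here, used only so that the result is plainly a number of days.

```r
# Hypothetical visit records; the date column is already of class Date
visits <- data.frame(
  ID   = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2007-05-11", "2007-05-16", "2007-05-21",
                   "2007-05-15", "2007-08-15"))
)

# Days between the first and last record of each ID
survtimes_byID <- with(visits,
  tapply(date, ID, function(x) as.numeric(diff(range(x)), units = "days")))

survtimes_byID
```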

If the status info is in some other form, then this advice should be
ignored of course and you have to incorporate it into your data in
some other way.


Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jul 3, 2016 at 2:43 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> There are a great many hits when I search on the keywords "kaplan meier plot 
> R"... so my first reaction is that you should be referring to some of the 
> existing packages for doing this type of analysis. I do not do this type of 
> analysis normally, so am probably not your best helper... perhaps someone 
> else will chime in if you show that you have read some existing KM examples.
>
> My second reaction is that if you want to avoid losing records you should 
> also avoid adding records. Your example extends from the first matching date 
> to and including the next matching date, which conflicts with analysis of 
> successive treatment periods. You may have a good reason for doing this, but 
> in my experience this is usually a mistake.
>
> Finally, I think you should more closely study the use of the ave function 
> that I already used if you want to work with the data in its original form. 
> It should not be too difficult to generate your diff_days column using ave if 
> you have the admin_period column that I showed you how to make.
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>>Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the
>>risk of developing disease in the treated vs the non-treated
>>individuals. I therefore figured it might be easier to compute dates
>>first as any further analysis will be based on time, in this case days.
>>I keep getting recommendations on how to tweak my analysis and it keeps
>>coming down to dates between the start of drug administration and the
>>end of it.
>>
>>Can you suggest an “easier” way to go about this..
>>
>>Regards
>>---
>>Kevin Wame
>>
>>
>>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:
>>
>>I haven't followed this thread closely, but if it's not too late, I
>>might suggest that you stop worrying about how you want your data
>>frame to look and start worrying about how you want to display/analyze
>>your data. As Jeff suggested, you and your supervisor are probably
>>being driven by paradigms from Excel, SPSS, or whatever that are
>>simply unnecessary for R. My guess would be that if you explained the
>>sort of analyses/plots you wish to do, you will find it can be done
>>fairly directly from your existing data. At the very least it would
>>give Jeff and 

Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-04 Thread Kevin Wamae
Hi Jeff, thanks and I will explore your suggestions too..

Regards
---
Kevin Wame 

 

On 7/4/16, 12:43 AM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

There are a great many hits when I search on the keywords "kaplan meier plot 
R"... so my first reaction is that you should be referring to some of the 
existing packages for doing this type of analysis. I do not do this type of 
analysis normally, so am probably not your best helper... perhaps someone else 
will chime in if you show that you have read some existing KM examples. 

My second reaction is that if you want to avoid losing records you should also 
avoid adding records. Your example extends from the first matching date to and 
including the next matching date, which conflicts with analysis of successive 
treatment periods. You may have a good reason for doing this, but in my 
experience this is usually a mistake. 

Finally, I think you should more closely study the use of the ave function that 
I already used if you want to work with the data in its original form. It 
should not be too difficult to generate your diff_days column using ave if you 
have the admin_period column that I showed you how to make. 
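
A rough sketch of that idea, assuming the rows are sorted by date within each 
ID and that drug_admin holds "Y" on administration days. The drug_study rows 
below are invented; only the column names follow the thread.

```r
# Hypothetical rows, sorted by date within each ID
drug_study <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2007-05-11", "2007-05-16", "2008-02-04", "2008-02-10",
                   "2007-05-15", "2007-05-20", "2007-08-15")),
  drug_admin = c("Y", "", "Y", "", "Y", "", "Y"),
  stringsAsFactors = FALSE
)

# Running count of administrations per ID (0 before the first "Y")
drug_study$admin_period <- ave("Y" == drug_study$drug_admin,
                               drug_study$ID, FUN = cumsum)

# Days since the first row of the current period, added as a new column
drug_study$diff_days <- ave(as.numeric(drug_study$date),
                            drug_study$ID, drug_study$admin_period,
                            FUN = function(d) d - d[1L])
```

This adds columns to the existing data frame rather than building a new one. 
Note that the row that opens the next period gets 0 here instead of being left 
blank.
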
-- 
Sent from my phone. Please excuse my brevity.

On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the
>risk of developing disease in the treated vs the non-treated
>individuals. I therefore figured it might be easier to compute dates
>first as any further analysis will be based on time, in this case days.
>I keep getting recommendations on how to tweak my analysis and it keeps
>coming down to dates between the start of drug administration and the
>end of it.
>
>Can you suggest an “easier” way to go about this.. 
>
>Regards
>---
>Kevin Wame 
> 
>
>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:
>
>I haven't followed this thread closely, but if it's not too late, I
>might suggest that you stop worrying about how you want your data
>frame to look and start worrying about how you want to display/analyze
>your data. As Jeff suggested, you and your supervisor are probably
>being driven by paradigms from Excel, SPSS, or whatever that are
>simply unnecessary for R. My guess would be that if you explained the
>sort of analyses/plots you wish to do, you will find it can be done
>fairly directly from your existing data. At the very least it would
>give Jeff and other helpeRs a better idea of what you might need
>rather than what you and your supervisor think you need.
>
>
>Cheers,
>Bert
>
>
>Bert Gunter
>
>"The trouble with having an open mind is that people keep coming along
>and sticking things into it."
>-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <kwa...@kemri-wellcome.org>
>wrote:
>> Hi Jeff, it works well on a dataset with 10 rows and I figure
>it will work well with the “real” dataset. You’ve been of great help
>and I am starting to make headway.
>>
>> It creates a new dataframe (result), as shown below that doesn’t
>quite have the result as I would want it.
>>
>> ID      admin_period  start     end       ddays
>> J1/3    1             5/11/07   8/13/07   94
>> J1/3    2             8/13/07   11/12/07  91
>> J1/3    3             11/12/07  2/4/08    84
>> J1/3    4             2/4/08    5/5/08    91
>> J1/3    5             5/5/08    5/4/09    364
>> J1/3    6             5/4/09    5/17/10   378
>> J1/3    7             5/17/10   5/16/11   364
>> J10/1   1             5/11/07   8/13/07   94
>> J10/1   2             8/13/07   11/12/07  91
>> J10/1   3             11/12/07  2/4/08    84
>> J10/1   4             2/4/08    5/5/08    91
>> J10/1   5             5/5/08    5/8/09    368
>> J10/1   6             5/8/09    5/17/10   374
>> J10/1   7             5/17/10   5/16/11   364
>> J102/1  1             5/15/07   8/15/07   92
>> J102/1  2             8/15/07   11/13/07  90
>> J102/1  3             11/13/07  2/5/08    84
>> J102/1  4             2/5/08    5/6/08    91
>> J102/1  5             5/6/08    5/5/09    364
>> J102/1  6             5/5/09    5/19/10   379
>>
>> My supervisor doesn’t want me to create a new dataset, she’s afraid I
>might lose some data…I cannot fight that.
>>
>> Like you mentioned earlier, I might be mixing up things which I think
>is what you alluded t

Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-03 Thread Kevin Wamae
Hi Bert, my first task is to make a Kaplan-Meier plot to evaluate the risk of 
developing disease in the treated vs the non-treated individuals. I therefore 
figured it might be easier to compute dates first, as any further analysis will 
be based on time, in this case days. I keep getting recommendations on how to 
tweak my analysis, and it keeps coming down to the dates between the start of 
drug administration and the end of it.

Can you suggest an “easier” way to go about this?

Regards
---
Kevin Wame 
 

On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4...@gmail.com> wrote:

I haven't followed this thread closely, but if it's not too late, I
might suggest that you stop worrying about how you want your data
frame to look and start worrying about how you want to display/analyze
your data. As Jeff suggested, you and your supervisor are probably
being driven by paradigms from Excel, SPSS, or whatever that are
simply unnecessary for R. My guess would be that if you explained the
sort of analyses/plots you wish to do, you will find it can be done
fairly directly from your existing data. At the very least it would
give Jeff and other helpeRs a better idea of what you might need
rather than what you and your supervisor think you need.


Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
> Hi Jeff, it works well on a dataset with 10 rows and I figure it will 
> work well with the “real” dataset. You’ve been of great help and I am 
> starting to make headway.
>
> It creates a new dataframe (result), as shown below that doesn’t quite have 
> the result as I would want it.
>
> ID      admin_period  start     end       ddays
> J1/3    1             5/11/07   8/13/07   94
> J1/3    2             8/13/07   11/12/07  91
> J1/3    3             11/12/07  2/4/08    84
> J1/3    4             2/4/08    5/5/08    91
> J1/3    5             5/5/08    5/4/09    364
> J1/3    6             5/4/09    5/17/10   378
> J1/3    7             5/17/10   5/16/11   364
> J10/1   1             5/11/07   8/13/07   94
> J10/1   2             8/13/07   11/12/07  91
> J10/1   3             11/12/07  2/4/08    84
> J10/1   4             2/4/08    5/5/08    91
> J10/1   5             5/5/08    5/8/09    368
> J10/1   6             5/8/09    5/17/10   374
> J10/1   7             5/17/10   5/16/11   364
> J102/1  1             5/15/07   8/15/07   92
> J102/1  2             8/15/07   11/13/07  90
> J102/1  3             11/13/07  2/5/08    84
> J102/1  4             2/5/08    5/6/08    91
> J102/1  5             5/6/08    5/5/09    364
> J102/1  6             5/5/09    5/19/10   379
>
> My supervisor doesn’t want me to create a new dataset, she’s afraid I might 
> lose some data…I cannot fight that.
>
> As you mentioned earlier, I might be mixing things up, which I think is what 
> you were alluding to.
>
> After consultation with my supervisor, this is what we’ve agreed. For every 
> individual, given the start and end date, create a new column (say, 
> diff_days); for every row that falls within the range of start and end date, 
> take the difference between the date in that row and the start date and put 
> it in the diff_days column. Below is an example of the result. As can be 
> seen, 5/11/2007 is the start while 2/4/2008 is the end. diff_days has been 
> populated excluding the end date because that is the start of the study in 
> 2008 that will continue into 2009; thus from 2/4/2008 I should compute 
> diff_days until 2009, and so on (I hope this makes sense).
>
> ID    date       drug_admin  year  month  diff_days
> R1/3  5/11/2007  Y           2007  5      0
> R1/3  5/16/2007              2007  5      6
> R1/3  5/22/2007              2007  5      11
> R1/3  5/28/2007              2007  5      17
> R1/3  1/14/2008              2008  1      248
> R1/3  1/21/2008              2008  1      255
> R1/3  1/28/2008              2008  1      263
> R1/3  2/4/2008   Y           2008  2
>
>
> Regards
> ---
> Kevin Wame
>
>
> On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:
>
> Typo on the second line
>
> result <- (   result0
>   %>% select( -admin_period1 )
>   %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
>

Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-03 Thread Kevin Wamae
Hi Jeff, it works well on a dataset with 10 rows and I figure it will work 
well with the “real” dataset. You’ve been of great help and I am starting to 
make headway. 

It creates a new data frame (result), shown below, that isn’t quite the result 
I want.

ID      admin_period  start     end       ddays
J1/3    1             5/11/07   8/13/07   94
J1/3    2             8/13/07   11/12/07  91
J1/3    3             11/12/07  2/4/08    84
J1/3    4             2/4/08    5/5/08    91
J1/3    5             5/5/08    5/4/09    364
J1/3    6             5/4/09    5/17/10   378
J1/3    7             5/17/10   5/16/11   364
J10/1   1             5/11/07   8/13/07   94
J10/1   2             8/13/07   11/12/07  91
J10/1   3             11/12/07  2/4/08    84
J10/1   4             2/4/08    5/5/08    91
J10/1   5             5/5/08    5/8/09    368
J10/1   6             5/8/09    5/17/10   374
J10/1   7             5/17/10   5/16/11   364
J102/1  1             5/15/07   8/15/07   92
J102/1  2             8/15/07   11/13/07  90
J102/1  3             11/13/07  2/5/08    84
J102/1  4             2/5/08    5/6/08    91
J102/1  5             5/6/08    5/5/09    364
J102/1  6             5/5/09    5/19/10   379

My supervisor doesn’t want me to create a new dataset, she’s afraid I might 
lose some data…I cannot fight that.

As you mentioned earlier, I might be mixing things up, which I think is what 
you were alluding to.

After consultation with my supervisor, this is what we’ve agreed. For every 
individual, given the start and end date, create a new column (say, diff_days); 
for every row that falls within the range of start and end date, take the 
difference between the date in that row and the start date and put it in the 
diff_days column. Below is an example of the result. As can be seen, 
5/11/2007 is the start while 2/4/2008 is the end. diff_days has been 
populated excluding the end date because that is the start of the study in 
2008 that will continue into 2009; thus from 2/4/2008 I should compute 
diff_days until 2009, and so on (I hope this makes sense).

ID    date       drug_admin  year  month  diff_days
R1/3  5/11/2007  Y           2007  5      0
R1/3  5/16/2007              2007  5      6
R1/3  5/22/2007              2007  5      11
R1/3  5/28/2007              2007  5      17
R1/3  1/14/2008              2008  1      248
R1/3  1/21/2008              2008  1      255
R1/3  1/28/2008              2008  1      263
R1/3  2/4/2008   Y           2008  2


Regards
---
Kevin Wame 
 

On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

Typo on the second line

result <- (   result0
  %>% select( -admin_period1 )
  %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
                , by = c( ID="ID", admin_period="admin_period1" )
                )
  %>% mutate( ddays = end - start )
  )
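
The same start/end pairing can also be sketched in base R without the 
self-join: within each ID, a period's end is simply the next period's start. 
The periods data frame below is invented and plays the role of result0; it is 
a sketch under that assumption, not a replacement for the code above.

```r
# Hypothetical per-period start dates, one row per ID and admin_period,
# already ordered by admin_period within each ID
periods <- data.frame(
  ID = c("A", "A", "A", "B", "B"),
  admin_period = c(1L, 2L, 3L, 1L, 2L),
  start = as.Date(c("2007-05-11", "2007-08-13", "2007-11-12",
                    "2007-05-15", "2007-08-15")),
  stringsAsFactors = FALSE
)

# End of each period = start of the next period for the same ID (NA for last)
next_start <- ave(as.numeric(periods$start), periods$ID,
                  FUN = function(s) c(s[-1L], NA))
periods$end <- as.Date(next_start, origin = "1970-01-01")

# Period length in days
periods$ddays <- as.numeric(periods$end - periods$start)

# Drop each ID's open-ended last period, as the inner join does
result <- periods[!is.na(periods$end), ]
```
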
-- 
Sent from my phone. Please excuse my brevity.

On July 3, 2016 11:55:14 AM PDT, Kevin Wamae <kwa...@kemri-wellcome.org> wrote:
>Hi Jeff, “like this is Excel”, I don’t follow. Pardon me for any mix-up.
>
>Thanks for the code.  After running it, this is the error I get.
>
>Error: cannot join on columns 'admin_period' x 'admin_period1': index
>out of bounds
>
>Regards
>---
>Kevin Wame | Ph.D. Student (IDeAL)
>KEMRI-Wellcome Trust Collaborative Research Programme
>Centre for Geographic Medicine Research
>P.O. Box 230-80108, Kilifi, Kenya
> 
>
>On 7/3/16, 9:34 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:
>
>I still get the impression from your mixing of information types that
>you are thinking like this is Excel.
>
>Perhaps something like
>
>drug_study$admin_period  <- ave( "Y" == drug_study$drug_admin,
>drug_study$ID, FUN=cumsum )
>library(dplyr)
>result0 <- (   drug_study
>  %>% filter( 0 != admin_period )
>  %>% group_by( ID, admin_period )
>  %>% summarise( start = min( date ) )
>  %>% mutate( admin_period1 = admin_period -1 )
>  )
>result <- (   result0 
>  %>% select( -admin_period )
> %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
> , by = c( ID="ID", admin_period ="admin_period1" )
>)
>  %>% mutate( ddays = end - start )
>  )
>-- 
>Sent from my phone. Please excuse my brevity.
>
>On July 3, 2016 10:24:51 AM PDT, Kevin Wamae
><kwa...@kemr

Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-03 Thread Kevin Wamae
Thanks Jeff, let me try it on the larger dataset.

Regards
---
Kevin Wame 
 

On 7/3/16, 10:09 PM, "Jeff Newmiller"  wrote:

result <- (   result0 
  %>% select( -admin_period1 )
  %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
   , by = c( ID="ID", admin_period ="admin_period1" )
)
  %>% mutate( ddays = end - start )
  )


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-03 Thread Kevin Wamae
Hi Jeff, it’s been an uphill task working with the dataset, and I am not the 
first to complain. Nonetheless, data cleaning is ongoing and, since I cannot 
wait for that to finish, I decided to make the most of the dataset as it 
stands. The cleaning may take a while.

Thanks for the script. From the output, I noticed that “result” contains the 
first and last date for each individual and does not take the variable 
“drug_admin” into account.

ID      start    end
J1/3    1/5/09   12/25/10
R1/3    1/4/07   12/15/08
R10/1   1/4/07   3/5/12

My aim is to pick the date where drug_admin == “Y” in one year (2007, for 
example) as my start, and the date where drug_admin == “Y” in the subsequent 
year (2008 in this case) as my end. Then I should populate the variable 
“study_id” with “start”, from the start row up to the entry just above the one 
whose date matches “end”, as the output below shows (I hope its structure 
survives, as I copied it from RStudio). The goal for now is to get the 
difference in days between “date” and “study_id” while keeping the “study_id” 
column, as I might use it later.

From the output, it can be seen that for this individual the dates run from 
2007 to 2008. However, for other individuals the dates run from 2008-2009, 
2009-2010 and so on. Therefore, the script needs to handle all the years, as 
the dates range from 2001 to 2016.

ID     date       drug_admin   year   month   study_id
R1/3   5/11/07    Y            2007   5       5/11/07
R1/3   5/16/07                 2007   5       5/11/07
R1/3   5/22/07                 2007   5       5/11/07
R1/3   5/28/07                 2007   5       5/11/07
R1/3   6/5/07                  2007   6       5/11/07
R1/3   6/11/07                 2007   6       5/11/07
R1/3   6/18/07                 2007   6       5/11/07
R1/3   6/25/07                 2007   6       5/11/07
R1/3   7/2/07                  2007   7       5/11/07
R1/3   7/16/07                 2007   7       5/11/07
R1/3   7/29/07                 2007   7       5/11/07
R1/3   8/2/07                  2007   8       5/11/07
R1/3   8/7/07                  2007   8       5/11/07
R1/3   8/13/07                 2007   8       5/11/07
R1/3   9/18/07                 2007   9       5/11/07
R1/3   9/24/07                 2007   9       5/11/07
R1/3   10/6/07                 2007   10      5/11/07
R1/3   10/8/07                 2007   10      5/11/07
R1/3   10/15/07                2007   10      5/11/07
R1/3   10/22/07                2007   10      5/11/07
R1/3   10/29/07                2007   10      5/11/07
R1/3   11/8/07                 2007   11      5/11/07
R1/3   11/12/07                2007   11      5/11/07
R1/3   11/19/07                2007   11      5/11/07
R1/3   11/29/07                2007   11      5/11/07
R1/3   12/6/07                 2007   12      5/11/07
R1/3   12/10/07                2007   12      5/11/07
R1/3   12/21/07                2007   12      5/11/07
R1/3   1/7/08                  2008   1       5/11/07
R1/3   1/14/08                 2008   1       5/11/07
R1/3   1/21/08                 2008   1       5/11/07
R1/3   1/28/08                 2008   1       5/11/07
R1/3   2/4/08     Y            2008   2
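
The fill shown in this table can be produced without a loop. The following is only a sketch on made-up rows, not the solution adopted in the thread: the toy data, the helper names (`is_dose`, `period`, `n_doses`, `in_window`, `key`) and the `ddays` column are assumptions for illustration, and the rows are assumed to be sorted by date within each ID.

```r
# Toy data (made up), already sorted by ID and date
drug_study <- data.frame(
  ID = c(rep("J1/3", 4), rep("R1/3", 5)),
  date = as.Date(c("2009-01-05", "2009-03-02", "2010-02-01", "2010-12-25",
                   "2007-05-11", "2007-05-16", "2008-01-28", "2008-02-04",
                   "2008-02-11")),
  drug_admin = c("", "Y", "Y", "", "Y", "", "", "Y", ""),
  stringsAsFactors = FALSE
)

is_dose <- as.integer(drug_study$drug_admin == "Y")

# Doses seen so far within each ID (0 means before the first dose)
period <- ave(is_dose, drug_study$ID, FUN = cumsum)

# Total number of doses recorded for each ID
n_doses <- ave(is_dose, drug_study$ID, FUN = sum)

# A row lies in [start, end) when a dose precedes it and another follows
in_window <- period >= 1 & period < n_doses

# Earliest date within each (ID, period) pair, i.e. the dose that opened it
key <- paste(drug_study$ID, period)
start_num <- ave(as.numeric(drug_study$date), key, FUN = min)

drug_study$study_id <- as.Date(start_num, origin = "1970-01-01")
drug_study$study_id[!in_window] <- NA

# Days elapsed since the start of the current window
drug_study$ddays <- as.numeric(drug_study$date - drug_study$study_id)
print(drug_study)
```

Because nothing here refers to a particular year or month, the same code covers individuals whose windows fall in 2007-2008, 2009-2010 or any other pair of years.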


Regards
---
Kevin Wame 



On 7/3/16, 7:05 PM, "Jeff Newmiller"  wrote:

result <- setNames(
  data.frame( aggregate( date~ID, data=drug_study, FUN=min ),
              aggregate( date~ID, data=drug_study, FUN=max )[2] ),
  c( "ID", "start", "end" ) )



Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-03 Thread Kevin Wamae
Hi Jeff, pardon me, I was surely not making it easy. I hope this time I will ☺

Attached is a snippet of the dataset in CSV format, and below is the R script I 
have managed so far.

---
---

drug_study <- read.csv("drug_study.csv", header = T); head(drug_study)
drug_study$date <- as.Date(drug_study$date, "%m/%d/%Y")
drug_study$study_id <- ""  #create new column

individual <- unique (drug_study$ID)  #vector of individuals
datalength <- dim(drug_study)[1]  #number of rows in dataframe

for (i in 1:length(individual)) {
  for (j in 1:datalength) {
    #capture date of start
    start_admin <- drug_study[c(drug_study$ID == individual[i] &
                                drug_study$year == 2007 &
                                drug_study$drug_admin == "Y" &
                                drug_study$month == 5),2]
    #capture date of end
    end_admin <- drug_study[(drug_study$ID == individual[i] &
                             drug_study$year == 2008 &
                             drug_study$drug_admin == "Y" &
                             drug_study$month == 2),2]

    #populate respective row if condition is met
    if(drug_study[j,1] == individual[i] & drug_study[j,2] >= start_admin &
       drug_study[j,2] < end_admin) {
      drug_study[j,6] <- paste(start_admin)
    }
  }
}
~
~

For this dataset, there are three individuals: J1/3, R1/3 and R10/1.

The script works for the last two individuals but fails for J1/3 with the error 
below:

~
~
Error in if (drug_study[j, 1] == individual[i] & drug_study[j, 2] >= 
start_admin &  : 
  argument is of length zero
~
~

I figured it’s because this individual’s start_admin and end_admin dates aren’t 
captured: J1/3 has no “Y” entries in May 2007 or February 2008, so both come 
back empty and the if statement fails. That is my first problem: there are 
thousands of individuals with varying start_admin and end_admin dates, and I 
need a script that captures these for every individual.
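
The "argument is of length zero" error can be reproduced in isolation. This is a small sketch on made-up values (the vectors `dates` and `years` are hypothetical, not taken from the attached file): a subset that matches no rows has length zero, a comparison against it also has length zero, and `if()` refuses a condition of that length.

```r
# Made-up dates for an individual with no records in 2007
dates <- as.Date(c("2009-01-05", "2010-12-25"))
years <- c(2009, 2010)

# The filter matches nothing, so the result is a zero-length Date vector
start_admin <- dates[years == 2007]
n_found <- length(start_admin)

# Comparing against a zero-length vector gives logical(0), which if() rejects
msg <- tryCatch(
  if (dates[1] >= start_admin) "inside" else "outside",
  error = function(e) conditionMessage(e)
)
print(msg)

# One possible guard: only test the condition when a start date was found
status <- if (n_found == 1 && dates[1] >= start_admin) "inside" else "skipped"
print(status)
```

The guard relies on `&&` evaluating left to right and stopping early, so the comparison is never reached when nothing was found.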

Secondly, the above script takes almost an hour to run on the entire dataset, 
and that is just for the individuals whose start_admin and end_admin dates are 
captured by the hard-coded year and month filters.

I need help coming up with a script that tackles the problem, takes the 
different start_admin and end_admin dates into account, and is efficient with 
regard to time.

Regards
---
Kevin Kariuki


On 7/3/16, 8:42 AM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

You are making this hard on yourself by not paying attention to the Posting 
Guide listed in the footer of every email on this list. You would probably 
also find [1] helpful. 

[1] 
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
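
One common way to follow that advice is to paste the output of dput() into the message body, since attachments are often stripped by the list. The sketch below uses made-up rows standing in for the real data; the column names mirror those used elsewhere in this thread, and the round trip at the end is only there to show that the pasted text rebuilds the same data frame.

```r
# Made-up rows standing in for the real dataset
drug_study <- data.frame(
  ID = c("R1/3", "R1/3", "R1/3"),
  date = as.Date(c("2007-05-11", "2007-05-16", "2008-02-04")),
  drug_admin = c("Y", "", "Y"),
  stringsAsFactors = FALSE
)

# Print a plain-text definition that can be pasted into an email body
dput(head(drug_study, 30))

# Round trip: capture the printed text and rebuild the data frame from it
txt <- paste(capture.output(dput(drug_study)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
```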
-- 
Sent from my phone. Please excuse my brevity.


Re: [R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-02 Thread Kevin Wamae
Hi Jeff, sorry for referring to you as Jennifer earlier, accept my apologies.

I attached a sample dataset to the question; I'm afraid it must have failed to 
attach.

I have attached it again.


Regards
---
Kevin Kariuki
 

On 7/2/16, 7:37 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:

I can understand you not wanting to supply your actual data online, but only 
you know what your data looks like so only you can create a simulated data set 
that we could show you how to work with. 
-- 
Sent from my phone. Please excuse my brevity.


[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

2016-07-02 Thread Kevin Wamae
I have a drug-trial study dataset (attached image).

Since it's a large and complex dataset (at least to me), I hope to be as 
clear as possible with my question.
The dataset is from a study where individuals are given drugs and followed up 
over a period spanning two consecutive years. Individuals do not start 
treatment on the same day; the variable "drug_admin" is marked "x" on the day 
they start and again on the day they stop treatment in the following year.
There exists another variable, "study_id", that I hope to populate as can be 
seen in the dataset, with the following conditions:

For every individual
• if the individual has entries that show they received drugs on both the 
start and end date (marked with the "x")
• if the start of drug administration falls in month == 2 | 3 and the end of 
administration falls in month == 2 | 4
• then, using the date that marks the start of drug administration, populate 
the variable "study_id" in all the rows that fall within the timeframe that 
the individual was given drugs, excluding the end of drug administration.
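
A side note on the month conditions, as a hedged sketch rather than part of the original question: typed literally into R, a test such as month == 2 | 3 does not mean "month is 2 or 3". The comparison binds tighter than `|`, and the bare 3 is coerced to TRUE, so every row passes. The vector `month` below is made up for illustration.

```r
month <- c(1, 2, 3, 4)

# Parsed as (month == 2) | 3; the bare 3 is coerced to TRUE
literal <- month == 2 | 3

# Spelled-out comparison, one test per value
spelled <- month == 2 | month == 3

# Set membership, the usual idiom for "is one of these values"
member <- month %in% c(2, 3)

print(rbind(literal, spelled, member))
```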
I have tried my level best and, while I have explored several examples online, 
I haven't managed to solve this. The dataset contains close to 6000 
individuals spanning 10 years, and my best bet was to use a loop, which keeps 
crashing R after running for close to 30 min. I have also read that dplyr may 
do the job, but my attempts have been in vain.

sample code
---
individual <- unique (df$ID)  #vector of individuals
datalength <- dim(df)[1]  #number of rows in dataframe

for (i in 1:length(individual)) {
  for (j in 1:datalength) {
    #capture date of start
    start_admin <- df[df$year == 2007 & df$drug_admin == "x" &
                      (df$month == 2 | df$month == 3),2]
    #capture date of end
    end_admin <- df[df$year == 2008 & df$drug_admin == "x" &
                    (df$month == 2 | df$month == 4),2]

    #populate respective row if condition is met
    if(df[j,1] == individual[i] & df[j,2] >= start_admin &
       df[j,2] < end_admin) {
      df[j,6] <- start_admin
    }
  }
}

---

Above is the code that keeps failing.

Any help is highly appreciated.

