[R] Complex text parsing task
Hello Everyone,

I have what I think is a complex text parsing task. I've provided some sample data below. There's a relatively simple version of the coding that needs to be done and a more complex version. If someone could help me out with either version, I'd greatly appreciate it.

Here are my sample data.

haveData <- structure(list(
  profile_key = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L),
    .Label = c("001-001", "001-002", "001-003", "001-004", "001-005", "001-006", "001-007"),
    class = "factor"),
  encounter_date = structure(c(9L, 10L, 11L, 12L, 13L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L),
    .Label = c("2009-03-01", "2009-03-22", "2009-04-01", "2010-03-01", "2010-10-15",
      "2010-11-15", "2011-03-01", "2011-03-14", "2011-10-10", "2011-10-24",
      "2012-09-15", "2012-10-05", "2012-10-17"),
    class = "factor"),
  raw = structure(c(9L, 12L, 16L, 13L, 10L, 7L, 6L, 3L, 2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L),
    .Label = c(
      "... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ...",
      "... KRAS (mutated). Therefore did not prescribe Erbitux. ...",
      "... KRAS (mutated). Will not prescribe Erbitux due to mutation. ...",
      "... KRAS (Wild). ...",
      "... KRAS results are in. Patient has the mutation. ...",
      "... KRAS results still pending. Note that patient was negative for Lynch mutation. ...",
      "... KRAS test results pending. Note that patient was negative for Lynch mutation. ...",
      "... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...",
      "... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...",
      "... Ordered KRAS testing. Waiting for results. ...",
      "... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...",
      "... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...",
      "... Still need to order KRAS mutation testing. ...",
      "... Tumor is negative for KRAS mutation. ...",
      "... Tumor is wild type. Patient is eligible to receive Eribtux. ...",
      "... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."),
    class = "factor")),
  .Names = c("profile_key", "encounter_date", "raw"),
  row.names = c(NA, -16L), class = "data.frame")

The following code displays the results of so-called simple coding.

# Simple coding
KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005", "001-006", "001-007")
KRAStested <- c(2,3,2,2,2,3,3)
KRASwild <- c(1,0,2,0,3,1,3)
KRASmutant <- c(4,2,2,3,1,2,2)
simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
simpleData

Here, KRAStested is calculated by summing all references to "KRAS" for each patient. Wild is calculated by summing all references to "wild type", "wild", and "negative" that come within 20 words of the closest reference to KRAS. Mutant is calculated by summing all references to "mutant", "mutated", and "positive" that occur within 20 words of the closest reference to KRAS.

The second kind of coding is what I'm referring to as complex coding. The following code displays the results of this type of coding.

# Complex coding
KRAStested <- c(2,1,0,2,2,2,3)
KRASwild <- c(1,0,0,0,3,0,3)
KRASmutant <- c(0,0,0,3,0,1,0)
complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
complexData

The results of complex coding differ substantially from those obtained under simple coding and, I think, illustrate the potential problems with that approach. With complex coding, the goal would be to identify and sum only true references to KRAS testing and true references to the result of that testing (either wild type/negative or mutant/positive).

True references to KRAS testing would be identified using a set of qualifiers that eliminate the false references.
So, for example, one of the patients in my (made up) sample data has the phrase "Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux" in their medical record. In this case, "Will" is a qualifier that indicates this is not a true reference to KRAS testing. For this exercise, other qualifiers related to KRAS testing would include "need", "order" (but not the past tense "ordered"), "wait", "waiting", "await", and "awaiting". To be a qualifier, these terms would need to occur within 12 words of the closest true reference to KRAS.

True references to the results of testing would also be identified using a set of qualifiers that eliminate false references. Here the list of qualifiers would include "if", "lynch", "kras mutation test", "kras mutation testing" and "for kras mutation". Qualifiers would need to come within 12 words of a true reference to KRAS testing.

There's an additional wrinkle for identifying true references to the results of testing. One also needs
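
As a starting point for the simple coding described above, here is a minimal sketch in base R. It is only one possible reading of the rule, not a definitive implementation: the helper names tokenize(), count_near_kras() and simple_coding() are hypothetical, "wild type" is counted through the single token "wild", and matching is exact, so misspelled terms in the notes are not counted and the totals will not necessarily agree with the simpleData table in the post.

```r
# Hypothetical helper: split one note into lower-case word tokens.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  words[nzchar(words)]  # drop empty strings left by leading separators
}

# Count tokens from `terms` that lie within `window` words of the
# nearest "kras" token. Exact matching only (an assumption).
count_near_kras <- function(text, terms, window = 20) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  term_pos <- which(words %in% terms)
  if (length(kras_pos) == 0 || length(term_pos) == 0) return(0L)
  nearest <- vapply(term_pos, function(p) min(abs(p - kras_pos)), numeric(1))
  sum(nearest <= window)
}

# Score every note, then sum the scores within each patient.
# Assumes columns named profile_key and raw, as in haveData.
simple_coding <- function(data) {
  notes <- as.character(data$raw)
  per_note <- data.frame(
    KRASpatient = as.character(data$profile_key),
    KRAStested = vapply(notes, function(x) sum(tokenize(x) == "kras"),
                        numeric(1), USE.NAMES = FALSE),
    KRASwild = vapply(notes, count_near_kras, numeric(1),
                      terms = c("wild", "negative"), USE.NAMES = FALSE),
    KRASmutant = vapply(notes, count_near_kras, numeric(1),
                        terms = c("mutant", "mutated", "positive"),
                        USE.NAMES = FALSE)
  )
  aggregate(cbind(KRAStested, KRASwild, KRASmutant) ~ KRASpatient,
            data = per_note, FUN = sum)
}

# Tiny made-up example (not the data from the post).
demo <- data.frame(
  profile_key = c("A", "A", "B"),
  raw = c("KRAS (Wild).",
          "Patient is KRAS negative.",
          "KRAS (mutated). Lynch negative.")
)
res <- simple_coding(demo)
print(res)
```

In the made-up example, patient "B" gets a wild count from "negative" even though that word refers to Lynch rather than KRAS. That is the kind of false reference the qualifier lists in the complex coding are meant to exclude; a qualifier pass would need an additional step that this sketch does not attempt.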
Re: [R] Complex text parsing task
Hi Nick,

Can you elaborate (hopefully in a constructive way) on what it is that you find objectionable about my post?

Thanks,

Paul

--- On Mon, 5/21/12, Nick Gayeski n...@wildfishconservancy.org wrote:

From: Nick Gayeski n...@wildfishconservancy.org
Subject: RE: [R] Complex text parsing task
To: 'Paul Miller' pjmiller...@yahoo.com, r-help@r-project.org
Received: Monday, May 21, 2012, 10:36 AM

Please stop sending these emails!

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Paul Miller
Sent: Monday, May 21, 2012 8:32 AM
To: r-help@r-project.org
Subject: [R] Complex text parsing task
Re: [R] Complex text parsing task
Hi Paul,

I do not think that Nick's comment was really meant to be directed at you. He is probably just tired of getting so many emails from R-help.

Nick, to stop getting emails if you no longer want them, try following the link at the bottom of every single email you have received from R-help; you can unsubscribe yourself from there if you want. If you like R-help but just do not like the quantity of emails, you could consider switching your subscription to a daily digest so you just get one email. Alternatively, you could create a special folder in your email for R-help messages, and create a filter that automatically sends all messages from R-help to that special folder, so you still have them all but they do not clutter up your inbox.

Cheers,

Josh

On Mon, May 21, 2012 at 8:53 AM, Paul Miller pjmiller...@yahoo.com wrote:
Re: [R] Complex text parsing task
Hi Josh,

Thanks for pointing this out. It hadn't occurred to me that someone might post something like this to indicate they would like to receive fewer or no messages.

Paul

--- On Mon, 5/21/12, Joshua Wiley jwiley.ps...@gmail.com wrote:

From: Joshua Wiley jwiley.ps...@gmail.com
Subject: Re: [R] Complex text parsing task
To: Paul Miller pjmiller...@yahoo.com
Cc: Nick Gayeski n...@wildfishconservancy.org, r-help@r-project.org
Received: Monday, May 21, 2012, 11:01 AM