[R] Complex text parsing task

2012-05-21 Thread Paul Miller
Hello Everyone,

I have what I think is a complex text parsing task. I've provided some sample 
data below. There's a relatively simple version of the coding that needs to be 
done and a more complex version. If someone could help me out with either 
version, I'd greatly appreciate it.

Here are my sample data.

haveData <- 
structure(list(profile_key = structure(c(1L, 1L, 2L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L), .Label = c("001-001 ", 
"001-002 ", "001-003 ", "001-004 ", "001-005 ", "001-006 ", "001-007 "
), class = "factor"), encounter_date = structure(c(9L, 10L, 11L, 
12L, 13L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L), .Label = c(
" 2009-03-01 ", " 2009-03-22 ", " 2009-04-01 ", " 2010-03-01 ", 
" 2010-10-15 ", " 2010-11-15 ", " 2011-03-01 ", " 2011-03-14 ", 
" 2011-10-10 ", " 2011-10-24 ", " 2012-09-15 ", " 2012-10-05 ", 
" 2012-10-17 "
), class = "factor"), raw = structure(c(9L, 12L, 16L, 13L, 10L, 
7L, 6L, 3L, 2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L), .Label = c(
" ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ", 
" ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ", 
" ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ", 
" ... KRAS (Wild). ...", 
" ... KRAS results are in. Patient has the mutation. ... ", 
" ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...", 
" ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...", 
" ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...", 
" ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...", 
" ... Ordered KRAS testing. Waiting for results. ...", 
" ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...", 
" ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...", 
" ... Still need to order KRAS mutation testing. ... ", 
" ... Tumor is negative for KRAS mutation. ...", 
" ... Tumor is wild type. Patient is eligible to receive Eribtux. ...", 
" ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."
), class = "factor")), .Names = c("profile_key", "encounter_date", 
"raw"), row.names = c(NA, -16L), class = "data.frame")

The following code displays the results of so-called simple coding.

 Simple coding 

KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005", 
"001-006", "001-007")
KRAStested <- c(2,3,2,2,2,3,3)
KRASwild <- c(1,0,2,0,3,1,3)
KRASmutant <- c(4,2,2,3,1,2,2)
simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant) 
simpleData

Here, KRAStested is calculated by summing all references to KRAS for each 
patient. Wild is calculated by summing all references to "wild type", "wild", 
and "negative" that come within 20 words of the closest reference to KRAS. 
Mutant is calculated by summing all references to "mutant", "mutated", and 
"positive" that occur within 20 words of the closest reference to KRAS.
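
A minimal sketch of how the simple counting rule could be written in base R. 
It rests on several assumptions that are not stated in the post: notes are 
split into words on anything that is not a letter or digit, the 20-word window 
is measured within a single note, matching is exact (so misspellings such as 
"positve" in the sample data would be missed), and "wild type" is counted 
through its first word. The names tokenize, count_kras, count_near_kras and 
the toy notes data frame are hypothetical.

```r
# Split a note into lower-case word tokens (punctuation is dropped).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  words[nzchar(words)]
}

# Number of "kras" tokens in one note.
count_kras <- function(text) sum(tokenize(text) == "kras")

# Number of tokens from `terms` that lie within `window` words of the
# closest "kras" token in the same note.
count_near_kras <- function(text, terms, window = 20) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  term_pos <- which(words %in% terms)
  if (length(kras_pos) == 0 || length(term_pos) == 0) return(0L)
  # Distance from each matching term to its nearest KRAS mention.
  nearest <- vapply(term_pos, function(p) min(abs(kras_pos - p)), numeric(1))
  sum(nearest <= window)
}

wild_terms   <- c("wild", "negative")   # "wild type" is caught via "wild"
mutant_terms <- c("mutant", "mutated", "positive")

# Hypothetical toy notes, not the sample data from the post.
notes <- data.frame(
  profile_key = c("A", "A", "B"),
  raw = c("Ordered KRAS testing. Results came back negative.",
          "Tumor is wild type.",
          "KRAS (mutated). Therefore did not prescribe Erbitux."),
  stringsAsFactors = FALSE
)

# Count per note first, then sum the counts per patient.
per_note <- data.frame(
  profile_key = notes$profile_key,
  KRAStested  = vapply(notes$raw, count_kras, numeric(1), USE.NAMES = FALSE),
  KRASwild    = vapply(notes$raw, count_near_kras, numeric(1),
                       terms = wild_terms, USE.NAMES = FALSE),
  KRASmutant  = vapply(notes$raw, count_near_kras, numeric(1),
                       terms = mutant_terms, USE.NAMES = FALSE)
)

simple_counts <- aggregate(cbind(KRAStested, KRASwild, KRASmutant) ~ profile_key,
                           data = per_note, FUN = sum)
simple_counts
```

To try this on the sample data, the raw column would first need 
as.character(), since it is stored as a factor.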

The second kind of coding is what I'm referring to as complex coding.  The 
following code displays the results of this type of coding.

 Complex coding 

KRAStested <- c(2,1,0,2,2,2,3)
KRASwild <- c(1,0,0,0,3,0,3)
KRASmutant <- c(0,0,0,3,0,1,0)
complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant) 
complexData

The results of complex coding differ substantially from those obtained under 
simple coding and I think illustrate the potential problems with that 
approach. With complex coding, the goal would be to identify and sum only 
true references to KRAS testing and true references to the result of that 
testing (either wild type/negative or mutant/positive).

True references to KRAS testing would be identified using a set of qualifiers 
that eliminate the false references. So, for example, one of the patients in my 
(made up) sample data has the phrase "Will conduct KRAS mutation testing prior 
to initiation of therapy with Erbitux" in their medical record. In this case, 
"Will" is a qualifier that indicates this is not a true reference to KRAS 
testing. For this exercise, other qualifiers related to KRAS testing would 
include "need", "order" (but not the past tense "ordered"), "wait", "waiting", 
"await", and "awaiting". To be a qualifier, these terms would need to occur 
within 12 words of the closest true reference to KRAS.
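
A rough sketch of one literal reading of the qualifier rule, in base R. The 
assumptions here go beyond what the post states: the 12-word window is taken 
to extend on both sides of each KRAS mention, words are compared as whole 
lower-case tokens, and each note is handled on its own. The names tokenize, 
testing_qualifiers and count_true_kras are hypothetical.

```r
# Split a note into lower-case word tokens (punctuation is dropped).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  words[nzchar(words)]
}

# Qualifier terms listed in the post. Whole tokens are compared, so
# "order" does not match the past tense "ordered".
testing_qualifiers <- c("will", "need", "order", "wait", "waiting",
                        "await", "awaiting")

# Count KRAS mentions that have no qualifier within `window` words.
count_true_kras <- function(text, qualifiers = testing_qualifiers, window = 12) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  qual_pos <- which(words %in% qualifiers)
  if (length(kras_pos) == 0) return(0L)
  if (length(qual_pos) == 0) return(length(kras_pos))
  # TRUE for each KRAS mention that has a qualifier close by.
  qualified <- vapply(kras_pos,
                      function(p) any(abs(qual_pos - p) <= window),
                      logical(1))
  sum(!qualified)
}

count_true_kras("... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ...")
count_true_kras("... Still need to order KRAS mutation testing. ...")
count_true_kras("... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...")
```

The first two calls return 0 and the third returns 1. Read this literally, a 
"Will" in a following sentence (as in "Patient KRAS mutation test is negative. 
Will start Erbitux.") also falls inside the window, so this sketch is unlikely 
to reproduce the complexData counts without further rules.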

True references to the results of testing would also be identified using a set 
of qualifiers that eliminate false references. Here the list of qualifiers 
would include "if", "lynch", "kras mutation test", "kras mutation testing" and 
"for kras mutation". Qualifiers would need to come within 12 words of a true 
reference to KRAS testing.

There's an additional wrinkle for identifying true references to the results of 
testing. One also needs 

Re: [R] Complex text parsing task

2012-05-21 Thread Paul Miller
Hi Nick,

Can you elaborate (hopefully in a constructive way) on what it is that you find 
objectionable about my post?

Thanks,

Paul

--- On Mon, 5/21/12, Nick Gayeski n...@wildfishconservancy.org wrote:

 From: Nick Gayeski n...@wildfishconservancy.org
 Subject: RE: [R] Complex text parsing task
 To: 'Paul Miller' pjmiller...@yahoo.com, r-help@r-project.org
 Received: Monday, May 21, 2012, 10:36 AM
 Please stop sending these emails!
 
 

Re: [R] Complex text parsing task

2012-05-21 Thread Joshua Wiley
Hi Paul,

I do not think that Nick's comment was really meant to be directed at
you.  He is probably just tired of getting so many emails from R-help.

Nick, to stop getting emails if you no longer want them, try following
the link at the bottom of every single email you have received from
R-help...you can unsubscribe yourself from there if you want.  If you
like R-help but just do not like the quantity of emails, you could
consider switching your subscription to a daily digest so you just get
one email.  Alternately, you could create a special folder in your
email for R-help messages, and create a filter that automatically
sends all messages from R-help to that special folder so you still have
them all but they do not clutter up your inbox.

Cheers,

Josh


Re: [R] Complex text parsing task

2012-05-21 Thread Paul Miller
Hi Josh,

Thanks for pointing this out. It hadn't occurred to me that someone might post 
something like this to indicate they would like to receive fewer or no 
messages. 

Paul 
