Re: Concept annotation questions and keep JCas results in a file

Pei Chen Sat, 07 Sep 2013 08:38:36 -0700

Samir,
XCAS will eventually be deprecated and replaced by the preferred, more compact
XMI format, as this note from the source of the XCAS-writing CAS consumer explains:


/*
 *******************************************************************************************
 * N O T E :     The XML format (XCAS) that this Cas Consumer outputs, is eventually
 *               being superseded by the more standardized and compact XMI format.  However
 *               it is used currently as the expected form for remote services, and there is
 *               existing tooling for doing stand-alone component development and debugging
 *               that uses this format to populate an initial CAS.  So it is not
 *               deprecated yet;  it is also being kept for compatibility with older versions.
 *
 *               New code should consider using the XmiWriterCasConsumer where possible,
 *               which uses the current XMI format for XML externalizations of the CAS
 *******************************************************************************************
 */



On Fri, Sep 6, 2013 at 11:34 PM, samir chabou <[email protected]> wrote:

> Hi Richard,
> I had a look at these methods, and they will let me implement my
> requirement. Do you know whether there is a preference for using
> readXCas/writeXCas rather than readXmi/writeXmi, or is it just a matter of
> offering different options for reading and writing different file formats?
> Thanks
> Samir
>
>
>   ------------------------------
>  *From:* Richard Eckart de Castilho <[email protected]>
> *To:* [email protected]; samir chabou <[email protected]>
> *Sent:* Friday, September 6, 2013 3:29:19 AM
> *Subject:* Re: Concept annotation questions and keep JCas results in a
> file
>
> Hi,
>
> you might want to take a look at convenience methods in the recently
> released Apache uimaFIT 2.0.0:
>
> CasIOUtil
>   readXCas(JCas, File)
>   readXmi(JCas, File)
>   writeXCas(JCas, File)
>   writeXmi(JCas, File)
>
> Cheers,
>
> -- Richard
>
> On 06.09.2013, at 06:28, samir chabou <[email protected]> wrote:
>
> > Hi Tim, Pei and James
> > 1) I tried List l = JCasUtil.selectCovered(jcas, BaseToken.class, i); it
> meets my requirement perfectly, thanks Tim.
> > 2) Now I need to run a medical question through the clinical pipeline
> and keep the JCas result in a file or some other persistent store, because
> I need to use it later in my processing. Is it possible to do this, and is
> it possible to reload this JCas later in my processing?
> >
> > Thanks
> > Samir
> > From: samir chabou <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Thursday, August 29, 2013 2:51:12 PM
> > Subject: Re: Concept annotation questions
> >
> > Thanks Tim,
> > That looks like a better and cleaner way. It means that List l =
> JCasUtil.selectCovered(jcas, BaseToken.class, i) will give me the
> intersection between the BaseTokens and the IdentifiedAnnotations. If my base
> token is in the list, then the base token is also covered by an
> IdentifiedAnnotation. I'll give it a try sometime next week and let you know.
> > Thanks
> > Samir
> >
> >
> > From: Tim Miller <[email protected]>
> > To: [email protected]
> > Sent: Thursday, August 29, 2013 1:07:58 PM
> > Subject: Re: Concept annotation questions
> >
> > Samir,
> > You may be able to use the JCasUtil class from uimaFIT to do something
> like the following:
> >
> > for (IdentifiedAnnotation i : JCasUtil.select(jcas, IdentifiedAnnotation.class)) {
> >    List<BaseToken> l = JCasUtil.selectCovered(jcas, BaseToken.class, i);
> > }
> >
> >
> > (this is Java-ish pseudocode, obviously). Then every token in the list you
> get lies within the span of IdentifiedAnnotation i, so it shares that
> annotation's type. Would that solve your problem?
> > Tim
> >
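The containment rule behind the selectCovered call above can be illustrated without UIMA on the classpath. The following is a minimal sketch in plain Java, assuming annotations are simple begin/end character spans; the Span class and the coveredBy helper are hypothetical stand-ins for illustration, not uimaFIT types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of span containment; not the uimaFIT implementation. */
final class CoveredSketch {

    /** Hypothetical minimal annotation: a character span plus a label. */
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns the candidates that lie entirely inside the covering span. */
    static List<Span> coveredBy(List<Span> candidates, Span covering) {
        List<Span> covered = new ArrayList<>();
        for (Span c : candidates) {
            if (c.begin >= covering.begin && c.end <= covering.end) {
                covered.add(c);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Token offsets for the start of "cancer of colon, lung and liver".
        List<Span> tokens = Arrays.asList(
                new Span(0, 6, "cancer"),
                new Span(7, 9, "of"),
                new Span(10, 15, "colon"),
                new Span(17, 21, "lung"));
        // A concept annotation spanning "cancer of colon".
        Span concept = new Span(0, 15, "Malignant tumor of colon");
        for (Span t : coveredBy(tokens, concept)) {
            System.out.println(t.label + " -> " + concept.label);
        }
    }
}
```

In the real pipeline the tokens and concepts would come from the JCas rather than from hand-built lists.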
> > On 08/29/2013 12:29 PM, samir chabou wrote:
> >> Hi James and Pei,
> >> I also need to know the medical type (symptom, drug, procedure,
> relation) of a given word token. In the type system hierarchy, WordToken
> is not in the same inheritance tree as IdentifiedAnnotation, so I’m
> currently iterating over all WordTokens and comparing the covered text of
> each WordToken to the covered text of each IdentifiedAnnotation. I find
> this a slow process. James, do you think the patch you are planning
> (<<I could create a patch for you that would help with determining
> which words from the text matched a dictionary entry>>) will also
> cover this requirement? Or can you suggest something better than what
> I’m currently doing?
> >>
> >> Thanks
> >> Samir
> >>
> >> From: "Masanz, James J." <[email protected]>
> >> To: "'[email protected]'" <[email protected]>
> >> Sent: Thursday, August 29, 2013 10:18:40 AM
> >> Subject: RE: Concept annotation questions
> >>
> >> Hi Dennis,
> >>
> >> Thanks for explaining why you are interested in finding out which words
> in the original text cause a particular concept to be annotated.  We are
> currently working on getting Apache cTAKES 3.1 out.  Depending on your
> timeline, after that is done, perhaps I could create a patch for you that
> would help with determining which words from the text matched a dictionary
> entry, rather than just the begin offset of the first word and the end
> offset of the last word.
> >>
> >> As for the chunking, the fact that “liver” and “and” are being tagged as
> O-chunks explains why the dictionary lookup component is not finding liver
> cancer or lung cancer in “cancer of colon, liver and lung”.
> >>
> >> I’ll try that sentence with the latest chunker model (which will be in
> cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
> >>
> >> -- James
> >>
> >> From: [email protected] [mailto:
> [email protected]] On Behalf Of
> Dennis Lee Hon Kit
> >> Sent: Wednesday, August 28, 2013 2:33 PM
> >> To: [email protected]
> >> Subject: Re: Concept annotation questions
> >>
> >> Hi James & Pei,
> >>
> >> Thank you for your replies and sorry for my late reply as I have been
> away.
> >>
> >> Q1 – The longest span could work and is one of the options we are
> looking at, but when there are overlaps it can get complicated.  In the
> following example, the longest span would work.  We can start with 01,
> ignore 02 and 03 because their start positions fall before the end position
> of 01, and then continue with 04.  But I don’t think it will always be this
> straightforward, as the begin/end string positions may not always be a good
> indicator of what exactly in the original text was coded.
> >>
> >> 00 Invasive ductal carcinoma of the left breast with bone metastases.
> >> 01 Invasive ductal carcinoma of the left breast
> 408643008|Infiltrating duct carcinoma of breast (disorder)|
> >> 02                                      breast with bone
> 56873002|Bone structure of sternum (body structure)|
> >> 03                                      breast with bone metastases
> 94297009|Secondary malignant neoplasm of female breast (disorder)|
> >> 04                                                  bone metastases
> 94222008|Secondary malignant neoplasm of bone (disorder)|
> >>
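The left-to-right longest-span heuristic described in Q1 can be sketched in a few lines. This is only an illustration of that heuristic, not something cTAKES provides: the Span class and the keepLongestNonOverlapping helper are hypothetical, and the offsets are character positions computed by hand from sentence 00.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of "take the longest span, skip anything that overlaps it". */
final class LongestSpanSketch {

    /** Hypothetical concept span: an id plus begin/end character offsets. */
    static final class Span {
        final String id;
        final int begin;
        final int end;

        Span(String id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }

        int length() {
            return end - begin;
        }
    }

    /** Scans left to right, preferring the longest span at each start position. */
    static List<Span> keepLongestNonOverlapping(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        // Earliest begin first; among equal begins, longest first.
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.length(), a.length()));
        List<Span> kept = new ArrayList<>();
        int lastEnd = Integer.MIN_VALUE;
        for (Span s : sorted) {
            // Keep a span only if it starts at or after the end of the last kept span.
            if (s.begin >= lastEnd) {
                kept.add(s);
                lastEnd = s.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
                new Span("01", 0, 44),   // Invasive ductal carcinoma of the left breast
                new Span("02", 38, 54),  // breast with bone
                new Span("03", 38, 65),  // breast with bone metastases
                new Span("04", 50, 65)); // bone metastases
        for (Span s : keepLongestNonOverlapping(spans)) {
            System.out.println(s.id); // prints 01, then 04
        }
    }
}
```

As the message notes, offsets alone do not say which words actually matched, so this kind of filter is a starting point rather than a full answer.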
> >> Q2 – As we are beginners, we are not at the level where we are
> comfortable with modifying cTAKES or even know where to begin modifying
> cTAKES, but that would be an option in the future.  Going back to the
> example of “cancer of liver” and using the begin/end position of the string
> that was used to identify the concept, the original string would be “cancer
> of colon, lung and liver.”  The CUI that was identified was C0345904, which
> has 209 (137 unique) descriptions for all languages.  Examples of English
> terms include:
> >>     • CA - Liver cancer
> >>     • Cancer of Liver
> >>     • cancer of the liver
> >>     • Cancer, Hepatic
> >>     • CANCER, HEPATOCELLULAR
> >>     • Malignant hepatic neoplasm
> >>     • Malignant liver tumor
> >>     • Malignant liver tumour
> >>     • Malignant neoplasm of liver
> >>     • malignant neoplasm of liver (diagnosis)
> >>     • Malignant neoplasm of liver unspecified
> >>     • Malignant neoplasm of liver unspecified (disorder)
> >>     • Malignant neoplasm of liver, not specified as primary or secondary
> >>     • Malignant neoplasm of liver, NOS
> >>     • Malignant neoplasm of liver, unspecified
> >>     • malignant neosplasm of the liver
> >>     • Malignant tumor of liver
> >>     • Malignant tumor of liver (disorder)
> >>     • Malignant tumour of liver
> >> It would seem suboptimal to go through each of the descriptions to try
> and determine which was the UMLS term that was used in the coding.  It is
> important for us to know which part of the string is matched because
> something like “Invasive ductal carcinoma of the left breast” will be
> matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of
> breast (disorder)|”, but we would like to know that “left” was not matched
> and would like to post-coordinate the expression to indicate the left
> breast, i.e.: 408643008|Infiltrating duct carcinoma of breast
> (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast
> structure (body structure)|.  When there are other qualifiers like
> severity, chronicity and episodicity that may be ignored when matching, we
> would like to capture it at the level of granularity specified in the
> original text.
> >>
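The post-coordinated expression above has the shape focus:attribute=value, with each concept written as id|term|. A small sketch of assembling such a string is shown below; the class and method names are hypothetical, and the code only concatenates text without validating anything against SNOMED CT.

```java
/** Sketch of building a post-coordinated expression string by concatenation. */
final class PostCoordinationSketch {

    /** Formats one concept reference as id|term|. */
    static String concept(String id, String term) {
        return id + "|" + term + "|";
    }

    /** Refines a focus concept with a single attribute=value pair. */
    static String refine(String focus, String attribute, String value) {
        return focus + ":" + attribute + "=" + value;
    }

    public static void main(String[] args) {
        String expression = refine(
                concept("408643008", "Infiltrating duct carcinoma of breast (disorder)"),
                concept("363698007", "Finding site (attribute)"),
                concept("80248007", "Left breast structure (body structure)"));
        System.out.println(expression);
    }
}
```

Choosing the value concept (here the left breast) still requires knowing which words of the original text were left unmatched, which is the gap the message describes.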
> >> In terms of the chunking, here is what I see for “cancer of colon, lung
> and liver”:
> >>     • NP: cancer of colon, lung and liver
> >>     • PP: of
> >>     • NP: colon, lung and liver
> >> For “cancer of colon, liver and lung” here is what I see:
> >>     • NP: cancer of colon,
> >>     • PP: of
> >>     • NP: colon
> >>     • O: liver
> >>     • O: and
> >>     • NP: lung
> >> Q3 – To answer Pei’s question, we are not looking at the preferred name
> from the UMLS, just which term was used.
> >>
> >> Regards,
> >> Dennis
> >>
> >> From: Chen, Pei
> >> Sent: Thursday, August 22, 2013 12:27 PM
> >> To: [email protected]
> >> Subject: RE: Concept annotation questions
> >>
> >> Also,
> >> > 3)… or the exact description that was returned in the UMLS?
> >> I presume you mean to save the preferred name from UMLS?  If so, this
> seems to be a common request- see:
> https://issues.apache.org/jira/browse/CTAKES-224
> >>
> >> --Pei
> >>
> >> From: Masanz, James J. [mailto:[email protected]]
> >> Sent: Thursday, August 22, 2013 3:24 PM
> >> To: '[email protected]'
> >> Subject: RE: Concept annotation questions
> >>
> >>
> >> Welcome to the cTAKES community.
> >>
> >> Q1 – some people use the longest span.
> >> Q2 & Q3 – can you just use the text from the dictionary, “Malignant
> neoplasm of liver (disorder)”?  Alternatively you could modify cTAKES to
> save the text of the words that it matches when it is performing dictionary
> lookup. I would guess there is a term in the UMLS dictionary with the same
> code as Malignant neoplasm of liver (disorder) that just has the words
> “cancer of liver”, but there isn’t anything in cTAKES to give that to you
> just through a configuration change.
> >>
> >> For “cancer of colon, liver and lung”, can you look at the chunk tag
> for liver?  If it’s in a separate noun phrase (NP) from “cancer of colon”,
> that would account for why cancer is not getting tied to liver in that case
> (but wouldn’t account for why the chunker is creating it as a separate noun
> phrase).
> >>
> >> -- James
> >>
> >> From: [email protected] [mailto:
> [email protected]] On Behalf Of
> Dennis Lee Hon Kit
> >> Sent: Wednesday, August 21, 2013 1:10 PM
> >> To: [email protected]
> >> Subject: Concept annotation questions
> >>
> >> Hi Everyone,
> >>
> >> We are new to cTAKES so please bear with our questions.  We are using
> cTAKES to annotate things like encounter diagnoses and referral notes and
> are especially interested in the SNOMED CT encodings.  But we are not
> sure how to make sense of all the outputs.
> >>
> >> Example #1
> >>
> >> In the example below, “cancer of colon, lung and liver” has been
> encoded with SNOMED CT and additional concepts that do not apply have been
> removed (e.g., general “cancer” concept, lung, colon and liver structures,
> etc).  They have been plotted out by the begin/end positions.  If the terms
> do not align, it’s probably because the email only accepts plain text and
> a mono-spaced font is not the default.
> >>
> >> cancer of colon, lung and liver
> >> cancer of colon, lung and liver  93870000|Malignant neoplasm of liver
> (disorder)|
> >> cancer of colon, lung            363358000|Malignant tumor of lung
> (disorder)|
> >> cancer of colon                  363406005|Malignant tumor of colon
> (disorder)|
> >>
> >> Question (1) – We had to do quite a bit of post-processing to remove
> inactive concepts, subtype concepts, concepts that are part of the defining
> attributes, etc.  Is there a set of guidelines to help sort out the CUI or
> SNOMED CT codes that have been identified?
> >> Question (2) – How can we determine that “93870000|Malignant neoplasm
> of liver (disorder)|” refers to “cancer of liver” as opposed to using the
> begin/end string, which points to “cancer of colon, lung and liver”?
> Certainly we can try to do additional parsing but there are a lot of
> different scenarios to take into account.
> >> Question (3) – This relates to question 2: are we able to identify the
> original terms that were used for the concept matching or the exact
> description that was returned in the UMLS?  While the CUI is helpful, the
> CUI can refer to tens or even hundreds of descriptions.
> >>
> >> Example #2
> >>
> >> Switching the position of colon, lung and liver can result in different
> encodings.  Once again, after removing additional concepts not needed
> (i.e., “cancer” and “colon structure”), we get the following.  What
> happened to liver and lung cancer?
> >>
> >> cancer of colon, liver and lung
> >> cancer of colon                  363406005|Malignant tumor of colon
> (disorder)|
> >>                            lung  39607008|Lung structure (body
> structure)|
> >>
> >> We have more questions but will start with these.  Thank you in advance.
> >>
> >> Regards,
> >> Dennis
> >>
> >>
> >
> >
> >
> >
> >
>
>
>
