Samir, XCAS will eventually be deprecated and replaced by the preferred, more compact XMI format:
/* *******************************************************************************************
 * N O T E : The XML format (XCAS) that this Cas Consumer outputs, is eventually
 * being superseded by the more standardized and compact XMI format. However
 * it is used currently as the expected form for remote services, and there is
 * existing tooling for doing stand-alone component development and debugging
 * that uses this format to populate an initial CAS. So it is not
 * deprecated yet; it is also being kept for compatibility with older versions.
 *
 * New code should consider using the XmiWriterCasConsumer where possible,
 * which uses the current XMI format for XML externalizations of the CAS
 ******************************************************************************************* */

On Fri, Sep 6, 2013 at 11:34 PM, samir chabou <[email protected]> wrote:

> Hi Richard,
> I had a look at these methods; they allow me to implement my requirement. Do you know whether there is a preference for using readXCas/writeXCas rather than readXmi/writeXmi, or is it just a matter of having different possibilities to read/write from/to different file formats?
> Thanks
> Samir
>
>
> ------------------------------
> *From:* Richard Eckart de Castilho <[email protected]>
> *To:* [email protected]; samir chabou <[email protected]>
> *Sent:* Friday, September 6, 2013 3:29:19 AM
> *Subject:* Re: Concept annotation questions and keep JCas results in a file
>
> Hi,
>
> you might want to take a look at convenience methods in the recently released Apache uimaFIT 2.0.0:
>
> CasIOUtil
>     readXCas(JCas, File)
>     readXmi(JCas, File)
>     writeXCas(JCas, File)
>     writeXmi(JCas, File)
>
> Cheers,
>
> -- Richard
>
> On 06.09.2013, at 06:28, samir chabou <[email protected]> wrote:
>
> > Hi Tim, Pei and James
> > 1) I tried List l = JCasUtil.selectCovered(jcas, BaseToken.class, i); it answers my requirement perfectly, thanks Tim.
> > 2) Now I need to run NLP on a medical question using the clinical pipeline, and I need to keep the JCas result in a file or some other persistent form because I need to use it later in my processing. Is it possible to do this, and is it possible to reload this JCas later in my processing?
> >
> > Thanks
> > Samir
> >
> > From: samir chabou <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Thursday, August 29, 2013 2:51:12 PM
> > Subject: Re: Concept annotation questions
> >
> > Thanks Tim,
> > it looks like a better and cleaner way. It means that List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) will give me the intersection between the BaseTokens and IdentifiedAnnotations. If my base token is in the list, then the base token is also an IdentifiedAnnotation. I'll give it a try some time next week and let you know.
> > Thanks
> > Samir
> >
> >
> > From: Tim Miller <[email protected]>
> > To: [email protected]
> > Sent: Thursday, August 29, 2013 1:07:58 PM
> > Subject: Re: Concept annotation questions
> >
> > Samir,
> > You may be able to use the JCasUtil class from uimaFIT to do something like the following:
> >
> > for each IdentifiedAnnotation i:
> >     List l = JCasUtil.selectCovered(jcas, BaseToken.class, i)
> >
> > (this is java-ish pseudocode obviously). Then the list you get of tokens will all have the same type as the IdentifiedAnnotation i. Would that solve your problem?
> > Tim
> >
> > On 08/29/2013 12:29 PM, samir chabou wrote:
> >> Hi James and Pei,
> >> I also need to know the medical type (Symptom, Drug, procedure, relation) of a given word token, since in the type system hierarchy wordToken is not under the same inheritance tree as identifiedAnnotation. I’m currently iterating over all wordTokens and comparing each wordToken.CoveredText to the annotation's CoveredText in the identifiedAnnotation. I found this a long process.
> >> James, do you think the patch <<I could create a patch for you that would help with determining which words from the text matched a dictionary entry>> that you are planning to create will also cover this requirement? Or can you suggest something better than what I’m currently doing?
> >>
> >> Thanks
> >> Samir
> >>
> >> From: "Masanz, James J." <[email protected]>
> >> To: "'[email protected]'" <[email protected]>
> >> Sent: Thursday, August 29, 2013 10:18:40 AM
> >> Subject: RE: Concept annotation questions
> >>
> >> Hi Dennis,
> >>
> >> Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
> >>
> >> As far as the chunking, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
> >>
> >> I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
> >>
> >> -- James
> >>
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Dennis Lee Hon Kit
> >> Sent: Wednesday, August 28, 2013 2:33 PM
> >> To: [email protected]
> >> Subject: Re: Concept annotation questions
> >>
> >> Hi James & Pei,
> >>
> >> Thank you for your replies and sorry for my late reply as I have been away.
> >>
> >> Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest would work.
> >> We can start with 01, ignore 02 and 03 because their start positions overlap the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
> >>
> >> 00 Invasive ductal carcinoma of the left breast with bone metastases.
> >> 01 Invasive ductal carcinoma of the left breast   408643008|Infiltrating duct carcinoma of breast (disorder)|
> >> 02 breast with bone                               56873002|Bone structure of sternum (body structure)|
> >> 03 breast with bone metastases                    94297009|Secondary malignant neoplasm of female breast (disorder)|
> >> 04 bone metastases                                94222008|Secondary malignant neoplasm of bone (disorder)|
> >>
> >> Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes, but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages.
> >> Examples of English terms include:
> >> • CA - Liver cancer
> >> • Cancer of Liver
> >> • cancer of the liver
> >> • Cancer, Hepatic
> >> • CANCER, HEPATOCELLULAR
> >> • Malignant hepatic neoplasm
> >> • Malignant liver tumor
> >> • Malignant liver tumour
> >> • Malignant neoplasm of liver
> >> • malignant neoplasm of liver (diagnosis)
> >> • Malignant neoplasm of liver unspecified
> >> • Malignant neoplasm of liver unspecified (disorder)
> >> • Malignant neoplasm of liver, not specified as primary or secondary
> >> • Malignant neoplasm of liver, NOS
> >> • Malignant neoplasm of liver, unspecified
> >> • malignant neosplasm of the liver
> >> • Malignant tumor of liver
> >> • Malignant tumor of liver (disorder)
> >> • Malignant tumour of liver
> >> It would seem suboptimal to go through each of the descriptions to try and determine which was the UMLS term that was used in the coding. It is important for us to know which part of the string is matched because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched and would like to post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers like severity, chronicity and episodicity that may be ignored when matching, we would like to capture it at the level of granularity specified in the original text.
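
Until cTAKES itself reports the matched words, one rough workaround for the “left” case is to compare the covered text with candidate descriptions of the concept and report the words that none of the best-fitting description accounts for. The following is only a sketch in plain Java with no cTAKES or UIMA calls: the class and method names are made up for illustration, the second candidate description and the stop-word list are assumptions rather than data pulled from the UMLS, and exact word comparison will miss variants such as “duct” versus “ductal”.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of cTAKES or uimaFIT.
class UnmatchedWordsSketch {

    // Lower-case the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Pick the candidate description that shares the most distinct words
    // with the covered text; the first candidate wins a tie.
    static String bestCandidate(String coveredText, List<String> candidates) {
        Set<String> covered = new LinkedHashSet<>(tokenize(coveredText));
        String best = null;
        int bestOverlap = -1;
        for (String candidate : candidates) {
            Set<String> shared = new LinkedHashSet<>(tokenize(candidate));
            shared.retainAll(covered);
            if (shared.size() > bestOverlap) {
                bestOverlap = shared.size();
                best = candidate;
            }
        }
        return best;
    }

    // Words of the covered text that are neither in the description nor stop words.
    static List<String> unmatchedWords(String coveredText, String description,
                                       Set<String> stopWords) {
        Set<String> matched = new LinkedHashSet<>(tokenize(description));
        List<String> unmatched = new ArrayList<>();
        for (String word : tokenize(coveredText)) {
            if (!matched.contains(word) && !stopWords.contains(word)) {
                unmatched.add(word);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        String covered = "Invasive ductal carcinoma of the left breast";
        // Assumed candidate descriptions for illustration only.
        List<String> candidates = Arrays.asList(
                "Infiltrating duct carcinoma of breast",
                "Invasive ductal carcinoma of breast");
        // Assumed stop-word list for illustration only.
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("the", "of", "a", "an"));
        String best = bestCandidate(covered, candidates);
        System.out.println(best + " -> " + unmatchedWords(covered, best, stopWords));
    }
}
```

With these assumed inputs the second candidate is chosen and the only leftover word is “left”, which is the qualifier that would then be post-coordinated. In practice the candidate list would have to come from the descriptions attached to the CUI, which is exactly the lookup described as suboptimal above, so this is a stopgap rather than a substitute for the patch James mentions.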
> >>
> >> In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
> >> • NP: cancer of colon, lung and liver
> >> • PP: of
> >> • NP: colon, lung and liver
> >> For “cancer of colon, liver and lung” here is what I see:
> >> • NP: cancer of colon,
> >> • PP: of
> >> • NP: colon
> >> • O: liver
> >> • O: and
> >> • NP: lung
> >> Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
> >>
> >> Regards,
> >> Dennis
> >>
> >> From: Chen, Pei
> >> Sent: Thursday, August 22, 2013 12:27 PM
> >> To: [email protected]
> >> Subject: RE: Concept annotation questions
> >>
> >> Also,
> >> > 3)… or the exact description that was returned in the UMLS?
> >> I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224
> >>
> >> --Pei
> >>
> >> From: Masanz, James J. [mailto:[email protected]]
> >> Sent: Thursday, August 22, 2013 3:24 PM
> >> To: '[email protected]'
> >> Subject: RE: Concept annotation questions
> >>
> >>
> >> Welcome to the cTAKES community.
> >>
> >> Q1 – some people use the longest span.
> >> Q2 & Q3 – can you just use the text from the dictionary “Malignant neoplasm of liver (disorder)”. Alternatively you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
> >>
> >> For “cancer of colon, liver and lung”, can you look at the chunk tag for liver.
> >> If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
> >>
> >> -- James
> >>
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Dennis Lee Hon Kit
> >> Sent: Wednesday, August 21, 2013 1:10 PM
> >> To: [email protected]
> >> Subject: Concept annotation questions
> >>
> >> Hi Everyone,
> >>
> >> We are new to cTakes so please bear with our questions. We are using cTakes to annotate things like encounter diagnoses and referral notes and are especially interested in the SNOMED CT encodings. But we are not sure how to make sense of all the outputs.
> >>
> >> Example #1
> >>
> >> In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED CT and additional concepts that do not apply have been removed (e.g., general “cancer” concept, lung, colon and liver structures, etc). They have been plotted out by the begin/end positions. If the terms do not align, it’s probably because the email only accepts plain text and a mono-spaced font is not the default.
> >>
> >> cancer of colon, lung and liver
> >> cancer of colon, lung and liver   93870000|Malignant neoplasm of liver (disorder)|
> >> cancer of colon, lung             363358000|Malignant tumor of lung (disorder)|
> >> cancer of colon                   363406005|Malignant tumor of colon (disorder)|
> >>
> >> Question (1) – We had to do quite a bit of post-processing to remove inactive concepts, subtype concepts, concepts that are part of the defining attributes, etc. Is there a set of guidelines to help sort out the CUI or SNOMED CT codes that have been identified?
> >> Question (2) – How can we determine that “93870000|Malignant neoplasm of liver (disorder)|” refers to “cancer of liver” as opposed to using the begin/end string, which points to “cancer of colon, lung and liver”?
> >> Certainly we can try to do additional parsing, but there are a lot of different scenarios to take into account.
> >> Question (3) – This relates to question 2: are we able to identify the original terms that were used for the concept matching, or the exact description that was returned in the UMLS? While the CUI is helpful, the CUI can refer to tens or even hundreds of descriptions.
> >>
> >> Example #2
> >>
> >> Switching the position of colon, lung and liver can result in different encodings. Once again, after removing additional concepts not needed (i.e., “cancer” and “colon structure”), we get the following. What happened to liver and lung cancer?
> >>
> >> cancer of colon, liver and lung
> >> cancer of colon   363406005|Malignant tumor of colon (disorder)|
> >> lung              39607008|Lung structure (body structure)|
> >>
> >> We have more questions but will start with these. Thank you in advance.
> >>
> >> Regards,
> >> Dennis
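
For Q1, the walk-through Dennis gives for the 01 to 04 example (take 01, skip 02 and 03 because they start before 01 ends, continue with 04) can be sketched as a greedy left-to-right filter. This is plain Java with a made-up `Span` class standing in for an annotation, not cTAKES code, and the offsets are derived from the example sentence with `indexOf` purely for illustration. As Dennis notes, begin/end positions are not always a good indicator of what was coded, so treat this as a starting point only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of cTAKES or uimaFIT.
class LongestSpanSketch {

    // Minimal stand-in for an annotation: character offsets plus a concept code.
    static final class Span {
        final int begin;
        final int end; // exclusive
        final String code;

        Span(int begin, int end, String code) {
            this.begin = begin;
            this.end = end;
            this.code = code;
        }
    }

    // Build a span from the first occurrence of a phrase in the text.
    static Span find(String text, String phrase, String code) {
        int begin = text.indexOf(phrase);
        if (begin < 0) {
            throw new IllegalArgumentException("phrase not found: " + phrase);
        }
        return new Span(begin, begin + phrase.length(), code);
    }

    // Greedy left-to-right filter: keep a span only if it starts at or after
    // the end of the previously kept span.
    static List<Span> keepNonOverlapping(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        // Earliest begin first; for equal begins, the longer span first.
        sorted.sort(Comparator.comparingInt((Span s) -> s.begin)
                .thenComparing(Comparator.comparingInt((Span s) -> s.end).reversed()));
        List<Span> kept = new ArrayList<>();
        int lastEnd = 0;
        for (Span s : sorted) {
            if (s.begin >= lastEnd) {
                kept.add(s);
                lastEnd = s.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "Invasive ductal carcinoma of the left breast with bone metastases.";
        List<Span> spans = Arrays.asList(
                find(text, "Invasive ductal carcinoma of the left breast", "408643008"),
                find(text, "breast with bone", "56873002"),
                find(text, "breast with bone metastases", "94297009"),
                find(text, "bone metastases", "94222008"));
        for (Span s : keepNonOverlapping(spans)) {
            System.out.println(text.substring(s.begin, s.end) + " -> " + s.code);
        }
    }
}
```

On the example sentence this keeps 408643008 (01) and 94222008 (04) and drops 02 and 03. Note that for Example #1, where all three spans start at offset 0, the same filter would keep only the longest one (liver) and discard lung and colon, so it does not address the coordination case.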
