Re: Processing large batches of files in cTAKES [EXTERNAL]

Miller, Timothy Tue, 29 Jan 2019 13:23:30 -0800

OK, if you can see XML tags in the right pane, that means that cTAKES is trying 
to process the XML markup as well as the text. Can you change your Python 
pre-processing to write plain-text files containing only the text of the note, 
with no XML, and then process those? I think there are probably cases where 
having XML in the text would confuse some of the modules and cause them to run 
slowly. You will also get weird outputs: I've seen "<span>" get annotated as a 
"body measurement finding" when we accidentally processed some HTML once.
Tim
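
A minimal sketch of the plain-text conversion suggested above, assuming the 
input is a directory of well-formed .xml files written with ElementTree. The 
element layout of those files is not shown in this thread, so the sketch keeps 
the text of every element; the function names and directory arguments are 
hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def note_text(root: ET.Element) -> str:
    """Return the character data of an XML tree with all markup removed."""
    # itertext() yields the text of every element in document order.
    # Pieces are joined with newlines so that text from adjacent elements
    # does not run together; whitespace-only pieces are dropped.
    pieces = (piece.strip() for piece in root.itertext())
    return "\n".join(piece for piece in pieces if piece)


def convert_directory(xml_dir: str, txt_dir: str) -> int:
    """Write one .txt file per .xml file and return the number written."""
    out_dir = Path(txt_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        root = ET.parse(str(xml_path)).getroot()
        out_path = out_dir / (xml_path.stem + ".txt")
        out_path.write_text(note_text(root), encoding="utf-8")
        written += 1
    return written
```

If the XML also holds metadata (subject IDs, dates), selecting only the 
element that contains the note, for example with root.iter() and that 
element's tag name, would keep the metadata out of the cTAKES input.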

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
To: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>, 
user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 21:15:54 +0000

Yes, I’ve been following those instructions to view the .xmi files in the CVD.  
The right pane shows the text of the XML file.

Leah

From: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 3:00 PM
To: "Baas,Leah" <leah.b...@sanfordhealth.org>, "user@ctakes.apache.org" 
<user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

So after you process all the notes do you follow the instructions on the wiki 
page that say:
You can view information in the XMI files using the UIMA Cas Visual Debugger 
(CVD).

Execute bin/runctakesCVD
Select File > Read Type System File
Select TypeSystem.xml in resources/org/apache/ctakes/typesystem/types/
Select File > Read XMI CAS File
Select any .xmi file in your outputDirectory

and look at that .xmi file? If so, what do you see in the right pane? The text 
of the note or the text of an xml file?
Tim

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
To: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>, 
user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:45:58 +0000

It is not CDA format. I used Python’s ElementTree module to generate XML files 
containing the clinical notes for each subject in my dataset. When I run the 
Default Clinical Pipeline, I can successfully generate XMI output files for 
each XML file in my input directory. The following WARNING message appears 
multiple times over the course of the processing (not sure if this is at all 
related to the issue at hand):

Jan 29, 2019 2:02:56 PM org.apache.uima.util.MessageReport 
decreasingWithTrace(51)
WARNING: Message count: 1; Feature 
org.apache.ctakes.typesystem.type.textsem.Predicate:relations is marked 
multipleReferencesAllowed=false, but it has multiple references.  These will be 
serialized in duplicate. Message count indicates messages skipped to avoid 
potential flooding. Set FINE logging level for stacktrace.

Leah

From: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 2:28 PM
To: "Baas,Leah" <leah.b...@sanfordhealth.org>, "user@ctakes.apache.org" 
<user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Well, if you're processing XML files, that will likely cause a problem with 
this script; it expects plain-text files in a directory. Maybe Sean can chime 
in on whether it's possible to use an XML collection reader with the 
runClinicalPipeline.sh script? Is it CDA format?
Tim

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
To: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>, 
user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:21:17 +0000

Hi Tim,

Thanks again for working through this with me. I hadn’t read through the time 
stamps carefully enough to notice the one-time cost of startup.

I did replicate your setup by copying/pasting 7 of my XML input files into an 
empty directory. Here’s what I saw:

  1.  For the startup -- 20 seconds between the first time-stamped log message:

29 Jan 2019 14:02:35  INFO SentenceDetector - Sentence detector model file: 
org/apache/ctakes/core/sentdetect/sd-med-model.zip

      and the first log message doing processing:

29 Jan 2019 14:02:55  INFO SentenceDetector - Starting processing.

  2.  Once started up, 12 seconds to process the notes.

29 Jan 2019 14:03:07  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

Does this help narrow things down?

Leah

From: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 1:58 PM
To: "Baas,Leah" <leah.b...@sanfordhealth.org>, "user@ctakes.apache.org" 
<user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I haven't used that script myself, but I just tried it now on some notes from 
mtsamples. Maybe you can try to replicate that setup? I just copy/pasted the 7 
allergy/immunology notes [1] into 7 text files in an empty directory. Here's 
what I see:

1) It is pretty slow to start up, but this is a one-time cost (~50 seconds). 
I'm looking at the time between the very first time-stamped log message:
29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model file: 
org/apache/ctakes/core/sentdetect/sd-med-model.zip

and the first log message doing processing:

29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing

2) Once started up, it processes the notes in about 14s. This is actually 
slower than I expected, but it is a lot faster than what you were seeing. I'm 
looking at the time between the start of processing just above and the last log 
message before it quits:

29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

If you can replicate this input/output setup and approximate timing in your VM 
first, then we can see whether it's a function of your notes or your setup.
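
The two intervals above can also be read off the log lines programmatically. 
A small sketch, assuming every line of interest starts with a timestamp in the 
"29 Jan 2019 14:51:51" form shown above and that month abbreviations are parsed 
in an English locale; the helper names are made up for illustration.

```python
from datetime import datetime

# Layout of the timestamp at the start of each log line quoted above.
LOG_TIME_FORMAT = "%d %b %Y %H:%M:%S"


def log_time(line: str) -> datetime:
    """Parse the leading timestamp (first four whitespace-separated fields)."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def seconds_between(earlier_line: str, later_line: str) -> float:
    """Elapsed seconds between the timestamps of two log lines."""
    return (log_time(later_line) - log_time(earlier_line)).total_seconds()


first_message = "29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model file:"
start_processing = "29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing"
finished = "29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing"

print(seconds_between(first_message, start_processing))  # startup: 49.0
print(seconds_between(start_processing, finished))       # processing: 14.0
```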

Tim

[1] https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
To: user@ctakes.apache.org, timothy.mil...@childrens.harvard.edu
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 19:33:34 +0000

Hi again Tim,

I am trying to check which version of the dictionary I am using when running 
the Default Clinical Pipeline. I have been running the pipeline according to 
the instructions detailed here: 
https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline 
However, I haven’t been able to find documentation specifying which dictionary 
version is built into this pipeline. There must be a simple way to check—I am 
just ignorant. Could you enlighten me?

Thanks,

Leah

From: "Baas,Leah" <leah.b...@sanfordhealth.org>
Date: Tuesday, January 29, 2019 at 12:23 PM
To: "user@ctakes.apache.org" <user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Tim,

Thanks for your quick response! Probably unsurprisingly, I’ll have to do some 
googling to learn how to check those things. If you could point me in the right 
direction, that’d be great!

Thanks again,

Leah

From: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>
Reply-To: "user@ctakes.apache.org" <user@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 12:14 PM
To: "user@ctakes.apache.org" <user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I am able to process that number of files in a reasonable amount of time (maybe 
an hour) on an average desktop. Luckily, debugging your setup should be much 
easier than doing a scaleout. A few possibilities:

* You are running the old (slow) dictionary instead of the new fast one
* Your document has extremely long sentences
* Your VM is _extremely_ resource constrained and is thrashing constantly

Do you know how to check these things?
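
For the second possibility, a rough way to look for extremely long sentences 
before running cTAKES is sketched below. It assumes the notes are plain-text 
.txt files and uses a crude regular-expression split, not the cTAKES sentence 
detector, so it only gives an approximate picture. The function names and the 
200-token threshold are arbitrary choices for illustration.

```python
import re
from pathlib import Path

# Split after ., ! or ? followed by whitespace, or at a blank line.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def longest_sentence_tokens(text: str) -> int:
    """Whitespace-token count of the longest crude 'sentence' in text."""
    sentences = SENTENCE_BREAK.split(text)
    return max((len(sentence.split()) for sentence in sentences), default=0)


def flag_long_files(txt_dir: str, threshold: int = 200) -> list:
    """List (file name, token count) for files whose longest sentence exceeds threshold."""
    flagged = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        longest = longest_sentence_tokens(text)
        if longest > threshold:
            flagged.append((path.name, longest))
    return flagged
```

Notes made of long lists or tables with no sentence-ending punctuation are the 
kind of input this check would flag.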
Tim

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
Reply-to: <user@ctakes.apache.org>
To: user@ctakes.apache.org
Subject: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 17:58:48 +0000

Hi all,

I would like to process a batch of 13,414 files (avg file size 6.2 KB) using 
the default clinical pipeline. I am new to cTAKES and computer programming, and 
I’m looking for guidance on how to process these files with maximum time/CPU 
efficiency. I am currently running my program on an Ubuntu VM with 3 CPUs. It 
takes me 28 seconds (real time) to process one 6.0 KB file. I’m reading up on 
parallel processing strategies, but would be grateful for any suggestions, 
tips, etc. that you might have!
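
For scale, a back-of-the-envelope sketch of what these figures imply if 28 
seconds were a true per-file cost. It does not separate any one-time startup 
cost from the per-file cost, and the parallel figure assumes ideal scaling 
across workers, so both numbers are rough bounds rather than predictions. The 
function names are illustrative.

```python
import math


def sequential_hours(n_files: int, seconds_per_file: float) -> float:
    """Total hours if files are processed one after another."""
    return n_files * seconds_per_file / 3600


def parallel_hours(n_files: int, seconds_per_file: float, workers: int) -> float:
    """Hours with files split evenly across workers (ideal scaling assumed)."""
    files_per_worker = math.ceil(n_files / workers)
    return files_per_worker * seconds_per_file / 3600


print(f"{sequential_hours(13414, 28):.1f} hours sequentially")      # 104.3
print(f"{parallel_hours(13414, 28, 3):.1f} hours on 3 workers")      # 34.8
```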

Thanks,

Leah

