Well, if you're processing XML files, that will likely cause a problem with this script; it expects plain text files in a directory. Maybe Sean can chime in on whether it's possible to use an XML collection reader with the runClinicalPipeline.sh script? Is it CDA format?

Tim

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
To: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>, user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:21:17 +0000

Hi Tim,

Thanks again for working through this with me. I hadn’t read through the time stamps carefully enough to notice the one-time cost of startup. I did replicate your setup by copying/pasting 7 of my XML input files into an empty directory. Here’s what I saw:

1. For the startup: 20 seconds between the first time-stamped log message:

29 Jan 2019 14:02:35 INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

and the first log message doing processing:

29 Jan 2019 14:02:55 INFO SentenceDetector - Starting processing.

2. Once started up, 12 seconds to process the notes.

29 Jan 2019 14:03:07 INFO ClearNLPSemanticRoleLabelerAE - Finished processing

Does this help narrow things down?

Leah

From: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 1:58 PM
To: "Baas,Leah" <leah.b...@sanfordhealth.org>, "user@ctakes.apache.org" <user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I haven't used that script myself, but I just tried it now on some notes from mtsamples. Maybe you can try to replicate that setup? I just copy/pasted the 7 allergy/immunology notes [1] into 7 text files in an empty directory. Here's what I see:

1) It is pretty slow to start up -- but this is a one time cost (~50 seconds).
I'm looking at the time between the very first time-stamped log message:

29 Jan 2019 14:51:51 INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

and the first log message doing processing:

29 Jan 2019 14:52:40 INFO SentenceDetector - Starting processing

2) Once started up, it processes the notes in about 14s. This is actually slower than expected, but it is a lot faster than you were seeing. I'm looking at the time between the start of processing just above and the last log message before it quits:

29 Jan 2019 14:52:54 INFO ClearNLPSemanticRoleLabelerAE - Finished processing

If you can replicate this input/output setup and approximate timing in your VM first, then we can see whether it's a function of your notes or your setup.

Tim

[1] https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
To: user@ctakes.apache.org, timothy.mil...@childrens.harvard.edu
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 19:33:34 +0000

Hi again Tim,

I am trying to check which version of the dictionary I am using when running the Default Clinical Pipeline.
I have been running the pipeline according to the instructions detailed here <https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline>. However, I haven’t been able to find documentation specifying which dictionary version is built into this pipeline. There must be a simple way to check; I am just ignorant. Could you enlighten me?

Thanks,
Leah

From: "Baas,Leah" <leah.b...@sanfordhealth.org>
Date: Tuesday, January 29, 2019 at 12:23 PM
To: "user@ctakes.apache.org" <user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Tim,

Thanks for your quick response! Probably unsurprisingly, I’ll have to do some googling to learn how to check those things. If you could point me in the right direction, that’d be great!

Thanks again,
Leah

From: "Miller, Timothy" <timothy.mil...@childrens.harvard.edu>
Reply-To: "user@ctakes.apache.org" <user@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 12:14 PM
To: "user@ctakes.apache.org" <user@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I am able to process that number of files in a reasonable amount of time (maybe an hour) on an average desktop. Luckily, debugging your setup should be much easier than doing a scaleout. A few possibilities:

* You are running the old (slow) dictionary instead of the new fast one
* Your document has extremely long sentences
* Your VM is _extremely_ resource constrained and is thrashing constantly

Do you know how to check these things?
Tim

-----Original Message-----
From: "Baas,Leah" <leah.b...@sanfordhealth.org>
Reply-to: <user@ctakes.apache.org>
To: user@ctakes.apache.org
Subject: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 17:58:48 +0000

Hi all,

I would like to process a batch of 13,414 files (avg file size 6.2 KB) using the default clinical pipeline. I am new to cTAKES and computer programming, and I’m looking for guidance on how to process these files with maximum time/CPU efficiency. I am currently running my program on an Ubuntu VM with 3 CPUs. It takes me 28 seconds (real time) to process one 6.0 KB file. I’m reading up on parallel processing strategies, but would be grateful for any suggestions, tips, etc. that you might have!

Thanks,
Leah
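
Editorial sketch (not part of the original messages): the timing comparison in this thread can be reproduced from the quoted log lines. The Python below parses the leading timestamps of Leah's three log lines, separates the one-time startup cost from the per-note processing time, and extrapolates to the 13,414-file batch. The helper names (`log_time`, `seconds_between`) are made up for illustration, and the extrapolation assumes the 7-note sample is representative of the whole batch, which the thread does not establish.

```python
from datetime import datetime

# Timestamp layout at the start of each quoted log line, e.g. "29 Jan 2019 14:02:35".
# Note: %b assumes English month abbreviations (the default C locale).
LOG_TIME_FORMAT = "%d %b %Y %H:%M:%S"


def log_time(line: str) -> datetime:
    """Parse the leading 'DD Mon YYYY HH:MM:SS' stamp of a log line (hypothetical helper)."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def seconds_between(start_line: str, end_line: str) -> float:
    """Elapsed seconds between two time-stamped log lines (hypothetical helper)."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


# The three log lines quoted in Leah's 7-file run.
first_line = ("29 Jan 2019 14:02:35 INFO SentenceDetector - Sentence detector "
              "model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip")
start_line = "29 Jan 2019 14:02:55 INFO SentenceDetector - Starting processing."
done_line = "29 Jan 2019 14:03:07 INFO ClearNLPSemanticRoleLabelerAE - Finished processing"

startup_s = seconds_between(first_line, start_line)    # one-time cost: 20.0 s
processing_s = seconds_between(start_line, done_line)  # 7 notes: 12.0 s

NOTES_IN_SAMPLE = 7
BATCH_SIZE = 13414

# Assumption: per-note time in the sample carries over to the full batch.
per_note_s = processing_s / NOTES_IN_SAMPLE
single_run_hours = (startup_s + per_note_s * BATCH_SIZE) / 3600

# For contrast: paying the originally reported 28 s (startup included) per file.
per_file_launch_hours = 28 * BATCH_SIZE / 3600

print(f"startup: {startup_s:.0f} s, processing: {processing_s:.0f} s")
print(f"one run over the whole batch: about {single_run_hours:.1f} hours")
print(f"one launch per file at 28 s:  about {per_file_launch_hours:.1f} hours")
```

Under these assumptions, a single run over the whole directory comes to roughly 6.4 hours, versus roughly 104 hours if every file pays the 28-second cost. Most of the originally reported 28 seconds per file is therefore startup rather than processing.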