Re: Processing Extraordinarily Long Documents

2019-02-28 Thread Michael Trepanier
Hi Ron,

Hugely appreciate the response. Do you know the maximum document size you
fed through your pipeline? Below is a line histogram of our note length vs.
processing time (ns). On our end, we're seeing a similar performance
drop-off after around 20,000 chars, with more or less exponential growth in
runtime from there on out.

[image: image.png]
Our current setup leverages 256 Spark executors (essentially JVMs), each
with 7G of RAM and 1 core, and feeds partitions of ~20,000 notes each into
them. With this config, we burned through 99% of the notes in less than a
day, but then spun on the partitions that contained the larger notes for
nearly a week afterwards. For your implementation, could you share the
hardware specs and how long it took to process the 84M docs?
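
For reference, below is a rough sketch of the mitigation we're considering
on our side: isolate the long notes into their own partitions so a handful
of huge documents can't pin an entire ~20,000-note partition. Untested, and
the DataFrame/column names are placeholders:

import org.apache.spark.sql.functions._

// `notes` is a hypothetical DataFrame holding the note text in a `text`
// column; 20,000 chars is the breakpoint we observed above.
val threshold = 20000
val small = notes.filter(length(col("text")) <= threshold)
val large = notes.filter(length(col("text")) > threshold)

// Small notes keep the coarse partitioning; large notes get roughly one
// note per partition so a single 900k-char document can't stall neighbors.
val smallOut = small.repartition(256)
val largeOut = large.repartition(math.max(1, large.count().toInt))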

Regards,

Mike

On Thu, Feb 28, 2019 at 10:11 AM Price, Ronald  wrote:

> Mike,
>
> We’ve fully processed 84M documents through CTAKES on 3 separate
> occasions.  We constructed a pipeline that has 30 separately controlled
> sub-queues.  We have the ability to target processing of documents to
> specific queues.  We allocate and target 5-10 queues for processing of
> large documents.  Similar to you, we have a small percentage (3%-4%) of
> documents that are over 15K in size.  The bulk of our documents are less
> than 3K.  In our environment and through some detailed performance
> analysis, we determined that the performance breakpoint occurs once
> documents get above 12K-13K.  We also target processing as many as 10
> annotators in a single pass of the corpus.  This approach has worked well
> for us.
>
>
>
> Thanks,
>
> Ron
>
>
>
>
>
>
>
>
>
> *From: *Michael Trepanier 
> *Date: *Thursday, February 28, 2019 at 11:57 AM
> *To: *"user@ctakes.apache.org" 
> *Cc: *"Price, Ronald" 
> *Subject: *Re: Processing Extraordinarily Long Documents
>
>
>
> Hi Dima,
>
>
>
> Thanks for the feedback! As our pipeline develops, we'll be building in
> additional functionality (e.g. Temporal Relations) that requires context
> beyond a single sentence. Given this, partitioning on document length and
> shunting long documents to another queue is an excellent solution.
>
>
>
> Thanks,
>
>
>
> Mike
>
>
>
> On Thu, Feb 28, 2019 at 4:08 AM Dligach, Dmitriy  wrote:
>
> Hi Mike,
>
>
>
> We also observed this issue. Splitting large documents into smaller ones
> is an option, but you have to make sure you preserve the integrity of
> individual sentences or you might lose some concept mentions. Since you
> are using cTAKES only for ontology mapping, I don’t think you need to worry
> about the integrity of linguistic units larger than a sentence.
>
>
>
> FWIW, our solution to this problem was to create a separate queue for
> large documents and process them independently from the smaller documents.
>
>
>
> Best,
>
>
> Dima
>
>
>
>
>
>
>
> On Feb 27, 2019, at 16:59, Michael Trepanier  wrote:
>
>
>
> Hi,
>
>
>
> We currently have a pipeline which is generating ontology mappings for a
> repository of clinical notes. However, this repository contains documents
> which, after RTF parsing, can contain over 900,000 characters (albeit a
> very small percentage of notes: out of ~13 million, around 50 contain
> more than 100k chars). Looking at some averages across the dataset, it is
> clear that the processing time is exponentially related to the note length:
>
>
>
> 0-1K chars: 0.9 seconds (11 million notes)
>
> 1-2K chars: 5.625 seconds (1.5 million notes)
>
> 21-22K chars: 4422 seconds/1.22 hours (3 notes)
>
> 90-100K chars: 103237 seconds/28.6 hours (1 note)
>
>
>
> Given these results, splitting the longer docs into partitions would speed
> up the pipeline considerably. However, our team has some concerns over how
> that might impact the context-aware steps of the cTAKES pipeline. How would
> the results from splitting a doc on its sentences or paragraphs compare to
> feeding in an entire doc? Does the default pipeline API support a way to
> use segments instead of the entire document text?
>
>
>
> Regards,
>
>
>
> Mike
>
>
>
>
>
>
>
>
> --
>
> Mike Trepanier| Senior Big Data Engineer | MetiStream, Inc. |
> m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com
>


-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Senior Big Data Engineer | MetiStream, Inc. |
m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com


Re: Processing Extraordinarily Long Documents

2019-02-28 Thread Michael Trepanier
Hi Dima,

Thanks for the feedback! As our pipeline develops, we'll be building in
additional functionality (e.g. Temporal Relations) that requires context
beyond a single sentence. Given this, partitioning on document length and
shunting long documents to another queue is an excellent solution.
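
To make the queue idea concrete for anyone searching the archive later,
here's roughly what we're planning (untested sketch; the paths and the 20k
threshold are placeholders, not our actual config):

import org.apache.spark.sql.functions._

// Route notes into separate storage "queues" by length; each queue is
// then processed by an independently tuned job, per Dima's suggestion.
val isLarge = length(col("text")) > 20000
notes.filter(!isLarge).write.parquet("/queues/standard")
notes.filter(isLarge).write.parquet("/queues/large")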

Thanks,

Mike

On Thu, Feb 28, 2019 at 4:08 AM Dligach, Dmitriy  wrote:

> Hi Mike,
>
> We also observed this issue. Splitting large documents into smaller ones
> is an option, but you have to make sure you preserve the integrity of
> individual sentences or you might lose some concept mentions. Since you
> are using cTAKES only for ontology mapping, I don’t think you need to worry
> about the integrity of linguistic units larger than a sentence.
>
> FWIW, our solution to this problem was to create a separate queue for
> large documents and process them independently from the smaller documents.
>
> Best,
>
> Dima
>
>
>
>
> On Feb 27, 2019, at 16:59, Michael Trepanier  wrote:
>
> Hi,
>
> We currently have a pipeline which is generating ontology mappings for a
> repository of clinical notes. However, this repository contains documents
> which, after RTF parsing, can contain over 900,000 characters (albeit a
> very small percentage of notes: out of ~13 million, around 50 contain
> more than 100k chars). Looking at some averages across the dataset, it is
> clear that the processing time is exponentially related to the note length:
>
> 0-1K chars: 0.9 seconds (11 million notes)
> 1-2K chars: 5.625 seconds (1.5 million notes)
> 21-22K chars: 4422 seconds/1.22 hours (3 notes)
> 90-100K chars: 103237 seconds/28.6 hours (1 note)
>
> Given these results, splitting the longer docs into partitions would speed
> up the pipeline considerably. However, our team has some concerns over how
> that might impact the context-aware steps of the cTAKES pipeline. How would
> the results from splitting a doc on its sentences or paragraphs compare to
> feeding in an entire doc? Does the default pipeline API support a way to
> use segments instead of the entire document text?
>
> Regards,
>
> Mike
>
>
>

-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Senior Big Data Engineer | MetiStream, Inc. |
m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com


Processing Extraordinarily Long Documents

2019-02-27 Thread Michael Trepanier
Hi,

We currently have a pipeline which is generating ontology mappings for a
repository of clinical notes. However, this repository contains documents
which, after RTF parsing, can contain over 900,000 characters (albeit a
very small percentage of notes: out of ~13 million, around 50 contain
more than 100k chars). Looking at some averages across the dataset, it is
clear that the processing time is exponentially related to the note length:

0-1K chars: 0.9 seconds (11 million notes)
1-2K chars: 5.625 seconds (1.5 million notes)
21-22K chars: 4422 seconds/1.22 hours (3 notes)
90-100K chars: 103237 seconds/28.6 hours (1 note)

Given these results, splitting the longer docs into partitions would speed
up the pipeline considerably. However, our team has some concerns over how
that might impact the context-aware steps of the cTAKES pipeline. How would
the results from splitting a doc on its sentences or paragraphs compare to
feeding in an entire doc? Does the default pipeline API support a way to
use segments instead of the entire document text?
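
To make the question concrete, the splitting we have in mind looks roughly
like the sketch below: chunk on paragraph boundaries (blank lines) with a
length cap, so individual sentences stay intact and each chunk would be fed
to cTAKES as its own document text. Untested, and the 10k cap is arbitrary:

// Split a note into length-capped chunks on paragraph boundaries.
def splitNote(text: String, cap: Int = 10000): Seq[String] = {
  val paragraphs = text.split("\n\\s*\n")
  val chunks = scala.collection.mutable.ArrayBuffer(new StringBuilder)
  for (p <- paragraphs) {
    // start a new chunk if appending this paragraph would exceed the cap
    if (chunks.last.nonEmpty && chunks.last.length + p.length > cap)
      chunks += new StringBuilder
    chunks.last.append(p).append("\n\n")
  }
  chunks.map(_.toString.trim).filter(_.nonEmpty).toSeq
}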

Regards,

Mike


cTAKES Blocking when Run in Separate JVMs

2018-12-21 Thread Michael Trepanier
Hi,

I am running multiple cTAKES pipelines on a single machine in parallel,
each in their own JVM. Looking across the logs of each JVM, it appears that
severe blocking is occurring after the annotations are generated for a
particular segment. In particular, it looks like only one JVM is processing
at a time, while the others form a queue. Nearly all of the processes seem
to be halting during the SimpleSegmentAnnotator, just prior to the
initialization of the Sentence Detector.

18/12/20 22:44:33 INFO AbstractJCasTermAnnotator: Finished processing
18/12/20 22:44:36 WARN DocumentIDAnnotationUtil: Unable to find
DocumentIDAnnotation
<==== HERE (~21-minute gap)
18/12/20 23:06:07 INFO ae.SentenceDetector: Starting processing.
18/12/20 23:06:07 INFO ae.TokenizerAnnotatorPTB: process(JCas) in
org.apache.ctakes.core.ae.TokenizerAnnotatorPTB
18/12/20 23:06:07 INFO ae.LvgAnnotator: process(JCas)
18/12/20 23:06:11 INFO ae.ContextDependentTokenizerAnnotator: process(JCas)
18/12/20 23:06:11 INFO postagger.POSTagger: process(JCas)
18/12/20 23:06:12 INFO AbstractJCasTermAnnotator: Starting processing
18/12/20 23:06:12 INFO AbstractJCasTermAnnotator: Finished processing
18/12/20 23:06:15 WARN DocumentIDAnnotationUtil: Unable to find
DocumentIDAnnotation
<==== HERE (~21-minute gap)
18/12/20 23:27:22 INFO ae.SentenceDetector: Starting processing.
18/12/20 23:27:22 INFO ae.TokenizerAnnotatorPTB: process(JCas) in
org.apache.ctakes.core.ae.TokenizerAnnotatorPTB


I was wondering what could be causing this holdup? The JVMs share the
cTAKES resources and UMLS dictionaries; these were not duplicated for each
instance.

Thanks,

Mike

-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Senior Big Data Engineer | MetiStream, Inc. |
m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com


Re: How do I add a dictionary (like NCI) to cTakes lookup?

2018-08-20 Thread Michael Trepanier
Ory,

In response to Gandhi's comments, the video below outlines custom
dictionary creation in detail:

https://www.youtube.com/watch?v=4aOnafv-NQs

Best,

Mike



On Mon, Aug 20, 2018 at 2:09 AM, Gandhi Rajan Natarajan <
gandhi.natara...@arisglobal.com> wrote:

> Hi Ory,
>
> I guess RxNORM and SNOMED_CT dictionaries are loaded by default. If you
> want to lookup from other dictionaries like MEDDRA etc. , you may have to
> create your custom dictionary using cTAKES dictionary generator GUI. That’s
> what I did to include MEDDRA dictionary terms.
>
> -Original Message-
> From: Ory Henn 
> Sent: Monday, August 20, 2018 1:52 PM
> To: user@ctakes.apache.org
> Cc: Guy Gildor 
> Subject: How do I add a dictionary (like NCI) to cTakes lookup?
>
> Hello,
> New user here (-;
> I've downloaded and installed cTakes (+UMLS + all resources), and am
> trying to parse a single document.
> I see that cTakes (CVD/CPE) identifies CUIs only from RxNORM and SNOMED_CT.
>
> 1. What is the way to make cTakes look in more UMLS dictionaries? I
> specifically need NCI as well.
> 2. Is there an easy way to make cTakes look in all UMLS dictionaries?
> Thanks,
> Ory
>
> --
>
>  www.trialjectory.com 
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you are not the named addressee you should not disseminate, distribute
> or copy this e-mail. Please notify the sender or system manager by email
> immediately if you have received this e-mail by mistake and delete this
> e-mail from your system. If you are not the intended recipient you are
> notified that disclosing, copying, distributing or taking any action in
> reliance on the contents of this information is strictly prohibited and
> against the law.
>



-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Big Data Engineer | MetiStream, Inc. |  m...@metistream.com |
845 - 270 - 3129 (m) | www.metistream.com


Packaging cTAKES in a Jar - LVG Related Configuration Error

2018-03-28 Thread Michael Trepanier
Hi All,

I am attempting to package cTAKES in a jar while avoiding copying the
lvg-related files to /tmp/ as is done
in 
/ctakes/trunk/ctakes-lvg/src/main/java/org/apache/ctakes/lvg/ae/LvgAnnotator.java.

Everything works up until cTAKES tries to pass the lvg.properties file
within the jar down to gov.nih.nlm.nls.lvg.Lib.SetConfiguration, where the
code attempts to create a FileInputStream from a resource contained within
a jar, which throws the below exception.


** Configuration Error:
jar:file:\D:\ctakes\ctakes-local\lib\ctakes-assembly-4.0.jar!\org\apache\ctakes\lvg\data\config\lvg.properties
(The filename, directory name, or volume label syntax is incorrect)
** Error: problem of opening/reading config file:
'jar:file:\D:\ctakes\ctakes-local\lib\ctakes-assembly-4.0.jar!\org\apache\ctakes\lvg\data\config\lvg.properties'.
Use -x option to specify the config file path.

While I likely can't avoid the above scenario without changing cTAKES'
dependencies, I was wondering two things:

1) Would it be possible to set LVG_DIR to a non-absolute path instead of
AUTO_MODE and have it function properly?

2) Oddly enough, despite logging this error, cTAKES appears to be running
fine locally. Should I not be concerned about these configuration errors as
they don't seem to be impacting anything? Is there a downstream way I can
check that the properties file is being correctly read? Or is cTAKES
chugging through the default pipeline evidence enough that I need not worry?
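
For reference, the workaround I'm experimenting with mirrors what
LvgAnnotator itself does: copy lvg.properties out of the jar to a real file
at startup, since SetConfiguration reads it with a FileInputStream and
can't see inside a jar. Untested sketch; the target path is arbitrary:

import java.nio.file.{Files, Paths, StandardCopyOption}

// Resource path taken from the error message above; target is arbitrary.
val resource = "org/apache/ctakes/lvg/data/config/lvg.properties"
val target = Paths.get(System.getProperty("java.io.tmpdir"), "lvg.properties")
val in = getClass.getClassLoader.getResourceAsStream(resource)
require(in != null, s"$resource not found on classpath")
Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
in.close()
// LVG could then be pointed at the extracted file, e.g. via the -x option
// that the error message mentions.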

Best,

Mike



-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Big Data Engineer | MetiStream, Inc. |  m...@metistream.com |
845 - 270 - 3129 (m) | www.metistream.com


Re: [EXTERNAL] Leveraging cTAKES without a UMLS Credential Check

2018-03-10 Thread Michael Trepanier
Appreciate the responses folks. We'll look into either developing our own
dictionary or using one which requires a credentialed download, as opposed
to the SourceForge link.

Best,

Mike

On Sat, Mar 10, 2018 at 5:53 AM, Smith, Lincoln <lincoln.sm...@highmark.com>
wrote:

> I was running the same question by everyone a while back. From what we
> understand so far it seems that this may go away once you build and load
> your own dictionary vs. using the default you mentioned. But we haven't
> tested that yet. Lincoln
>
>
>
> *From:* Michael Trepanier [mailto:m...@metistream.com]
> *Sent:* Friday, March 09, 2018 4:42 PM
> *To:* user@ctakes.apache.org
> *Subject:* [EXTERNAL] Leveraging cTAKES without a UMLS Credential Check
>
>
>
> Hi All,
>
>
>
> Is it possible to avoid the UMLS credential check each time cTAKES is run?
> It seems like cTAKES would be configurable in such a way to use UMLS
> credentials to acquire the sno_rx_16abterms dictionary once, and then not
> need to check against UMLS in future runs.
>
>
>
> In particular, I am thinking for instances where cTAKES is being run
> either offline or in a highly parallel fashion and there is a chance UMLS
> could be bombarded with credential checks. If there is documentation on how
> to configure such a setup, I would greatly appreciate it if someone could
> point me to it.
>
>
>
> Regards,
>
>
>
> Mike
>
>
>
> --
>
> [image: MetiStream Logo - 500]
>
> Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
> m...@metistream.com | 845 - 270 - 3129 (m) |
> www.metistream.com
>
> --
>
> The information contained in this transmission may contain privileged and
> confidential information including personal information protected by
> federal and/or state privacy laws. It is intended only for the use of the
> addressee named above. If you are not the intended recipient, you are
> hereby notified that any review, dissemination, distribution or duplication
> of this communication is strictly prohibited. If you are not the intended
> recipient, please contact the sender by reply email and destroy all copies
> of the original message. Highmark Health is a Pennsylvania nonprofit
> corporation. This communication may come from Highmark Health or one of its
> subsidiaries or affiliated businesses.
>



-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Big Data Engineer | MetiStream, Inc. |  m...@metistream.com |
845 - 270 - 3129 (m) | www.metistream.com


Leveraging cTAKES without a UMLS Credential Check

2018-03-09 Thread Michael Trepanier
Hi All,

Is it possible to avoid the UMLS credential check each time cTAKES is run?
It seems like cTAKES would be configurable in such a way to use UMLS
credentials to acquire the sno_rx_16abterms dictionary once, and then not
need to check against UMLS in future runs.

In particular, I am thinking for instances where cTAKES is being run either
offline or in a highly parallel fashion and there is a chance UMLS could be
bombarded with credential checks. If there is documentation on how to
configure such a setup, I would greatly appreciate it if someone could
point me to it.
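
For context, we currently feed the credentials in as the two cTAKES system
properties once per JVM, same as in our Spark pipeline code (the env var
names below are placeholders). The question is whether the remote check can
be skipped once the dictionary is already on disk:

// How we supply credentials today; the check itself still goes to UMLS.
System.setProperty("ctakes.umlsuser", sys.env("UMLS_USER"))
System.setProperty("ctakes.umlspw", sys.env("UMLS_PASS"))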

Regards,

Mike

-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Big Data Engineer | MetiStream, Inc. |  m...@metistream.com |
845 - 270 - 3129 (m) | www.metistream.com


Re: Setting the Lvg Resources Location in lvg.properties

2017-09-29 Thread Michael Trepanier
Taking the resources out of my fat jar resolved this issue. I should add
that, as I'm running this pipeline in Spark, I had to set the related
HSQLDBs to read-only to permit simultaneous reads. Is there any reason they
are not set to read-only to begin with?
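
For anyone hitting the same lock contention: if I recall correctly, the
change amounted to a one-line flag in each dictionary database's
.properties file, which marks the whole HSQLDB file database as read-only
and permits concurrent readers (the file name below assumes the default
sno_rx_16ab dictionary; adjust to your setup):

# e.g. appended to sno_rx_16ab.properties, next to the .script file
readonly=true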

Mike

On Thu, Sep 28, 2017 at 6:27 PM, James Masanz <masanz.ja...@gmail.com>
wrote:

>
> I would expect that if you copy the LVG resources out of your UberJar, it
> should resolve the issue.
> Modifying the lvg.properties file generally causes problems.
> Following the hints in the 30/Jul/17 23:06 update to CTAKES-445
> <https://issues.apache.org/jira/browse/CTAKES-445>  should work without
> your having to modify the lvg.properties file.
>
> I haven't tested the patch to CTAKES-445
> <https://issues.apache.org/jira/browse/CTAKES-445> myself yet so I don't
> know whether it takes care of the problem in this case. I do know that the
> ctakes-lvg code does a change directory (cd)  to where it expects the LVG
> resources to be, or at least that's what it used to do when I last looked
> at it.  I suspect trying to cd into a jar is the problem you are seeing.
> I'll have to revisit that when I look at that patch.
>
> -- James
>
>
>
> On Tue, Sep 26, 2017 at 5:53 PM, Michael Trepanier <m...@metistream.com>
> wrote:
>
>> I am attempting to run cTAKES from an executable UberJar. While the fast
>> pipeline seems to run correctly (in terms of producing an output), when
>> stepping through the LvgAnnotator related steps, cTAKES produces the below
>> error.
>>
>> 26 Sep 2017 22:47:01  INFO LvgAnnotator - URL for lvg.properties 
>> =file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/data/config/lvg.properties
>> 26 Sep 2017 22:47:01  INFO SentenceDetector - Sentence detector model file: 
>> org/apache/ctakes/core/sentdetect/sd-med-model.zip
>> 26 Sep 2017 22:47:01  INFO TokenizerAnnotatorPTB - Initializing 
>> org.apache.ctakes.core.ae.TokenizerAnnotatorPTB
>> 26 Sep 2017 22:47:01  INFO LvgCmdApiResourceImpl - Loading NLM Norm and Lvg 
>> with config file = 
>> jar:file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/data/config/lvg.properties
>> 26 Sep 2017 22:47:01  INFO LvgCmdApiResourceImpl -   config file absolute 
>> path = 
>> /home/mike/jar:file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/data/config/lvg.properties
>> 26 Sep 2017 22:47:01  INFO LvgCmdApiResourceImpl - cwd = /home/mike
>> 26 Sep 2017 22:47:01  INFO LvgCmdApiResourceImpl - cd 
>> jar:file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/
>> ** Configuration Error: 
>> jar:file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/data/config/lvg.properties
>>  (No such file or directory)
>> ** Error: problem of opening/reading config file: 
>> 'jar:file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/data/config/lvg.properties'.
>>  Use -x option to specify the config file path.
>> ** Configuration Error: 
>> jar:file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/data/config/lvg.properties
>>  (No such file or directory)
>> ** Error: problem of opening/reading config file: 
>> 'jar:file:/home/mike/ctakes-assembly-4.0.jar!/org/apache/ctakes/lvg/data/config/lvg.properties'.
>>  Use -x option to specify the config file path.
>>
>> Would taking the additional cTAKES resources out of the UberJar resolve
>> this issue? And if so, can I use the lvg.properties file to set where these
>> resources should be?
>>
>> Note, as mentioned before, this error does not cause cTAKES to crash; I
>> am just worried it may be impacting the output. As well, I have implemented
>> the patch outlined at https://issues.apache.org/jira/browse/CTAKES-445
>>
>>
>> Regards,
>>
>> Mike
>>
>> --
>> [image: MetiStream Logo - 500]
>> Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
>> m...@metistream.com | 845 - 270 - 3129 (m) |
>> www.metistream.com
>>
>
>


-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Big Data Engineer | MetiStream, Inc. |  m...@metistream.com |
845 - 270 - 3129 (m) | www.metistream.com


Re: cTAKES Fast Pipeline Failing

2017-09-03 Thread Michael Trepanier
We're using one of the cTAKES 4.0 convenience binaries. The two wgets from
my install script are shown below (mirroring what's shown in the install):


wget -P /usr/local
http://mirrors.sonic.net/apache//ctakes/ctakes-4.0.0/apache-ctakes-4.0.0-bin.tar.gz
;
wget -P /usr/local
http://sourceforge.net/projects/ctakesresources/files/ctakes-resources-4.0-bin.zip


I'm wondering if this is now tied to the serializability of part of the
fast pipeline (as opposed to the default). We're not using the Maven
dependency due to some issues with the LvgAnnotator outlined here:
https://issues.apache.org/jira/browse/CTAKES-445

However, there appears to be a new patch as of three hours ago, so I need
to do some investigating there.

Mike


On Fri, Sep 1, 2017 at 6:56 PM, James Masanz <masanz.ja...@gmail.com> wrote:

> I think that in late April Sean Finan fixed a problem that was resulting
> in
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
> range:
> -7
>
> Are you using cTAKES 4.0 (either from the convenience binary download or
> as a maven dependency) or are you using cTAKES in some other way
>
> -- James
>
>
> On Fri, Sep 1, 2017 at 3:13 PM, Michael Trepanier <m...@metistream.com>
> wrote:
>
>> Hi All,
>>
>> We've been attempting to scale our cTAKES Pipeline on top of Spark, so
>> we've switched from using the "getDefaultPipeline" method to the
>> "getFastPipeline" method to boost the processing speed. However, while the
>> default pipeline works fine with Spark, the fast pipeline is throwing the
>> below error (edited down to the cTAKES portion of the stack trace):
>>
>>
>> Caused by: org.apache.uima.resource.ResourceInitializationException:
>> MESSAGE LOCALIZATION FAILED: Can't find resource for bundle
>> java.util.PropertyResourceBundle, key Could not construct
>> org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRare
>> WordDictionary
>> at org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnno
>> tator.initialize(AbstractJCasTermAnnotator.java:131)
>> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
>> _impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:266)
>> ... 44 more
>> Caused by: 
>> org.apache.uima.analysis_engine.annotator.AnnotatorContextException:
>> MESSAGE LOCALIZATION FAILED: Can't find resource for bundle
>> java.util.PropertyResourceBundle, key Could not construct
>> org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRare
>> WordDictionary
>> at org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDe
>> scriptorParser.parseDictionary(DictionaryDescriptorParser.java:199)
>> at org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDe
>> scriptorParser.parseDictionaries(DictionaryDescriptorParser.java:156)
>> at org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDe
>> scriptorParser.parseDescriptor(DictionaryDescriptorParser.java:128)
>> at org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnno
>> tator.initialize(AbstractJCasTermAnnotator.java:129)
>> ... 45 more
>> Caused by: java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>> ConstructorAccessorImpl.java:62)
>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>> legatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>> 23)
>> at org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDe
>> scriptorParser.parseDictionary(DictionaryDescriptorParser.java:196)
>> ... 48 more
>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out
>> of range: -7
>> at java.lang.String.substring(String.java:1967)
>> at org.apache.ctakes.dictionary.lookup2.util.JdbcConnectionFact
>> ory.getConnectionUrl(JdbcConnectionFactory.java:110)
>> at org.apache.ctakes.dictionary.lookup2.util.JdbcConnectionFact
>> ory.getConnection(JdbcConnectionFactory.java:63)
>> at org.apache.ctakes.dictionary.lookup2.dictionary.JdbcRareWord
>> Dictionary.<init>(JdbcRareWordDictionary.java:91)
>> at org.apache.ctakes.dictionary.lookup2.dictionary.JdbcRareWord
>> Dictionary.<init>(JdbcRareWordDictionary.java:72)
>> at org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRare
>> WordDictionary.<init>(UmlsJdbcRareWordDictionary.java:31)
>> ... 53 more
>>
>>
>> So, looking in "get

cTAKES Fast Pipeline Failing

2017-09-01 Thread Michael Trepanier
Hi All,

We've been attempting to scale our cTAKES Pipeline on top of Spark, so
we've switched from using the "getDefaultPipeline" method to the
"getFastPipeline" method to boost the processing speed. However, while the
default pipeline works fine with Spark, the fast pipeline is throwing the
below error (edited down to the cTAKES portion of the stack trace):


Caused by: org.apache.uima.resource.ResourceInitializationException:
MESSAGE LOCALIZATION FAILED: Can't find resource for bundle
java.util.PropertyResourceBundle, key Could not construct
org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary
at
org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator.initialize(AbstractJCasTermAnnotator.java:131)
at
org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:266)
... 44 more
Caused by:
org.apache.uima.analysis_engine.annotator.AnnotatorContextException:
MESSAGE LOCALIZATION FAILED: Can't find resource for bundle
java.util.PropertyResourceBundle, key Could not construct
org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary
at
org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDescriptorParser.parseDictionary(DictionaryDescriptorParser.java:199)
at
org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDescriptorParser.parseDictionaries(DictionaryDescriptorParser.java:156)
at
org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDescriptorParser.parseDescriptor(DictionaryDescriptorParser.java:128)
at
org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator.initialize(AbstractJCasTermAnnotator.java:129)
... 45 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDescriptorParser.parseDictionary(DictionaryDescriptorParser.java:196)
... 48 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
range: -7
at java.lang.String.substring(String.java:1967)
at
org.apache.ctakes.dictionary.lookup2.util.JdbcConnectionFactory.getConnectionUrl(JdbcConnectionFactory.java:110)
at
org.apache.ctakes.dictionary.lookup2.util.JdbcConnectionFactory.getConnection(JdbcConnectionFactory.java:63)
at
org.apache.ctakes.dictionary.lookup2.dictionary.JdbcRareWordDictionary.<init>(JdbcRareWordDictionary.java:91)
at
org.apache.ctakes.dictionary.lookup2.dictionary.JdbcRareWordDictionary.<init>(JdbcRareWordDictionary.java:72)
at
org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary.<init>(UmlsJdbcRareWordDictionary.java:31)
... 53 more


So, looking in "getConnectionUrl," we have this method:

static private String getConnectionUrl( final String jdbcUrl ) throws SQLException {
   final String urlDbPath = jdbcUrl.substring( HSQL_FILE_PREFIX.length() );
   final String urlFilePath = urlDbPath + HSQL_DB_EXT;
   try {
      final URL url = FileLocator.getResource( urlFilePath );
      final String urlString = url.toExternalForm();
      return urlString.substring( 0, urlString.length() - HSQL_DB_EXT.length() ); // <---
   } catch ( FileNotFoundException fnfE ) {
      throw new SQLException( "No Hsql DB exists at Url", fnfE );
   }
}

The substring call indicated above appears to be what is causing the
error: the "urlString" variable evidently has a length of zero, which is
consistent with the "String index out of range: -7" in the trace. This
seems to indicate that there is something wrong with the cTAKES resources.
However, that isn't making much sense to me, as the default pipeline, which
also relies on the resources package, is working fine. Has anyone
encountered something like this before? Does the fast pipeline require some
additional resources?

As well, for the Spark implementation, we've put the cTAKES jars and
resources on each executor at the same location, and are specifying this
on the executor classpath.
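
As a sanity check, we're also planning to verify on an executor that the
dictionary resolves through FileLocator, the same call the failing method
uses. The import and resource path below are assumptions based on the
default cTAKES 4.0 resources layout, so treat this as a sketch:

import org.apache.ctakes.core.resource.FileLocator

// Hypothetical path to the fast-lookup dictionary's HSQLDB script file.
val path = "org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.script"
val url = FileLocator.getResource(path)
println(s"resolved to: ${url.toExternalForm}")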

Thanks,

Mike
-- 
[image: MetiStream Logo - 500]
Mike Trepanier| Big Data Engineer | MetiStream, Inc. |  m...@metistream.com |
845 - 270 - 3129 (m) | www.metistream.com


Implementation Improvements for cTAKES on top of Spark

2017-07-25 Thread Michael Trepanier
Hi,

I am currently leveraging cTAKES inside of Apache Spark and have
written a function that takes in a single clinical note as a string
and does the following:

1) Sets the UMLS system properties.
2) Instantiates JCAS object.
3) Runs the default pipeline
4) (Not shown below) Grabs the annotations and places them in a JSON
object for each note.

  def generateAnnotations(paragraph:String): String = {
System.setProperty("ctakes.umlsuser", "MY_UMLS_USERNAME")
System.setProperty("ctakes.umlspw", "MY_UMLS_PASSWORD")

var jcas = 
JCasFactory.createJCas("org.apache.ctakes.typesystem.types.TypeSystem")
var aed = ClinicalPipelineFactory.getDefaultPipeline()
jcas.setDocumentText(paragraph)
SimplePipeline.runPipeline(jcas, aed)
...

This function is being implemented as a UDF which is applied to a
Spark Dataframe with clinical notes in each row. I have two
implementation questions that follow:

1) When cTAKES is being applied iteratively to clinical notes, is it
necessary to instantiate a new JCAS object for every annotation? Or
can the same JCAS object be utilized over and over with the document
text being changed?
2) For each application of this function, the
UmlsDictionaryLookupAnnotator has to connect to UMLS using the
provided UMLS information. Is there any way to instead perform
this step locally? I.e., ingest UMLS and place it in either HDFS or just
mount it somewhere on each node? I'm worried about spamming the UMLS
server in this step, and about how long this seems to take.
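
For question 1, the pattern we're considering is an untested sketch along
these lines: build the pipeline and JCas once per partition via
mapPartitions and call jcas.reset() between notes (reset() being the
standard UIMA way to reuse a CAS). notesRdd and extractAnnotations are
placeholders, and the credential env vars are illustrative:

import org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory
import org.apache.uima.fit.factory.JCasFactory
import org.apache.uima.fit.pipeline.SimplePipeline

notesRdd.mapPartitions { notes =>
  // one-time setup per partition, instead of once per note
  System.setProperty("ctakes.umlsuser", sys.env("UMLS_USER"))
  System.setProperty("ctakes.umlspw", sys.env("UMLS_PASS"))
  val jcas = JCasFactory.createJCas("org.apache.ctakes.typesystem.types.TypeSystem")
  val aed = ClinicalPipelineFactory.getDefaultPipeline()
  notes.map { note =>
    jcas.reset()                 // clear the previous note's annotations
    jcas.setDocumentText(note)
    SimplePipeline.runPipeline(jcas, aed)
    extractAnnotations(jcas)     // hypothetical helper -> JSON string
  }
}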

Thanks,

Mike


-- 

Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com