Re: Implementation Improvements for cTAKES on top of Spark

Michael Trepanier Fri, 11 Aug 2017 11:09:04 -0700

Chris,

We actually were basing our implementation off of
https://github.com/selinachu/SparkStreamingCTK which I believe came from
your team, but updated it for cTAKES 4.0. For those trying to do this,
you'll likely run into issues tied to the lvg annotator outlined here:
https://issues.apache.org/jira/browse/CTAKES-445


The comments provide a solution (essentially, ensure the cTAKES resources
zip is on your classpath). In a cluster environment, this means putting
the resources on every node at that same classpath location. Fingers
crossed that some future release of cTAKES can just be bundled into a
fat-jar with no issues.
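
A quick way to confirm the resources are visible before kicking off a job
is to ask the class loader for them. This is only a minimal sketch in plain
Scala: ClasspathCheck, missingResources, and the resource name in the usage
line are placeholders of mine, not cTAKES names, so substitute whichever
files the lvg annotator reports as missing.

```scala
object ClasspathCheck {
  // Returns the subset of `names` that the class loader cannot locate.
  // An empty result means every listed resource is on the classpath.
  def missingResources(names: Seq[String]): Seq[String] = {
    val loader = getClass.getClassLoader
    names.filter(name => loader.getResource(name) == null)
  }
}

// Usage (the path is a made-up placeholder, not a real cTAKES resource):
val missing = ClasspathCheck.missingResources(Seq("path/to/some/lvg-resource.placeholder"))
if (missing.nonEmpty) println(s"Not on classpath: ${missing.mkString(", ")}")
```

On a cluster, the same check would have to run inside a task so that it
exercises each executor's classpath rather than the driver's.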

Mike

On Fri, Jul 28, 2017 at 1:17 PM, Mattmann, Chris A (3010) <
[email protected]> wrote:

> FYI for interest, my JPL team implemented a prototype of this in 2015:
>
> https://www.mail-archive.com/[email protected]/msg01082.html
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, NSF & Open Source Projects Formulation and Development Offices
> (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> On 7/28/17, 11:19 AM, "Michael Trepanier" <[email protected]> wrote:
>
>     That's an excellent suggestion! In a Spark implementation, build the
>     pipeline outside of the map function and pass the aed in as an input.
>     Then just ensure the jcas object persists between each mapping
>     iteration and leverage the reset method.
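
The reuse pattern described above (build once, reset between notes) can be
sketched without any cTAKES dependency. This is an illustration under
assumptions, not the implementation we ran: FakeAnnotator is a hypothetical
stand-in for the pipeline plus the JCas, its reset() plays the role of the
JCas reset method, and buildCount exists only to make the reuse observable.

```scala
object PipelineReuseSketch {
  // Counts how often the expensive setup runs.
  var buildCount = 0

  // Hypothetical stand-in for "pipeline + JCas"; NOT a cTAKES or UIMA class.
  final class FakeAnnotator {
    buildCount += 1                      // pretend this is the costly setup
    private val buffer = new StringBuilder
    def reset(): Unit = buffer.clear()   // analogous to resetting the JCas
    def annotate(note: String): String = {
      buffer.append(note.toUpperCase)    // placeholder for running the pipeline
      buffer.toString
    }
  }

  // Shaped like a function for mapPartitions: Iterator in, Iterator out.
  def annotatePartition(notes: Iterator[String]): Iterator[String] = {
    val annotator = new FakeAnnotator    // built once for the whole partition
    notes.map { note =>
      annotator.reset()                  // drop state left by the previous note
      annotator.annotate(note)
    }
  }
}

// Usage: three notes, one annotator construction.
val annotated = PipelineReuseSketch.annotatePartition(Iterator("a", "bc", "d")).toList
```

With Spark, a function of this shape could be handed to mapPartitions on an
RDD of notes, so the setup cost is paid once per partition instead of once
per row. Skipping the reset() call would leak the previous note's content
into the next result, which is the reason the reset matters.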
>
>     On Fri, Jul 28, 2017 at 11:11 AM, Abramowitsch, Peter
>     <[email protected]> wrote:
>     > About your second question with UMLS: you can build the pipeline
>     > initially and it will verify the license info, then just reuse the
>     > pipeline on each call.
>     >
>     >
>     >
>     > On 7/25/17, 4:53 PM, "Michael Trepanier" <[email protected]> wrote:
>     >
>     >>Hi,
>     >>
>     >>I am currently leveraging cTAKES inside of Apache Spark and have
>     >>written a function that takes in a single clinical note as a string
>     >>and does the following:
>     >>
>     >>1) Sets the UMLS system properties.
>     >>2) Instantiates a JCAS object.
>     >>3) Runs the default pipeline.
>     >>4) (Not shown below) Grabs the annotations and places them in a JSON
>     >>object for each note.
>     >>
>     >>  def generateAnnotations(paragraph:String): String = {
>     >>    System.setProperty("ctakes.umlsuser", "MY_UMLS_USERNAME")
>     >>    System.setProperty("ctakes.umlspw", "MY_UMLS_PASSWORD")
>     >>
>     >>    var jcas =
>     >>      JCasFactory.createJCas("org.apache.ctakes.typesystem.types.TypeSystem")
>     >>    var aed = ClinicalPipelineFactory.getDefaultPipeline()
>     >>    jcas.setDocumentText(paragraph)
>     >>    SimplePipeline.runPipeline(jcas, aed)
>     >>    ...
>     >>
>     >>This function is being implemented as a UDF which is applied to a
>     >>Spark Dataframe with clinical notes in each row. I have two
>     >>implementation questions that follow:
>     >>
>     >>1) When cTAKES is being applied iteratively to clinical notes, is it
>     >>necessary to instantiate a new JCAS object for every annotation? Or
>     >>can the same JCAS object be utilized over and over with the document
>     >>text being changed?
>     >>2) For each application of this function, the
>     >>UmlsDictionaryLookupAnnotator has to connect to UMLS using the
>     >>provided UMLS information. Is there any way to instead perform
>     >>this step locally? I.e., ingest UMLS and place it in either HDFS or
>     >>just mount it somewhere on each node? I'm worried about spamming the
>     >>UMLS server in this step, and about how long this seems to take.
>     >>
>     >>Thanks,
>     >>
>     >>Mike
>     >>
>     >>
>     >>--
>     >>
>     >>Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
>     >>[email protected] | 845 - 270 - 3129 (m) | www.metistream.com
>     >
>
>
>
>     --
>
>     Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
>     [email protected] | 845 - 270 - 3129 (m) | www.metistream.com
>
>
>


-- 
Mike Trepanier| Big Data Engineer | MetiStream, Inc. |  [email protected] |
845 - 270 - 3129 (m) | www.metistream.com
