Re: Implementation Improvements for cTAKES on top of Spark

Mattmann, Chris A (3010) Fri, 28 Jul 2017 13:17:47 -0700

FYI for interest, my JPL team implemented a prototype of this in 2015:

https://www.mail-archive.com/[email protected]/msg01082.html




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 

On 7/28/17, 11:19 AM, "Michael Trepanier" <[email protected]> wrote:

    That's an excellent suggestion! In a Spark implementation, I can
    build the pipeline outside of the map function and pass the aed in
    as an input, then make sure the jcas object persists across mapping
    iterations and call its reset method between notes.
    
    On Fri, Jul 28, 2017 at 11:11 AM, Abramowitsch, Peter
    <[email protected]> wrote:
    > Regarding your second question about UMLS: you can build the
    > pipeline once up front, at which point it will verify the license
    > info, and then just reuse that pipeline on each call.
    >
    >
    >
    > On 7/25/17, 4:53 PM, "Michael Trepanier" <[email protected]> wrote:
    >
    >>Hi,
    >>
    >>I am currently leveraging cTAKES inside of Apache Spark and have
    >>written a function that takes in a single clinical note as a string
    >>and does the following:
    >>
    >>1) Sets the UMLS system properties.
    >>2) Instantiates a JCas object.
    >>3) Runs the default pipeline.
    >>4) (Not shown below) Grabs the annotations and places them in a JSON
    >>object for each note.
    >>
    >>  def generateAnnotations(paragraph:String): String = {
    >>    System.setProperty("ctakes.umlsuser", "MY_UMLS_USERNAME")
    >>    System.setProperty("ctakes.umlspw", "MY_UMLS_PASSWORD")
    >>
    >>    var jcas = JCasFactory.createJCas(
    >>      "org.apache.ctakes.typesystem.types.TypeSystem")
    >>    var aed = ClinicalPipelineFactory.getDefaultPipeline()
    >>    jcas.setDocumentText(paragraph)
    >>    SimplePipeline.runPipeline(jcas, aed)
    >>    ...
    >>
    >>This function is being implemented as a UDF which is applied to a
    >>Spark Dataframe with clinical notes in each row. I have two
    >>implementation questions that follow:
    >>
    >>1) When cTAKES is being applied iteratively to clinical notes, is it
    >>necessary to instantiate a new JCAS object for every annotation? Or
    >>can the same JCAS object be utilized over and over with the document
    >>text being changed?
    >>2) For each application of this function, the
    >>UmlsDictionaryLookupAnnotator has to connect to UMLS using the
    >>provided UMLS information. Is there any way to instead perform
    >>this step locally, i.e., ingest UMLS and place it in either HDFS or
    >>just mount it somewhere on each node? I'm worried about spamming the
    >>UMLS server in this step, and about how long this seems to take.
    >>
    >>Thanks,
    >>
    >>Mike
    >>
    >>
    >>--
    >>
    >>Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
    >>[email protected] | 845 - 270 - 3129 (m) | www.metistream.com
    >
    
    
    
    -- 
    
    Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
    [email protected] | 845 - 270 - 3129 (m) | www.metistream.com
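
The reuse pattern suggested in this thread (build the pipeline once, keep a single JCas, and call reset between notes) could look roughly like the Scala sketch below. It is an illustration under assumptions, not code from the thread: the object name `CtakesPerPartition`, the use of `mapPartitions` on an `RDD[String]`, instantiating the engine with `UIMAFramework.produceAnalysisEngine`, and returning an annotation count in place of the JSON step are all choices made for this example. It also assumes cTAKES, uimaFIT, and Spark are on the classpath.

```scala
import org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory
import org.apache.spark.rdd.RDD
import org.apache.uima.UIMAFramework
import org.apache.uima.fit.factory.JCasFactory

// Hypothetical helper object; the name is not from the thread.
object CtakesPerPartition {

  // Called once per partition. The engine and the JCas are created lazily on
  // the first note and then reused for every other note in that partition.
  def annotatePartition(notes: Iterator[String]): Iterator[Int] = {
    lazy val engine = {
      // Placeholder credentials, as in the original post.
      System.setProperty("ctakes.umlsuser", "MY_UMLS_USERNAME")
      System.setProperty("ctakes.umlspw", "MY_UMLS_PASSWORD")
      // Instantiate the default pipeline once instead of once per note.
      UIMAFramework.produceAnalysisEngine(
        ClinicalPipelineFactory.getDefaultPipeline())
    }
    lazy val jcas =
      JCasFactory.createJCas("org.apache.ctakes.typesystem.types.TypeSystem")

    notes.map { note =>
      jcas.reset()               // drop the previous note's text and annotations
      jcas.setDocumentText(note)
      engine.process(jcas)
      // Placeholder result: total number of annotations in the CAS.
      // Swap in the JSON extraction step from the original function here.
      jcas.getAnnotationIndex().size()
    }
  }

  // Applies the per-partition function to an RDD of clinical notes.
  def annotateAll(notes: RDD[String]): RDD[Int] =
    notes.mapPartitions(annotatePartition)
}
```

Calling `CtakesPerPartition.annotateAll(notesRdd)` would then build one engine and one JCas per partition rather than one per note. Because both are lazy vals, an empty partition builds nothing.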
