Chris,

We actually based our implementation on https://github.com/selinachu/SparkStreamingCTK, which I believe came from your team, and updated it for cTAKES 4.0.

For anyone else trying this: you'll likely run into issues with the LVG annotator, as outlined here: https://issues.apache.org/jira/browse/CTAKES-445
The comments on that issue provide a solution: essentially, ensure the cTAKES resources zip is on your classpath. In a cluster environment, this means placing those resources on every node at that same classpath location. Fingers crossed that a future release lets cTAKES be packaged into a fat jar with no issues.

Mike

On Fri, Jul 28, 2017 at 1:17 PM, Mattmann, Chris A (3010) <[email protected]> wrote:

> FYI for interest, my JPL team implemented a prototype of this in 2015:
>
> https://www.mail-archive.com/[email protected]/msg01082.html
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 7/28/17, 11:19 AM, "Michael Trepanier" <[email protected]> wrote:
>
> > That's an excellent suggestion! In a Spark implementation, build the
> > pipeline outside of the map function and pass the aed in as an input.
> > Then just ensure the jcas object persists across mapping iterations
> > and call its reset method between notes.
> >
> > On Fri, Jul 28, 2017 at 11:11 AM, Abramowitsch, Peter
> > <[email protected]> wrote:
> >
> > > About your second question, on UMLS: you can build the pipeline
> > > once up front and it will verify the license info; then just reuse
> > > that pipeline on each call.
> > >
> > > On 7/25/17, 4:53 PM, "Michael Trepanier" <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am currently leveraging cTAKES inside of Apache Spark and have
> > > > written a function that takes in a single clinical note as a string
> > > > and does the following:
> > > >
> > > > 1) Sets the UMLS system properties.
> > > > 2) Instantiates a JCas object.
> > > > 3) Runs the default pipeline.
> > > > 4) (Not shown below) Grabs the annotations and places them in a JSON
> > > >    object for each note.
> > > >
> > > >   def generateAnnotations(paragraph:String): String = {
> > > >     System.setProperty("ctakes.umlsuser", "MY_UMLS_USERNAME")
> > > >     System.setProperty("ctakes.umlspw", "MY_UMLS_PASSWORD")
> > > >
> > > >     var jcas = JCasFactory.createJCas("org.apache.ctakes.typesystem.types.TypeSystem")
> > > >     var aed = ClinicalPipelineFactory.getDefaultPipeline()
> > > >     jcas.setDocumentText(paragraph)
> > > >     SimplePipeline.runPipeline(jcas, aed)
> > > >     ...
> > > >
> > > > This function is being implemented as a UDF which is applied to a
> > > > Spark DataFrame with clinical notes in each row. I have two
> > > > implementation questions that follow:
> > > >
> > > > 1) When cTAKES is being applied iteratively to clinical notes, is it
> > > >    necessary to instantiate a new JCas object for every annotation?
> > > >    Or can the same JCas object be reused over and over, with only
> > > >    the document text being changed?
> > > > 2) For each application of this function, the
> > > >    UmlsDictionaryLookupAnnotator has to connect to UMLS using the
> > > >    provided UMLS information. Is there any way to instead perform
> > > >    this step locally, i.e., ingest UMLS and place it in HDFS or
> > > >    just mount it somewhere on each node? I'm worried about spamming
> > > >    the UMLS server in this step, and about how long it seems to
> > > >    take.
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > > > --
> > > > Mike Trepanier | Big Data Engineer | MetiStream, Inc. |
> > > > [email protected] | 845 - 270 - 3129 (m) | www.metistream.com
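
For anyone reading this thread later, the reuse suggestion above (build the pipeline once, keep one JCas, and call reset between notes) might look roughly like the Scala sketch below. It is only a sketch under a few assumptions that are not from the thread: it works on an RDD[String] with mapPartitions rather than the UDF from the original question, it instantiates the engine on each executor (once per partition) instead of passing the aed in from the driver, and it returns the annotation count as a placeholder for the JSON step. The names ReusePipelineSketch and annotateAll are hypothetical.

```scala
import org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory
import org.apache.spark.rdd.RDD
import org.apache.uima.UIMAFramework
import org.apache.uima.fit.factory.JCasFactory

// Hypothetical helper object; not part of cTAKES or of the code in this thread.
object ReusePipelineSketch {

  // Returns one annotation count per note (placeholder for the JSON output step).
  def annotateAll(notes: RDD[String]): RDD[Int] =
    notes.mapPartitions { partition =>
      // Everything in this block runs once per partition, on the executor.
      System.setProperty("ctakes.umlsuser", "MY_UMLS_USERNAME")
      System.setProperty("ctakes.umlspw", "MY_UMLS_PASSWORD")

      // Instantiate the engine once so initialization (including the UMLS
      // license check mentioned in the thread) is not repeated per note.
      val aed = ClinicalPipelineFactory.getDefaultPipeline()
      val engine = UIMAFramework.produceAnalysisEngine(aed)

      // One JCas per partition, reused for every note.
      val jcas = JCasFactory.createJCas("org.apache.ctakes.typesystem.types.TypeSystem")

      partition.map { note =>
        jcas.reset()                  // clear text and annotations from the previous note
        jcas.setDocumentText(note)
        engine.process(jcas)
        // Extract whatever is needed before the next reset() wipes the CAS.
        jcas.getAnnotationIndex().size()
      }
    }
}
```

Creating the engine inside mapPartitions is a design choice of this sketch: it avoids having Spark serialize pipeline objects from the driver, at the cost of one pipeline initialization per partition. The cTAKES resources still need to be on each executor's classpath, as described at the top of this message.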
