RE: cTakes on Apache Spark - Error

2018-05-03 Thread Eskala, Nagakalyana
Thank you, Ewan, for pointing us to the patch. We will make use of it and 
troubleshoot the error.
We appreciate the help.
Naga

-----Original Message-----
From: Ewan Mellor [mailto:ctakes-...@ewanmellor.org.uk]
Sent: Tuesday, May 01, 2018 5:24 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes on Apache Spark - Error

There is a point in LvgCmdApiResourceImpl where it changes the working 
directory so that LVG can find the config file.  I have no idea how this would 
be supposed to work on Spark, but I guess that using relative paths in your 
config is going to be a problem.

There is also a point in LvgCmdApiResourceImpl where it is converting a URI to 
a File instance.  You should check whether that's ending up with an hdfs URL, 
and if so whether it is doing the right thing.

I would make sure that you have all the logging coming out from 
LvgCmdApiResourceImpl and check that the paths are correct.  You could also 
look at my patch on https://issues.apache.org/jira/browse/CTAKES-501,
which includes some additional logging in this area.

HTH,

Ewan.


Re: cTakes on Apache Spark - Error

2018-05-01 Thread Ewan Mellor
There is a point in LvgCmdApiResourceImpl where it changes the working
directory so that LVG can find the config file.  I have no idea how this
would be supposed to work on Spark, but I guess that using relative paths
in your config is going to be a problem.
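The relative-path concern can be illustrated with a small, self-contained Java sketch. It does not reproduce the cTAKES or LVG code; the class name, the helper resolveAgainst, and the two base directories are hypothetical and only show that the same relative string (here the "./resources/" value from the thread) names a different location depending on which directory it is resolved against.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class, not part of cTAKES or Spark.
class RelativePathDemo {

    // Resolves a possibly relative path string against an explicit base directory.
    static Path resolveAgainst(Path baseDir, String configured) {
        return baseDir.resolve(configured).normalize();
    }

    public static void main(String[] args) {
        // Relative paths are resolved against the JVM's working directory.
        Path cwd = Paths.get(System.getProperty("user.dir"));
        System.out.println("JVM working directory: " + cwd);
        System.out.println("./resources/ resolves to: "
                + Paths.get("./resources/").toAbsolutePath().normalize());

        // Assumed base directories, for illustration only.
        Path baseA = Paths.get("/tmp");
        Path baseB = Paths.get("/var/spark/work");
        System.out.println(resolveAgainst(baseA, "./resources/"));
        System.out.println(resolveAgainst(baseB, "./resources/"));
    }
}
```

Logging the absolute form of every configured path, as in the first two print statements, is one way to see which directory a given executor actually ends up using.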

There is also a point in LvgCmdApiResourceImpl where it is converting a URI
to a File instance.  You should check whether that's ending up with an
hdfs URL, and if so whether it is doing the right thing.
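As a hedged illustration of why the URI-to-File conversion is worth checking: the standard java.io.File(URI) constructor only accepts URIs whose scheme is "file" and throws IllegalArgumentException for anything else. The sketch below is not the LvgCmdApiResourceImpl code; the class name and the helper toLocalFileOrNull are hypothetical, and the HDFS URI is the one from the snippet in this thread.

```java
import java.io.File;
import java.net.URI;

// Hypothetical demo class, not part of cTAKES.
class UriToFileDemo {

    // Returns a File for file: URIs, or null when the URI uses another scheme.
    static File toLocalFileOrNull(URI uri) {
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            return null;
        }
        return new File(uri);
    }

    public static void main(String[] args) {
        URI hdfsUri = URI.create("hdfs:///ctakes_4.0.0/resources");

        // Direct conversion of a non-file URI fails at runtime.
        try {
            File f = new File(hdfsUri);
            System.out.println("Unexpectedly converted: " + f);
        } catch (IllegalArgumentException e) {
            System.out.println("hdfs URI rejected: " + e.getMessage());
        }

        // A guarded conversion makes the non-local case explicit instead.
        System.out.println("guarded hdfs result: " + toLocalFileOrNull(hdfsUri));
        System.out.println("guarded file result: "
                + toLocalFileOrNull(new File("lvg.properties").toURI()));
    }
}
```

If the logging shows an hdfs URI reaching this kind of conversion, the file would need to be made available on the local filesystem first.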

I would make sure that you have all the logging coming out from
LvgCmdApiResourceImpl and check that the paths are correct.  You could
also look at my patch on https://issues.apache.org/jira/browse/CTAKES-501,
which includes some additional logging in this area.

HTH,

Ewan.


RE: cTakes on Apache Spark - Error

2018-05-01 Thread Eskala, Nagakalyana
An update on the issue:

We have extracted the LVG-related files into the exact folder structure and 
are copying all the folders recursively into the Spark executor working 
directory using the addFile option. However, LvgAnnotator still cannot find 
the lvg.properties file on the Spark executor's classpath, even though we have 
set the spark.executor.extraClassPath configuration option.

Code snippet:
sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
sparkConf.set("spark.executor.extraClassPath", "./resources/");
sparkConf.set("spark.driver.extraClassPath", "./resources/");




CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is
for the sole use of the intended recipient(s) and may contain confidential
and privileged information or may otherwise be protected by law. Any
unauthorized review, use, disclosure or distribution is prohibited. If you
are not the intended recipient, please contact the sender by reply e-mail
and destroy all copies of the original message and any attachment thereto.


cTakes on Apache Spark - Error

2018-04-30 Thread Eskala, Nagakalyana
Background:
We are trying to run the Apache cTAKES default clinical pipeline in a Spark 
Streaming application. We intend to parse all input text sent to a socket by 
running the default clinical pipeline in the individual executors of the 
Spark application.

Challenges:
The cTAKES pipeline requires external resources to be available on the 
classpath. We used JavaSparkContext.addFile to copy all the resources 
(dictionaries) recursively from HDFS to each executor's working directory. 
Once addFile has copied the resources to each executor, we try to add them to 
each executor's classpath through the following configuration.

sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
sparkConf.set("spark.executor.extraClassPath", "./resources/");
sparkConf.set("spark.driver.extraClassPath", "./resources/");

Error:
The error occurs in the LvgAnnotator class, which tries to look up the 
lvg.properties file. It cannot locate the file, and the task then fails with 
a NullPointerException.

18/04/30 15:55:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 4744 bytes)
18/04/30 15:55:50 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
18/04/30 15:55:51 INFO ae.LvgAnnotator: URL==null
18/04/30 15:55:51 INFO ae.LvgAnnotator: Unable to find org/apache/ctakes/lvg/data/config/lvg.properties.
18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying files and directories to under /tmp/
18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying lvg-related file to /tmp/data/config/lvg.properties
18/04/30 15:55:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.FileUtils.copyInputStreamToFile(FileUtils.java:1512)
at org.apache.ctakes.lvg.ae.LvgAnnotator.copyLvgFiles(LvgAnnotator.java:620)
at org.apache.ctakes.lvg.ae.LvgAnnotator.createAnnotatorDescription(LvgAnnotator.java:649)
at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getTokenProcessingPipeline(ClinicalPipelineFactory.java:110)
at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getDefaultPipeline(ClinicalPipelineFactory.java:68)
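
One plausible reading of this trace, not confirmed anywhere in the thread, is that the classpath lookup returned null (the "URL==null" line) and the copy step was then handed a null input stream. In plain Java, ClassLoader.getResourceAsStream returns null, rather than throwing, when a resource is not on the classpath. The sketch below is not the LvgAnnotator code; the class name and the helper copyResource are hypothetical and only show a null check that turns the failure into a clearer error.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical demo class, not part of cTAKES.
class ResourceCopyDemo {

    // Copies a classpath resource to target, failing clearly if it is missing.
    static void copyResource(String resourceName, Path target) throws IOException {
        ClassLoader loader = ResourceCopyDemo.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                // getResourceAsStream signals "not found" with null, not an exception.
                throw new FileNotFoundException("Resource not on classpath: " + resourceName);
            }
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Resource name taken from the log output above.
        String name = "org/apache/ctakes/lvg/data/config/lvg.properties";
        Path target = Files.createTempDirectory("lvg-demo").resolve("lvg.properties");
        try {
            copyResource(name, target);
            System.out.println("Copied to " + target);
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running a check like this inside the executor would show whether the resource is visible to the class loader there at all, independently of the rest of the pipeline.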


Question:
Since the resources folder has been recursively added to each executor node 
and the classpath has been set, we expected the executor to be able to locate 
the properties and other resource files. However, that is not the case. Is 
there something we should be doing differently (configuration, classpath, 
etc.) so that the cTAKES pipeline can run in a Spark executor with all the 
resources and the classpath set appropriately?

Thanks for the help.

