Hi Mike,

Would you mind sharing the specifications of the nodes you used in your
Spark cluster (cores, memory, disk)? We were originally running nodes with
more than 2 cores, but ran into occasional issues with contention over the
HSQLDB lock on the dictionary file whenever more than one executor tried to
run on the same node. We considered various solutions, but the easiest was
just to run 2-core nodes, one core for the executor and one for the Spark
driver (assuming we have enough machines). I am not sure why HSQLDB needs
to lock the file anyway, since it should not be changing it, but so far we
have not had success modifying the connection string to ask it to open the
file as read-only. Either way, it would be helpful to learn about the setup
you settled on.
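
For reference, below is a rough sketch of the kind of thing we have been
trying. The dictionary path is just a placeholder, and the readonly /
files_readonly connection properties are simply what we understand HSQLDB
accepts for file databases, so treat this as an untested example rather
than a working setup:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class ReadOnlyDictCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder path to the dictionary's HSQLDB catalog.
            // readonly=true asks HSQLDB to open the catalog fully read-only;
            // files_readonly=true is the variant that keeps the data files
            // read-only but still allows temporary tables.
            String url = "jdbc:hsqldb:file:/path/to/dictionary/db;readonly=true";
            try (Connection conn = DriverManager.getConnection(url, "SA", "")) {
                System.out.println("Opened read-only: " + conn.isReadOnly());
            }
        }
    }

If that pans out, the same property would presumably need to be appended to
the jdbcUrl in the dictionary descriptor XML that the pipeline uses.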

Thanks,
Jeff


On 2019/02/27 22:59:56, Michael Trepanier <m...@metistream.com> wrote:
> Hi,
>
> We currently have a pipeline which is generating ontology mappings for a
> repository of clinical notes. However, this repository contains documents
> which, after RTF parsing, can contain over 900,000 characters (albeit this
> is a very small percentage of notes; out of ~13 million, around 50 contain
> more than 100k chars). Looking at some averages across the dataset, it is
> clear that the processing time is exponentially related to the note length:
>
> 0-10000 chars: 0.9 seconds (11 million notes)
> 10000-20000 chars: 5.625 seconds (1.5 million notes)
> 210000-220000 chars: 4422 seconds/1.22 hours (3 notes)
> 900000-1000000 chars: 103237 seconds/28.6 hours (1 note)
>
> Given these results, splitting the longer docs into partitions would speed
> up the pipeline considerably. However, our team has some concerns over how
> that might impact the context aware steps of the cTAKES pipeline. How would
> the results from splitting a doc on its sentences or paragraphs compare to
> feeding in an entire doc? Does the default pipeline API support a way to
> use segments instead of the entire document text?
>
> Regards,
>
> Mike
