Hi Ron,

Hugely appreciate the response. Do you know the maximum document size you
fed through your pipeline? Below is a line histogram of our note length
vs. processing time (in ns). At the lower end we see a similar performance
drop-off starting around 20,000 chars, with more or less exponential growth
in runtime from there on out.

[image: note length vs. processing time histogram]
Our current setup uses 256 Spark executors (essentially JVMs), each with
7 GB of RAM and 1 core, and feeds them partitions of ~20,000 notes each.
With this config we burned through 99% of the notes in under a day, but then
spent nearly a week spinning on the partitions that contained the larger
notes. For your implementation, could you share what the hardware specs were
and how long it took to process the 84M docs?
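
For reference, here is a rough sketch of the equivalent Spark settings and
the repartitioning step (the app name, input path, and the exact repartition
math are placeholders, not our real job code):

    import org.apache.spark.sql.SparkSession

    // 256 single-core executors ("JVMs"), each with 7 GB of heap.
    val spark = SparkSession.builder()
      .appName("ctakes-ontology-mapping")          // placeholder name
      .config("spark.executor.instances", "256")
      .config("spark.executor.memory", "7g")
      .config("spark.executor.cores", "1")
      .getOrCreate()

    // One clinical note per row; the path is a placeholder.
    val notes = spark.read.parquet("/data/notes")

    // Aim for roughly 20,000 notes per partition before running cTAKES.
    val numPartitions = math.max(1, (notes.count() / 20000L).toInt)
    val partitioned = notes.repartition(numPartitions)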

Regards,

Mike

On Thu, Feb 28, 2019 at 10:11 AM Price, Ronald <rpr...@luc.edu> wrote:

> Mike,
>
> We’ve fully processed 84M documents through CTAKES on 3 separate
> occasions.  We constructed a pipeline that has 30 separately controlled
> sub-queues.  We have the ability to target processing of documents to
> specific queues.  We allocate and target 5-10 queues for processing of
> large documents.  Similar to you, we have a small percentage (3%-4%) of
> documents that are over 15K in size.  The bulk of our documents are less
> than 3K.  In our environment and through some detailed performance
> analysis, we determined that the performance breakpoint occurs once
> documents get above 12K-13K.  We also target processing as many as 10
> annotators in a single pass of the corpus.  This approach has worked well
> for us.
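>
> Purely as an illustrative sketch (not our actual code), that length-based
> routing boils down to something like the following, with the 12K breakpoint
> and the queue names as placeholders:
>
>     // Send anything past the observed ~12K-13K breakpoint to one of the
>     // queues reserved for large documents.
>     val largeDocThreshold = 12000
>     def targetQueue(docText: String): String =
>       if (docText.length >= largeDocThreshold) "large-doc-queue"
>       else "standard-queue"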
>
>
>
> Thanks,
>
> Ron
>
>
> From: Michael Trepanier <m...@metistream.com>
> Date: Thursday, February 28, 2019 at 11:57 AM
> To: "user@ctakes.apache.org" <user@ctakes.apache.org>
> Cc: "Price, Ronald" <rpr...@luc.edu>
> Subject: Re: Processing Extraordinarily Long Documents
>
>
>
> Hi Dima,
>
>
>
> Thanks for the feedback! As our pipeline develops, we'll be building in
> additional functionality (e.g., Temporal Relations) that requires context
> beyond that of a single sentence. Given this, partitioning on document
> length and shunting the longer documents to another queue is an excellent
> solution.
>
>
>
> Thanks,
>
>
>
> Mike
>
>
>
> On Thu, Feb 28, 2019 at 4:08 AM Dligach, Dmitriy <ddlig...@luc.edu> wrote:
>
> Hi Mike,
>
>
>
> We also observed this issue. Splitting large documents into smaller ones
> is an option, but you have to make sure you preserve the integrity of
> individual sentences or you might lose some concept mentions. Since you
> are using cTAKES only for ontology mapping, I don’t think you need to worry
> about the integrity of linguistic units larger than a sentence.
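>
> (As a minimal sketch of that kind of sentence-preserving split, on the JVM
> and with an arbitrary 10,000-character cap as the placeholder limit, the
> chunking could look roughly like this:)
>
>     import java.text.BreakIterator
>     import java.util.Locale
>     import scala.collection.mutable.ArrayBuffer
>
>     // Split a document into chunks of at most maxChars characters,
>     // breaking only at sentence boundaries so no sentence is cut in half.
>     // A single sentence longer than maxChars becomes its own chunk.
>     def chunkBySentence(text: String, maxChars: Int = 10000): Seq[String] = {
>       val it = BreakIterator.getSentenceInstance(Locale.US)
>       it.setText(text)
>       val chunks = ArrayBuffer.empty[String]
>       val current = new StringBuilder
>       var start = it.first()
>       var end = it.next()
>       while (end != BreakIterator.DONE) {
>         val sentence = text.substring(start, end)
>         if (current.nonEmpty && current.length + sentence.length > maxChars) {
>           chunks += current.toString
>           current.clear()
>         }
>         current ++= sentence
>         start = end
>         end = it.next()
>       }
>       if (current.nonEmpty) chunks += current.toString
>       chunks.toSeq
>     }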
>
>
>
> FWIW, our solution to this problem was to create a separate queue for
> large documents and process them independently from the smaller documents.
>
>
>
> Best,
>
>
> Dima
>
> On Feb 27, 2019, at 16:59, Michael Trepanier <m...@metistream.com> wrote:
>
>
>
> Hi,
>
>
>
> We currently have a pipeline which is generating ontology mappings for a
> repository of clinical notes. However, this repository contains documents
> which, after RTF parsing, can contain over 900,000 characters (although this
> is a very small percentage: out of ~13 million notes, only around 50 contain
> more than 100k chars). Looking at some averages across the dataset, it is
> clear that the processing time is exponentially related to the note length:
>
>
>
> 0-10000 chars: 0.9 seconds (11 million notes)
>
> 10000-20000 chars: 5.625 seconds (1.5 million notes)
>
> 210000-220000 chars: 4422 seconds/1.22 hours (3 notes)
>
> 900000-1000000 chars: 103237 seconds/28.6 hours (1 note)
>
>
>
> Given these results, splitting the longer docs into partitions would speed
> up the pipeline considerably. However, our team has some concerns over how
> that might impact the context-aware steps of the cTAKES pipeline. How would
> the results from splitting a doc on its sentences or paragraphs compare to
> feeding in an entire doc? Does the default pipeline API support a way to
> use segments instead of the entire document text?
>
>
>
> Regards,
>
>
>
> Mike
>
> --
>
>
> Mike Trepanier | Senior Big Data Engineer | MetiStream, Inc. |
> m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com
>


-- 
Mike Trepanier | Senior Big Data Engineer | MetiStream, Inc. |
m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com
