Hi Eddie,

Another example: 

You want to crawl a directory that contains Medline XML files. Each of these
files contains several Medline documents, each of which must be put into its
own CAS. We would like a clear separation between crawling the directory and
converting the Medline XML format. That way, it would be easy to fetch Medline
documents over FTP by changing only the source connection part, or to handle
a different type of file by changing only the document conversion part.

Best,

Pascal

-----Original Message-----
From: Eric Vachon [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 25, 2007 12:26 PM
To: [email protected]
Subject: Re: Multi-threading with a CAS Multiplier.

Hi Eddie,

The problem here is that one file does not mean one document.

Imagine that each pointer points to a zip containing 2000 documents.
How do you populate not one but 2000 CASes from each pointer? If you are
constrained by a CAS Pool, that is fine; otherwise you will create too
many new CASes and exceed the available memory.

We need a component that takes one CAS as input and can output any
number of CASes (the CASMultiplier), but with a limit so that the number
of CASes alive at any one time never exceeds a certain threshold.
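
As a rough illustration of such a threshold, here is a sketch in plain Java.
It is not UIMA API: BoundedPool and multiply are made-up names, and a
semaphore slot stands in for a CAS. The producer blocks when all slots are
taken instead of allocating a new one, so the number in flight never exceeds
the capacity.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedCasPoolSketch {

    /** Hypothetical stand-in for a CAS pool: hands out at most `capacity` slots at once. */
    static final class BoundedPool {
        private final Semaphore slots;
        private final AtomicInteger inUse = new AtomicInteger();
        private final AtomicInteger peak = new AtomicInteger();

        BoundedPool(int capacity) {
            slots = new Semaphore(capacity);
        }

        /** Blocks until a slot is free, so the producer cannot outrun the consumers. */
        void acquire() {
            slots.acquireUninterruptibly();
            int now = inUse.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // remember the highest count seen
        }

        /** Called once downstream processing of one document has finished. */
        void release() {
            inUse.decrementAndGet();
            slots.release();
        }

        int peak() {
            return peak.get();
        }
    }

    /** Emits `documents` items from one input and returns the peak number in flight. */
    static int multiply(int documents, int capacity, int workers) {
        BoundedPool pool = new BoundedPool(capacity);
        ExecutorService downstream = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < documents; i++) {
            pool.acquire();                    // wait here rather than create a new CAS
            downstream.execute(pool::release); // downstream "processes", then frees the slot
        }
        downstream.shutdown();
        try {
            downstream.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return pool.peak();
    }

    public static void main(String[] args) {
        int peak = multiply(2000, 4, 3);
        System.out.println("bound held: " + (peak <= 4));
    }
}
```

With several multipliers sharing one pool, the capacity has to cover the
CASes each of them holds at once, which is exactly the sizing problem I
describe below.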

My other concern about the CAS Pool is that if I have multiple CAS
Multipliers in a CPE that is launched over 5 pipelines, it becomes
difficult to choose a CAS Pool size that is large enough to avoid
blocking for lack of CASes yet does not use too much memory.

Best,
Eric


Eddie Epstein wrote:
> Hi Eric,
> 
> A collection reader typically does 3 things: create/access content to be
> analyzed, obtain a new CAS to be passed to other CAS processors, and
> initialize the CAS with the content (assuming the CAS initializer is part of
> the collection reader). These three things could be done in different
> components. For example, the collection reader could populate a CAS with a
> pointer to new content and a content type label, and the content itself
> accessed from a second component which then populated the CAS. The secondary
> component could be called based on the content type and would not
> have to be a CAS multiplier.
> 
> I don't yet understand enough about your configuration to see the CAS pool
> issue.
> 
> Regards,
> Eddie
> 
> 
> On 9/25/07, Eric Vachon <[EMAIL PROTECTED]> wrote:
>> Hi Eddie,
>>
>> In fact we would like to have the unzipping and other file aggregation
>> processes resolved outside the collection reader in order to be able to
>> handle numerous formats and numerous crawling facilities without having
>> to create a collection reader for every combination of crawling and
>> format. The CASMultiplier looked like a simple and efficient way to manage
>> this case, if only it worked inside the CPE. Today we handle the process
>> through parameters, but it would be easier for people using our collection
>> reader if they could add a CASMultiplier to the CPE to handle the format.
>>
>> The problems we face with the current implementation are that there is no
>> CAS pool for the CASMultiplier, which can lead to memory issues, and that
>> the CPE does not handle this type of processor. Using an aggregate
>> annotator would make error management more difficult.
>>
>> We can manage without the CASMultiplier by using plugins for our
>> collection readers, but we would rather use a standard UIMA feature.
>>
>> Best,
>> Eric Vachon
>>
>> Eddie Epstein wrote:
>>> Muon,
>>>
>>> I'm not sure exactly what your question is. A CPE based on the CPM uses a
>>> Collection Reader with an optional Cas Initializer. In UIMA 2.x it is
>>> possible to have a Cas Multiplier as a UIMA aggregate component. A
>>> Collection Reader, minus Cas Initializer, is considered a subset of a Cas
>>> Multiplier and can also be used in a UIMA aggregate.
>>>
>>> If none of this answers your question, please try to clarify.
>>>
>>> Thanks,
>>> Eddie Epstein
>>>
>>> On 9/24/07, Muon Le <[EMAIL PROTECTED]> wrote:
>>>> Hi Adam,
>>>>
>>>> I am Muon LE. I would like to replace my CAS Initializers with CAS
>>>> Multipliers.
>>>> I know that the CPE (UIMA-2.2) cannot currently use a CAS Multiplier.
>>>> Do you know when this limitation will be removed?
>>>>
>>>> Thank you,
>>>> Muon LE.
>>>>
>>>> -----Original Message-----
>>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Adam Lally
>>>> Sent: Thursday, July 19, 2007 15:38
>>>> To: [email protected]
>>>> Subject: Re: Multi-threading with a CAS Multiplier.
>>>>
>>>> On 7/19/07, Benjamin Sznajder <[EMAIL PROTECTED]> wrote:
>>>>> <snip/>
>>>>> I would like to get your opinion about the following workaround:
>>>>> Why don't we hide the steps done by the CAS Multiplier in the
>>>>> Collection Reader: the collection reader will read a 10-minute
>>>>> document and will create 10 CASes, corresponding to our 5 video and
>>>>> 5 speech CASes of 2 minutes each?
>>>>> If we do the above, then setting the processingUnitThreadCount to 3
>>>>> (or more) will create three (or more) instances of our AggregateEngine2
>>>>> and we would get real parallelization between our 10 CASes. Am I missing
>>>>> something?
>>>> That should work. As Eddie said, the CPE understands Collection Readers
>>>> but doesn't know anything about CAS Multipliers.
>>>>
>>>> -Adam
>>>>
>>
> 
