Hi Eddie,

The problem here is that one file does not mean one document.

Imagine that each pointer points to a zip containing 2000 documents.
How do you populate not one but 2000 CASes from each pointer? If you are
limited by a CAS pool, that is fine; otherwise you will create too
many new CASes and exceed the available memory.

We need a component that takes one CAS as input and can output any
number of CASes (the CASMultiplier), but with a limit so that the number
of CASes alive at any one time stays below a certain threshold.
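
As a rough sketch of that idea in plain Java (this is not the UIMA API;
FakeCas, FakeCasPool, Multiplier and run are hypothetical names made up
for illustration), a multiplier can draw its output CASes from a
fixed-size pool, so that it has to wait for a CAS to be released before
it can emit the next document:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedMultiplierSketch {

    /** Hypothetical stand-in for a CAS: a reusable holder for one document. */
    static final class FakeCas {
        String text;
        void reset() { text = null; }
    }

    /** Fixed-size pool: take() blocks while every CAS is in use, which caps memory. */
    static final class FakeCasPool {
        private final BlockingQueue<FakeCas> free;
        private final int size;
        private int maxInUse = 0; // demo counter, only meaningful single-threaded

        FakeCasPool(int size) {
            this.size = size;
            this.free = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) free.add(new FakeCas());
        }

        FakeCas take() {
            try {
                FakeCas cas = free.take(); // waits until a CAS is released
                maxInUse = Math.max(maxInUse, size - free.size());
                return cas;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a CAS", e);
            }
        }

        void release(FakeCas cas) {
            cas.reset();
            free.add(cas);
        }

        int maxInUse() { return maxInUse; }
    }

    /** One input (an archive of many documents) in, many CASes out, one at a time. */
    static final class Multiplier {
        private final Iterator<String> entries;
        private final FakeCasPool pool;

        Multiplier(List<String> archiveEntries, FakeCasPool pool) {
            this.entries = archiveEntries.iterator();
            this.pool = pool;
        }

        boolean hasNext() { return entries.hasNext(); }

        FakeCas next() {
            FakeCas cas = pool.take(); // never allocates a new CAS
            cas.text = entries.next();
            return cas;
        }
    }

    /** Returns {documents processed, peak number of CASes in use}. */
    static int[] run(int documents, int poolSize) {
        List<String> archive = new ArrayList<>();
        for (int i = 0; i < documents; i++) archive.add("doc-" + i);

        FakeCasPool pool = new FakeCasPool(poolSize);
        Multiplier multiplier = new Multiplier(archive, pool);
        int processed = 0;
        while (multiplier.hasNext()) {
            FakeCas cas = multiplier.next();
            processed++;       // downstream processing would happen here
            pool.release(cas); // releasing lets the next document proceed
        }
        return new int[] { processed, pool.maxInUse() };
    }

    public static void main(String[] args) {
        int[] result = run(2000, 4);
        System.out.println(result[0] + " documents, peak CASes in use: " + result[1]);
    }
}
```

In this sketch the 2000 documents of one archive never need more CASes
than the pool holds; the cost is that the multiplier blocks whenever the
downstream components are slow to release them.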

My other concern about the CAS pool is that if I have multiple CAS
Multipliers in a CPE that is run over 5 pipelines, it becomes
difficult to choose a CAS pool size that both prevents pipelines from
blocking for lack of CASes and does not use too much memory.

Best,
Eric


Eddie Epstein wrote:
> Hi Eric,
> 
> A collection reader typically does 3 things: create/access content to be
> analyzed, obtain a new CAS to be passed to other CAS processors, and
> initialize the CAS with the content (assuming the CAS initializer is part of
> the collection reader). These three things could be done in different
> components. For example, the collection reader could populate a CAS with a
> pointer to new content and a content type label, and the content itself
> accessed from a second component which then populated the CAS. The secondary
> component could be called based on the content type and would not
> have to be a CAS multiplier.
> 
> I don't yet understand enough about your configuration to see the CAS pool
> issue.
> 
> Regards,
> Eddie
> 
> 
> On 9/25/07, Eric Vachon <[EMAIL PROTECTED]> wrote:
>> Hi Eddie,
>>
>> In fact we would like to have the unzipping and other file aggregation
>> processes resolved outside the collection reader in order to be able to
>> handle numerous formats and numerous crawling facilities without having
>> to create a collection reader for every combination of crawling and
>> format. The CASMultiplier seemed like a simple and efficient way to manage
>> this case, if it worked inside the CPE. Today we handle the
>> process through parameters, but it would be easier for people using our
>> collection reader to be able to add the CASMultiplier to the CPE to
>> handle the format.
>>
>> The problems we face with the current implementation are that there is no
>> CAS pool for the CASMultiplier, which can lead to memory issues, and that
>> the CPE does not handle this type of processor. Using an aggregate
>> annotator would make error management more difficult.
>>
>> We can manage without the CASMultiplier by using plugins for our
>> collection readers, but we would rather use a standard UIMA feature.
>>
>> Best,
>> Eric Vachon
>>
>> Eddie Epstein wrote:
>>> Muon,
>>>
>>> I'm not sure exactly what your question is. A CPE based on the CPM uses a
>>> Collection Reader with an optional Cas Initializer. In UIMA 2.x it is
>>> possible to have a Cas Multiplier as a UIMA aggregate component. A
>>> Collection Reader, minus Cas Initializer, is considered a subset of a Cas
>>> Multiplier and can also be used in a UIMA aggregate.
>>>
>>> If none of this answers your question, please try to clarify.
>>>
>>> Thanks,
>>> Eddie Epstein
>>>
>>> On 9/24/07, Muon Le <[EMAIL PROTECTED]> wrote:
>>>> Hi Adam,
>>>>
>>>> I am Muon LE. I would like to replace my CAS Initializers with CAS
>>>> Multipliers.
>>>> I know there is a limitation in the CPE (UIMA-2.2) on using CAS
>>>> Multipliers.
>>>> Do you know when this limitation will be removed?
>>>>
>>>> Thank you,
>>>> Muon LE.
>>>>
>>>> -----Original Message-----
>>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Adam Lally
>>>> Sent: Thursday, July 19, 2007 15:38
>>>> To: [email protected]
>>>> Subject: Re: Multi-threading with a CAS Multiplier.
>>>>
>>>> On 7/19/07, Benjamin Sznajder <[EMAIL PROTECTED]> wrote:
>>>>> <snip/>
>>>>> I would like to get your opinion about the following workaround:
>>>>> Why don't we hide the steps done by the CAS Multiplier in the
>>>>> Collection Reader: the collection reader will read a document of 10
>>>>> minutes long, and will create 10 CASes corresponding to our 5 and 5
>>>>> CASes of video and speech of 2 minutes duration?
>>>>> If we do the above, then setting the processingUnitThreadcount to 3 (or
>>>>> more) will create three (or more) instances of our AggregateEngine2
>>>>> and we would get real parallelization between our 10 CASes. Do I miss
>>>>> something?
>>>> That should work.  As Eddie said, the CPE understands Collection Readers
>>>> but doesn't know anything about CAS Multipliers.
>>>>
>>>> -Adam
>>>>
>>
> 
