Hi Eddie,

The problem here is that one file does not mean one document.
Imagine that each pointer points to a zip containing 2000 documents. How do you populate not one but 2000 CASes from each pointer? If you are constrained by a CAS pool, that is fine; otherwise you will create too many new CASes and exceed the available memory. We need a component that takes one CAS as input and can output any number of CASes (the CASMultiplier), but with a limit so that the number of CASes alive at any one time stays below a certain threshold.

My other concern about the CAS pool is that if I have multiple CAS Multipliers in a CPE that is launched over 5 pipelines, it will become difficult to choose a CAS pool size that prevents locks caused by a lack of CASes and yet does not use too much memory.

Best,
Eric

Eddie Epstein wrote:
> Hi Eric,
>
> A collection reader typically does 3 things: create/access content to be
> analyzed, obtain a new CAS to be passed to other CAS processors, and
> initialize the CAS with the content (assuming the CAS initializer is part of
> the collection reader). These three things could be done in different
> components. For example, the collection reader could populate a CAS with a
> pointer to new content and a content type label, and the content itself
> could be accessed by a second component which would then populate the CAS.
> The secondary component could be called based on the content type and would
> not have to be a CAS multiplier.
>
> I don't yet understand enough about your configuration to see the CAS pool
> issue.
>
> Regards,
> Eddie
>
>
> On 9/25/07, Eric Vachon <[EMAIL PROTECTED]> wrote:
>> Hi Eddie,
>>
>> In fact we would like to have the unzipping and other file aggregation
>> processes resolved outside the collection reader in order to be able to
>> handle numerous formats and numerous crawling facilities without having
>> to create a collection reader for every combination of crawling and
>> format.
>> The CASMultiplier looked like a simple and efficient way to handle
>> this case, if it worked inside the CPE. As it is today, we handle the
>> process through parameters, but it would be easier for people using our
>> collection reader to be able to add the CASMultiplier to the CPE to
>> handle the format.
>>
>> The problems we face with the current implementation are that there is no
>> CAS pool for the CASMultiplier, which can lead to memory issues, and that
>> the CPE does not handle this type of processor. Using an aggregate
>> annotator would make error management more difficult.
>>
>> We can manage without the CASMultiplier by using plugins for our
>> collection readers, but we would rather use a standard UIMA feature.
>>
>> Best,
>> Eric Vachon
>>
>> Eddie Epstein wrote:
>>> Muon,
>>>
>>> I'm not sure exactly what your question is. A CPE based on the CPM uses a
>>> Collection Reader with an optional Cas Initializer. In UIMA 2.x it is
>>> possible to have a Cas Multiplier as a UIMA aggregate component. A
>>> Collection Reader, minus Cas Initializer, is considered a subset of a Cas
>>> Multiplier and can also be used in a UIMA aggregate.
>>>
>>> If none of this answers your question, please try to clarify.
>>>
>>> Thanks,
>>> Eddie Epstein
>>>
>>> On 9/24/07, Muon Le <[EMAIL PROTECTED]> wrote:
>>>> Hi Adam,
>>>>
>>>> I am Muon LE. I would like to replace my CAS Initializers with CAS
>>>> Multipliers.
>>>> I know that the CPE (UIMA-2.2) has a limitation on using CAS
>>>> Multipliers.
>>>> Do you know when this limitation will be removed?
>>>>
>>>> Thank you,
>>>> Muon LE.
>>>>
>>>> -----Original Message-----
>>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Adam Lally
>>>> Sent: Thursday, 19 July 2007 15:38
>>>> To: [email protected]
>>>> Subject: Re: Multi-threading with a CAS Multiplier.
>>>>
>>>> On 7/19/07, Benjamin Sznajder <[EMAIL PROTECTED]> wrote:
>>>>> <snip/>
>>>>> I would like to get your opinion about the following workaround:
>>>>> Why don't we hide the steps done by the CAS Multiplier in the
>>>>> Collection Reader: the collection reader will read a document
>>>>> 10 minutes long and will create 10 CASes, corresponding to our
>>>>> 5 video and 5 speech CASes of 2 minutes each?
>>>>> If we do the above, then setting the processingUnitThreadcount to 3
>>>>> (or more) will create three (or more) instances of our
>>>>> AggregateEngine2 and we would get real parallelization between our
>>>>> 10 CASes. Am I missing something?
>>>> That should work. As Eddie said, the CPE understands Collection Readers
>>>> but doesn't know anything about CAS Multipliers.
>>>>
>>>> -Adam
>>>>
>>
>
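For readers following the thread: the requirement in Eric's most recent message (one input, any number of output CASes, but never more than a fixed number alive at once) can be illustrated with a small plain-Java sketch. Nothing here is UIMA API. `FakeCas`, `FakeCasPool`, `FakeMultiplier` and `run` are hypothetical names made up for the illustration, the "zip" is just an in-memory list of strings, and the two-thread executor stands in for the downstream pipeline. The only point is that a fixed-size pool makes the multiplier block instead of allocating without limit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedCasPoolSketch {

    /** Hypothetical stand-in for a CAS: a reusable holder for one document. */
    static final class FakeCas {
        String documentText;
    }

    /** Fixed-size pool (assumption, not UIMA code): acquire() blocks when all CASes are out. */
    static final class FakeCasPool {
        private final BlockingQueue<FakeCas> free;
        private final AtomicInteger inFlight = new AtomicInteger();
        private final AtomicInteger maxInFlight = new AtomicInteger();

        FakeCasPool(int size) {
            free = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                free.add(new FakeCas());
            }
        }

        FakeCas acquire() throws InterruptedException {
            FakeCas cas = free.take(); // blocks until a CAS has been released
            maxInFlight.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            return cas;
        }

        void release(FakeCas cas) {
            cas.documentText = null;   // reset before reuse
            inFlight.decrementAndGet();
            free.add(cas);
        }

        int maxInFlight() {
            return maxInFlight.get();
        }
    }

    /** Hypothetical multiplier: one container in, one pooled CAS out per entry. */
    static final class FakeMultiplier {
        private final Iterator<String> entries;
        private final FakeCasPool pool;

        FakeMultiplier(Iterator<String> entries, FakeCasPool pool) {
            this.entries = entries;
            this.pool = pool;
        }

        boolean hasNext() {
            return entries.hasNext();
        }

        FakeCas next() throws InterruptedException {
            FakeCas cas = pool.acquire(); // waits here when the pool is exhausted
            cas.documentText = entries.next();
            return cas;
        }
    }

    /** Returns {documents processed, highest number of CASes checked out at once}. */
    static int[] run(int poolSize, int documentCount) throws InterruptedException {
        List<String> archive = new ArrayList<>();
        for (int i = 0; i < documentCount; i++) {
            archive.add("document " + i);
        }
        FakeCasPool pool = new FakeCasPool(poolSize);
        FakeMultiplier multiplier = new FakeMultiplier(archive.iterator(), pool);
        ExecutorService pipeline = Executors.newFixedThreadPool(2);
        AtomicInteger processed = new AtomicInteger();

        while (multiplier.hasNext()) {
            FakeCas cas = multiplier.next();
            pipeline.submit(() -> {
                try {
                    processed.incrementAndGet(); // stand-in for real analysis
                } finally {
                    pool.release(cas);           // always hand the CAS back
                }
            });
        }
        pipeline.shutdown();
        pipeline.awaitTermination(1, TimeUnit.MINUTES);
        return new int[] { processed.get(), pool.maxInFlight() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = run(3, 2000);
        System.out.println("processed=" + result[0] + " maxInFlight=" + result[1]);
    }
}
```

With a pool of 3 and a 2000-entry archive, all entries are processed while at most 3 holders exist. The sizing concern raised above shows up directly in this sketch: if a consumer never calls `release`, `acquire` blocks forever, so the pool has to be large enough for every stage that holds a CAS at the same time.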
