OK, it sounds like the suggested improvements to the CAS heap design would still preserve the high-water mark mechanism for identifying new FSs as those added after the mark. Is this correct? If so, implementation can start. Should a branch be created for this work?
The other main concern discussed was the overhead for core UIMA use without remoting. There should be no measurable overhead, since there will be one int compare on calls to set a feature value and to add to an index, and no impact on accessing FS values. If the overhead turns out to be an issue, we could still work around it with a separate class implementing CAS with journaling, or a wrapper class as suggested before.

Bhavani

On Thu, Jul 10, 2008 at 12:57 PM, Marshall Schor <[EMAIL PROTECTED]> wrote:

> Thilo Goetz wrote:
>
>> Eddie Epstein wrote:
>>
>>> No opinions, but a few observations:
>>>
>>> 1M is way too big for some applications that need very small, but very
>>> many CASes.
>>>
>>
>> I agree.
>>
> How about treating the 1st 1 mb segment with the same approach as the heap
> is now - providing the ability to start small, and expanding it (by
> reallocating and copying) until it gets to 1 mb?
>
> -Marshall
>
>>
>>> Large arrays may be bigger than whatever segment size is chosen, making
>>> segment management a bit more complicated.
>>>
>>> There will be holes at the top of every segment when the next FS doesn't
>>> fit.
>>>
>>
>> Not necessarily. Why couldn't you spread FSs and arrays
>> across segments?
>>
>>> Eddie
>>>
>>> On Wed, Jul 9, 2008 at 2:37 PM, Marshall Schor <[EMAIL PROTECTED]> wrote:
>>>
>>>> Here's a suggestion suggested by previous posts, and common hardware
>>>> design for segmented memory.
>>>>
>>>> Take the int values that represent feature structure (fs) references.
>>>> Today, these are positive numbers from 1 (I think) to around 4 billion.
>>>> These values are used directly as an index into the heap.
>>>>
>>>> Change this to split the bits in these int values into two parts, let's
>>>> call them upper and lower. For example
>>>> xxxx xxxx xxxx yyyy yyyy yyyy yyyy yyyy
>>>>
>>>> where the x's are the upper bits (each x represents a bit), and
>>>> the y's the lower bits. The y's in this case can represent numbers up
>>>> to 1 million (approx), and the x's represent 4096 values.
>>>>
>>>> Then allocate the heap using multiple 1 meg entry tables, and store each
>>>> one in the 4096 entry reference array. The heap reference would be some
>>>> bit-wise shifting and indexed lookup in addition to what we have now and
>>>> would probably be very fast, and could be optimized for the xxx=0 case
>>>> to be even faster.
>>>>
>>>> This breaks heaps of over 1 meg into separate parts, which would make
>>>> them more manageable, I think, and keeps the high-water mark method
>>>> viable, too.
>>>>
>>>> Opinions?
>>>>
>>>> -Marshall
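
To make the upper/lower split and the high-water mark check concrete, here is a minimal Java sketch of the idea as I read it from the thread. It is not the UIMA implementation: the class name SegmentedHeapSketch and all of its methods are hypothetical, ref 0 is assumed to be reserved as the null reference, segments are allocated lazily at full size (no start-small growth for the first segment), and the ref is treated as an unsigned 32-bit value so that all 4096 segments are addressable.

```java
/** Hypothetical sketch of a segmented int heap; not the actual UIMA CAS heap. */
final class SegmentedHeapSketch {
    static final int LOWER_BITS = 20;                        // the y bits
    static final int SEGMENT_SIZE = 1 << LOWER_BITS;         // 1,048,576 cells per segment
    static final int LOWER_MASK = SEGMENT_SIZE - 1;
    static final int MAX_SEGMENTS = 1 << (32 - LOWER_BITS);  // 4096 segments (the x bits)

    // The 4096-entry reference array; each entry is one segment, created on first write.
    private final int[][] segments = new int[MAX_SEGMENTS][];
    private int nextFree = 1;       // assumption: ref 0 is reserved as the null reference
    private int highWaterMark = 1;

    /** Upper 12 bits of the ref; unsigned shift so refs above 2^31 still work. */
    static int segmentOf(int ref) { return ref >>> LOWER_BITS; }

    /** Lower 20 bits of the ref. */
    static int offsetOf(int ref) { return ref & LOWER_MASK; }

    /** Reserves cells consecutive addresses; an FS may straddle a segment boundary. */
    int allocate(int cells) {
        int ref = nextFree;
        nextFree += cells;          // no overflow check in this sketch
        return ref;
    }

    /** Reads one cell; a segment that was never written reads as 0. */
    int get(int ref) {
        int[] seg = segments[segmentOf(ref)];
        return seg == null ? 0 : seg[offsetOf(ref)];
    }

    /** Writes one cell, allocating the segment on first use. */
    void set(int ref, int value) {
        int s = segmentOf(ref);
        if (segments[s] == null) {
            segments[s] = new int[SEGMENT_SIZE];
        }
        segments[s][offsetOf(ref)] = value;
    }

    /** Records the current top of the heap as the high-water mark. */
    void mark() { highWaterMark = nextFree; }

    /** The single int compare: an FS is new if it was allocated at or after the mark. */
    boolean isNew(int ref) { return Integer.compareUnsigned(ref, highWaterMark) >= 0; }

    public static void main(String[] args) {
        SegmentedHeapSketch heap = new SegmentedHeapSketch();
        int before = heap.allocate(3);
        heap.mark();
        int after = heap.allocate(2);
        heap.set(after, 42);
        System.out.println("before mark is new: " + heap.isNew(before));
        System.out.println("after mark is new:  " + heap.isNew(after));
        System.out.println("value at new FS:    " + heap.get(after));
    }
}
```

Because every cell address is decoded separately, an FS or array that crosses a segment boundary needs no special handling in this sketch, which is one way to avoid the holes at the top of each segment. The cost is a shift, a mask, and one extra array lookup per access.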
