Bhavani Iyer wrote:
OK, it sounds like the suggested improvements to the CAS heap design
would still preserve the high-water mark mechanism for identifying new
FSs as those added after the mark.  Is this correct?

No.  My conclusion was that we'll create a CAS API that returns
a marker object which may later be used to query the
CAS about certain FSs and when they were created.  This object
will be opaque to CAS users and transient in nature.  Please feel
free to make a suggestion for such an API to make sure your
requirements are covered.
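
To make the idea concrete, here is a minimal sketch of what such an
opaque marker could look like.  All names (Marker, ToyCas, createMarker,
isNew) are hypothetical and are not an existing UIMA API.  The sketch
happens to implement the marker with a heap position, but callers only
see the interface, so the implementation could change without affecting
them.

```java
// Hypothetical sketch only; none of these types exist in UIMA.
interface Marker {
    // True if the FS at this heap reference was created after the marker.
    boolean isNew(int fsRef);
}

final class ToyCas {
    // Next unused heap position; position 0 is reserved as the "null" reference.
    private int nextFree = 1;

    // Allocate an FS occupying 'size' heap cells and return its reference.
    int createFs(int size) {
        int ref = nextFree;
        nextFree += size;
        return ref;
    }

    // Return an opaque marker that remembers the current allocation point.
    Marker createMarker() {
        final int mark = nextFree;
        return fsRef -> fsRef >= mark;
    }
}
```

A caller would create a marker before sending the CAS off, and
afterwards ask the marker whether a given FS is new, without ever
seeing the heap position behind it.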

> If so, implementation can start. Should there be a branch
created for this work?

I don't see why we need a branch for this.


The other main concern discussed was the overhead for core UIMA use
without remoting. There should be no measurable overhead, since there
will be one int compare on calls to set a feature value and to add to
an index, and no impact on accessing FS values.

Please explain your design.  I expect that there'll be a
global setting, so at most a boolean is checked?


If the overhead turns out to be an issue, we could still work around it
with a separate class implementing CAS with journaling, or a wrapper
class as suggested before.

Bhavani

On Thu, Jul 10, 2008 at 12:57 PM, Marshall Schor <[EMAIL PROTECTED]> wrote:

Thilo Goetz wrote:

Eddie Epstein wrote:

No opinions, but a few observations:

1M is way too big for some applications that need very small, but very
many CASes.

I agree.

How about treating the 1st 1 mb segment with the same approach as the heap
is now - providing the ability to start small, and expanding it (by
reallocating and copying) until it gets to 1 mb?
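
A minimal sketch of that grow-by-copying idea, assuming a doubling
policy and a cap of 1M entries.  The class name GrowableSegment and the
doubling policy are illustrative assumptions, not the existing heap
code.

```java
// Hypothetical sketch only; not the existing UIMA heap implementation.
final class GrowableSegment {
    // Full segment size: 2^20 = 1,048,576 entries.
    static final int MAX_SIZE = 1 << 20;

    private int[] data;

    GrowableSegment(int initialSize) {
        // Clamp the starting size to the range 1..MAX_SIZE.
        data = new int[Math.min(Math.max(initialSize, 1), MAX_SIZE)];
    }

    int capacity() {
        return data.length;
    }

    // Cells beyond the current capacity have never been written and read as 0.
    int get(int offset) {
        return offset < data.length ? data[offset] : 0;
    }

    void set(int offset, int value) {
        if (offset < 0 || offset >= MAX_SIZE) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        if (offset >= data.length) {
            // Double until the offset fits, but never exceed the full segment size.
            int newSize = data.length;
            while (newSize <= offset) {
                newSize *= 2;
            }
            data = java.util.Arrays.copyOf(data, Math.min(newSize, MAX_SIZE));
        }
        data[offset] = value;
    }
}
```

An application with many tiny CASes would then pay only for the small
initial array, and only a CAS that actually fills its first segment
would grow to the full size.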

-Marshall


Large arrays may be bigger than whatever segment size is chosen, making
segment management a bit more complicated.

There will be holes at the top of every segment when the next FS doesn't
fit.

Not necessarily.  Why couldn't you spread FSs and arrays
across segments?


Eddie

On Wed, Jul 9, 2008 at 2:37 PM, Marshall Schor <[EMAIL PROTECTED]> wrote:

Here's a suggestion prompted by previous posts, and by common hardware
designs for segmented memory.

Take the int values that represent feature structure (fs) references.
 Today, these are positive numbers from 1 (I think) to around 4 billion.
 These values are used directly as an index into the heap.

Change this to split the bits in these int values into two parts, let's
call them upper and lower.  For example
xxxx xxxx xxxx yyyy yyyy yyyy yyyy yyyy

where the x's are the upper bits and the y's the lower bits (each
letter represents one bit).  The 20 y's in this case can represent
numbers up to 1 million (approx), and the 12 x's represent 4096 values.

Then allocate the heap using multiple 1 meg entry tables, and store each
one in the 4096 entry reference array.  The heap reference would be some
bit-wise shifting and indexed lookup in addition to what we have now and
would probably be very fast, and could be optimized for the xxx=0 case
to be even faster.
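
The addressing scheme above can be sketched as follows.  This is only
an illustration under the 12/20 bit split described here; the class
name SegmentedHeap, its methods, and the lazy allocation of segments
are assumptions, not existing UIMA code.

```java
// Hypothetical sketch only; not the existing UIMA heap implementation.
final class SegmentedHeap {
    static final int LOWER_BITS = 20;                        // the y bits
    static final int SEGMENT_SIZE = 1 << LOWER_BITS;         // 1,048,576 entries per segment
    static final int LOWER_MASK = SEGMENT_SIZE - 1;
    static final int MAX_SEGMENTS = 1 << (32 - LOWER_BITS);  // 4096 segments (the x bits)

    // The 4096-entry reference array; each segment is allocated on first write.
    private final int[][] segments = new int[MAX_SEGMENTS][];

    // Upper 12 bits select the segment (unsigned shift, so all 32 bits are usable).
    static int segmentOf(int ref) {
        return ref >>> LOWER_BITS;
    }

    // Lower 20 bits select the cell within the segment.
    static int offsetOf(int ref) {
        return ref & LOWER_MASK;
    }

    // Cells in a segment that was never written read as 0.
    int get(int ref) {
        int[] seg = segments[segmentOf(ref)];
        return seg == null ? 0 : seg[offsetOf(ref)];
    }

    void set(int ref, int value) {
        int s = segmentOf(ref);
        if (segments[s] == null) {
            segments[s] = new int[SEGMENT_SIZE];
        }
        segments[s][offsetOf(ref)] = value;
    }
}
```

Each access costs one shift, one mask and one extra array lookup on
top of the plain index used today.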

This breaks heaps of over 1 meg into separate parts, which would make
them more manageable, I think, and keeps the high-water mark method
viable, too.

Opinions?

-Marshall





