Re: pseudo-specs for a String class: char *buf

2008-09-06 Thread Henrik Nordstrom
On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:

> I didn't really think of different buffer types. Do you have in mind
> any scenario where it would be useful?

One example is if KBuf gets implemented using a mallocator that may
reallocate the memory area to reduce fragmentation.
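
A minimal sketch of that scenario, assuming a hypothetical Buf whose storage can be moved. None of these names come from the KBuf code, and std::vector only stands in for a relocating allocator. A view that stores an offset rebuilds its pointer on demand, so it keeps working after the move, whereas a cached char* would dangle:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a buffer whose storage may be moved.
struct Buf {
    std::vector<char> mem;

    explicit Buf(const char *s) : mem(s, s + std::strlen(s)) {}

    // Simulates the allocator relocating the memory area: the contents are
    // copied into fresh storage, so the old addresses become invalid.
    void relocate() {
        std::vector<char> moved(mem.begin(), mem.end());
        mem.swap(moved);
    }
};

// A string view that remembers its position inside the buffer as an offset.
struct OffsetView {
    const Buf *buf;
    std::size_t offset;
    std::size_t length;

    // The raw pointer is rebuilt on demand, so it always follows the buffer.
    const char *data() const {
        assert(offset + length <= buf->mem.size());
        return buf->mem.data() + offset;
    }
};

// Returns true if the view still reads the same bytes after relocation.
bool survivesRelocation() {
    Buf b("hello world");
    OffsetView v{&b, 6, 5};   // refers to "world"
    b.relocate();             // a cached char* into b.mem would dangle here
    return std::memcmp(v.data(), "world", 5) == 0;
}
```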

> On the other hand, char* are significantly more efficient for common
> operations, consistent with the design goals.

Agreed.

Regards
Henrik



Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Alex Rousskov
On Thu, 2008-09-04 at 06:47 +0200, Kinkie wrote:
> On Thu, Sep 4, 2008 at 12:31 AM, Alex Rousskov
> <[EMAIL PROTECTED]> wrote:
> > On Thu, 2008-09-04 at 00:12 +0200, Kinkie wrote:
> >
> >> > I do not think an offset would be significantly less efficient in this
> >> > context. I bet 90+% of operations that require raw data access are far
> >> > more expensive than adding an offset to a pointer.
> >>
> >> The most common one is a NULL check, which is hard to express using
> >> (offset/length).
> >
> > As you know, I do not know what you mean by a NULL check (buffers or
> > strings are not pointers). If you mean an isEmpty() check, then it is
> > implemented as (!length) as the offset is irrelevant for an empty
> > string.
> 
> There are uses for declaring an object as undefined, which is a
> different thing than a zero-length string.

"NULL" and "undefined" are different things for many developers. If you
want to propose a special undefined String state, you can add an
isDefined or isSet method. 

An isDefined check is implemented as a test of bigBuffer (the string is
undefined when !bigBuffer), where bigBuffer is the reference counting
pointer to the primary buffer. Thus, the argument that isNULL or
isDefined requires storing a raw string pointer for efficiency reasons
is invalid. Neither check needs access to string contents.
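
A minimal sketch of both checks, only to illustrate the efficiency argument. The class layout is an assumption: std::shared_ptr stands in for the reference counting pointer, and the member names other than bigBuffer are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical sketch: std::shared_ptr stands in for the reference counting
// pointer to the primary buffer.
class String {
public:
    String() : offset_(0), length_(0) {}   // undefined: no buffer attached
    explicit String(const std::string &s)
        : bigBuffer_(std::make_shared<std::string>(s)),
          offset_(0), length_(s.size()) {}

    // Neither check reads the string contents or needs a raw char pointer.
    bool isEmpty() const { return !length_; }
    bool isDefined() const { return static_cast<bool>(bigBuffer_); }

    // Raw access costs one addition on top of the buffer's base address.
    const char *rawData() const {
        assert(isDefined());
        return bigBuffer_->data() + offset_;
    }

private:
    std::shared_ptr<std::string> bigBuffer_;
    std::size_t offset_;
    std::size_t length_;
};
```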

FWIW, I would not recommend adding isDefined though, because special
states are a primary cause of bugs (developers always forget about
them). Dereferencing NULL pointers in C is a well-known example of that.

> Take the tokenizer for example. A null (we may call it invalid,
> undefined, no-store) KBuf is a very convenient way to signal
> "end-of-stream", as opposed to "a token of zero length".

It is also a convenient way to send null, invalid, undefined, etc.
strings to code that does not expect them.

> Without this, an exception will have to be raised to signal
> the end-of-stream condition.

No exceptions are necessary or desired. Here is a simple tokenizer
API/usage sketch:

    for (Tokenizer tzer(string, delimiter); !tzer.atEnd(); ++tzer) {
        String token = *tzer;
        ...
    }

We can also look at std::iterator but that interface may be slightly
more complex than we need. We can use it if we want to be compatible
with std algorithms, but we probably do not care about those at this
point.
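
One way such a tokenizer could look, as a sketch under assumptions: std::string replaces the String class, the delimiter is a single char, and the member names are hypothetical. The end of the stream is reported by atEnd(), so a zero-length token stays an ordinary value and no null state or exception is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical external tokenizer: it reads the input but never modifies it.
class Tokenizer {
public:
    Tokenizer(const std::string &input, char delimiter)
        : input_(input), delimiter_(delimiter), start_(0), end_(0),
          done_(false) {
        findEnd();
    }

    bool atEnd() const { return done_; }

    // Current token; may legitimately be a zero-length string.
    std::string operator*() const {
        assert(!done_);
        return input_.substr(start_, end_ - start_);
    }

    Tokenizer &operator++() {
        assert(!done_);
        if (end_ == input_.size()) {
            done_ = true;          // the last token has been consumed
        } else {
            start_ = end_ + 1;     // skip the delimiter
            findEnd();
        }
        return *this;
    }

private:
    // Sets end_ to the next delimiter, or to the end of the input.
    void findEnd() {
        const std::size_t pos = input_.find(delimiter_, start_);
        end_ = (pos == std::string::npos) ? input_.size() : pos;
    }

    const std::string &input_;     // the caller keeps the input alive
    char delimiter_;
    std::size_t start_, end_;
    bool done_;
};

// Collects all tokens using the loop shape from the sketch above.
std::vector<std::string> split(const std::string &s, char delimiter) {
    std::vector<std::string> tokens;
    for (Tokenizer tzer(s, delimiter); !tzer.atEnd(); ++tzer)
        tokens.push_back(*tzer);
    return tokens;
}
```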

HTH,

Alex.




Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Kinkie
On Thu, Sep 4, 2008 at 12:31 AM, Alex Rousskov
<[EMAIL PROTECTED]> wrote:
> On Thu, 2008-09-04 at 00:12 +0200, Kinkie wrote:
>
>> > I do not think an offset would be significantly less efficient in this
>> > context. I bet 90+% of operations that require raw data access are far
>> > more expensive than adding an offset to a pointer.
>>
>> The most common one is a NULL check, which is hard to express using
>> (offset/length).
>
> As you know, I do not know what you mean by a NULL check (buffers or
> strings are not pointers). If you mean an isEmpty() check, then it is
> implemented as (!length) as the offset is irrelevant for an empty
> string.

There are uses for declaring an object as undefined, which is a
different thing than a zero-length string.
Take the tokenizer for example. A null (we may call it invalid,
undefined, no-store) KBuf is a very convenient way to signal
"end-of-stream", as opposed to "a token of zero length". Without this,
an exception will have to be raised to signal the end-of-stream
condition.

> If you mean the old MemBuf::isNull() check, then it will disappear when
> a proper buffer class is available (i.e., it will not be common at all).
> That check comes from C code and is irrelevant in this context.

Maybe you're right and I was too C-minded when designing the class innards.

>> Further extra work would be the need of more temporary storage to
>> rebuild the char* out of (memhandle->mem + offset).
>> Extremely small details, but I expect they'll be very common
>> operations (I bumped the size of the stat-counters to 64bits).
>
> Again, I believe that common operations that require raw data access are
> far more expensive than adding an offset to a pointer.
>
>> In the meantime, I ask everyone who can spare 5 minutes to just grab
>> the code off Launchpad and see it live ("make" will compile and launch
>> a builtin testsuite/demo, "make dox" will generate the class reference
>> - the code is extensively documented).
>
> I trust you that the code compiles and runs, but so does the current
> ugly code, so I am not sure what is the point of "seeing it live",
> especially if we disagree on basic design principles. Do you expect
> something specific from that "5 minute" exercise?

A feel for the presence (or hopefully absence) of obvious warts.

Thanks.

-- 
 /kinkie


Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Amos Jeffries
> On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:
>> On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov
>> <[EMAIL PROTECTED]> wrote:
>> >
>> > I looked at your StringNg wiki page and noticed that your string has
>> > a "char *buf" pointer into the memory buffer (in addition to the buffer
>> > pointer itself). I think it would be better to use an offset instead of
>> > the pointer into internal buffer area:
>>
>> Yes, I had a discussion with Adrian about the same issue earlier on IRC.
>> You make excellent points - as Adrian did :)
>> Here's my take
>>
>> > - cleaner design: no peeking into other object's privates
>> Yes. At the same time the KBuf::Buf class (the "other object") is a
>> private member class of the KBuf class;
>> it's actually little more than a glorified struct, and shouldn't be
>> thought of as a first-class citizen on its own.
>
> Private or not, it is still another object.
>
> And, FWIW, I doubt the memory buffer class will remain inside the string
> class.
>
>> > - easier to change memory buffer internals
>> If you mean "change the buffer contents", that's an operation which
>> should be quite rare.
>> If you mean "change the code" that should be even rarer.
>
> I meant "change the code". I do expect those changes in the foreseeable
> future.
>
>> > - easier to support several buffer types with different internals
>> I didn't really think of different buffer types. Do you have in mind
>> any scenario where it would be useful?
>
> Yes, I do (e.g., small versus large, thread-safe versus not, and
> contiguous versus chunked).
>
> In fact, you kind of documented different buffer implementations
> yourself: "small Bufs (<8Kb) should be managed by MemPools. - Bufs
> bigger than 8Kb should be allocated in sizes compatible with the system
> page size"
>
>> > - easier to support re-allocation of buffer memory
>> > - easier to provide a thread-safe implementation.
>>
>> On the other hand, char* are significantly more efficient for common
>> operations, consistent with the design goals.
>
> I do not think an offset would be significantly less efficient in this
> context. I bet 90+% of operations that require raw data access are far
> more expensive than adding an offset to a pointer.
>
>> I'm not saying that I won't change them, I'd just like to be shown
>> scenarios where it makes a difference.
>
> I believe I provided more than enough reasons and you agreed with at
> least some of them. You have provided one so far ("significantly more
> efficient for common operations"). I think the burden of proof should be
> on you in this case.
>
>> On an unrelated issue, since it was of interest to some of us, here's
>> a sample of the caller code for tokenization functions (actual live
>> code):
>>
>> KBuf s1;
>> cout << "tokenization: \n";
>> {
>>     s1="The quick brown fox jumped over the lazy dog";
>>     char *needle=" ";
>>     KBuf cs1(needle);
>>     while (!s1.isNull()) {
>>         cout << "token: " << s1.nextToken(cs1) << endl;
>>     }
>> }
>> cout << endl;
>
> FWIW, I still think that tokenization should be external to the buffer
> or string and should not modify them. Please see my earlier posts for
> details.

Kinkie, while I like the single-object API design, I think you could get
around all these arguments and confusion by adding a sub-class of KBuf
called KBufTokeniser, which just provides the nextToken API on top of the
String API.


Alex, the underlying buffer is not altered, only the position that the s1
offset points at. Kinkie is just not very good at describing that code
loop yet :-(

From what he mentioned on IRC last night:

Making s1 a duplicate reference to another KBuf (i.e., the actual input
buffer) should show that the base KBuf is unchanged: parsing with
nextToken() only splits off a child sub-string and advances the s1
start offset one token down the string.
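
A minimal sketch of that behaviour, under assumptions: the View class and its members are hypothetical and KBuf's real interface may differ. Each view holds a shared reference to immutable storage plus an offset and a length, so nextToken() only adjusts the caller's own offset and hands back a child view.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical view type; KBuf's real interface may differ.
class View {
public:
    View() : offset_(0), length_(0) {}   // null view
    explicit View(const std::string &s)
        : store_(std::make_shared<const std::string>(s)),
          offset_(0), length_(s.size()) {}

    bool isNull() const { return !store_; }

    std::string str() const {
        return store_ ? store_->substr(offset_, length_) : std::string();
    }

    // Returns the next token as a child view of the same storage and moves
    // this view's start past it. The shared bytes are never written.
    View nextToken(char delimiter) {
        assert(!isNull());
        const std::size_t end = offset_ + length_;
        std::size_t pos = store_->find(delimiter, offset_);
        if (pos == std::string::npos || pos >= end)
            pos = end;
        View token(*this);
        token.length_ = pos - offset_;
        if (pos == end) {
            store_.reset();              // input exhausted: become null
            offset_ = length_ = 0;
        } else {
            length_ = end - (pos + 1);   // skip the delimiter
            offset_ = pos + 1;
        }
        return token;
    }

private:
    std::shared_ptr<const std::string> store_;
    std::size_t offset_, length_;
};
```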

I'm in favor: it can be tuned for very efficient parsing, and any
inefficient usage of it can be fixed easily.

Amos




Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Alex Rousskov
On Thu, 2008-09-04 at 00:12 +0200, Kinkie wrote:

> > I do not think an offset would be significantly less efficient in this
> > context. I bet 90+% of operations that require raw data access are far
> > more expensive than adding an offset to a pointer.
> 
> The most common one is a NULL check, which is hard to express using
> (offset/length).

As you know, I do not know what you mean by a NULL check (buffers or
strings are not pointers). If you mean an isEmpty() check, then it is
implemented as (!length) as the offset is irrelevant for an empty
string. 

If you mean the old MemBuf::isNull() check, then it will disappear when
a proper buffer class is available (i.e., it will not be common at all).
That check comes from C code and is irrelevant in this context.

> Further extra work would be the need of more temporary storage to
> rebuild the char* out of (memhandle->mem + offset).
> Extremely small details, but I expect they'll be very common
> operations (I bumped the size of the stat-counters to 64bits).

Again, I believe that common operations that require raw data access are
far more expensive than adding an offset to a pointer.

> In the meantime, I ask everyone who can spare 5 minutes to just grab
> the code off Launchpad and see it live ("make" will compile and launch
> a builtin testsuite/demo, "make dox" will generate the class reference
> - the code is extensively documented).

I trust you that the code compiles and runs, but so does the current
ugly code, so I am not sure what is the point of "seeing it live",
especially if we disagree on basic design principles. Do you expect
something specific from that "5 minute" exercise?

Thank you,

Alex.




Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Kinkie
On Wed, Sep 3, 2008 at 5:49 PM, Alex Rousskov
<[EMAIL PROTECTED]> wrote:
> On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:
>> On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov
>> <[EMAIL PROTECTED]> wrote:
>> >
>> > I looked at your StringNg wiki page and noticed that your string has
>> > a "char *buf" pointer into the memory buffer (in addition to the buffer
>> > pointer itself). I think it would be better to use an offset instead of
>> > the pointer into internal buffer area:
>>
>> Yes, I had a discussion with Adrian about the same issue earlier on IRC.
>> You make excellent points - as Adrian did :)
>> Here's my take
>>
>> > - cleaner design: no peeking into other object's privates
>> Yes. At the same time the KBuf::Buf class (the "other object") is a
>> private member class of the KBuf class;
>> it's actually little more than a glorified struct, and shouldn't be
>> thought of as a first-class citizen on its own.
>
> Private or not, it is still another object.
>
> And, FWIW, I doubt the memory buffer class will remain inside the string
> class.

I'm not planning to take it out, and if no one else does...

>> > - easier to change memory buffer internals
>> If you mean "change the buffer contents", that's an operation which
>> should be quite rare.
>> If you mean "change the code" that should be even rarer.
>
> I meant "change the code". I do expect those changes in the foreseeable
> future.
>
>> > - easier to support several buffer types with different internals
>> I didn't really think of different buffer types. Do you have in mind
>> any scenario where it would be useful?
>
> Yes, I do (e.g., small versus large, thread-safe versus not, and
> contiguous versus chunked).

Chunks are out of scope; they are better dealt with by a KBufList class.
Honestly, the only argument I buy 100% is easier thread-safety.

> In fact, you kind of documented different buffer implementations
> yourself: "small Bufs (<8Kb) should be managed by MemPools. - Bufs
> bigger than 8Kb should be allocated in sizes compatible with the system
> page size"

The "magic" is in the allocation strategy function, and it applies not to
the Buf itself but to its underlying storage.
We want to allocate a bit more than strictly needed, to be able to grow,
but not too much.
Quoting:

void init(size_t size)
{
    pagesize=sysconf(_SC_PAGESIZE); //FIXME: make this autoconf-based
    size_t actualsize=(size*12)/10; //FIXME: make self-tuning using nreallocs
    // arbitrary allocation algorithm
    if (actualsize <= 64)
        actualsize=64+malloc_overhead;
    else if (actualsize <= 64*pagesize)
        actualsize=nearestPowerOf2(actualsize);
    else
        actualsize+=actualsize%pagesize; //increase to closest pagesize
    //shrink: we WANT to fit in a page
    actualsize-=malloc_overhead;
    mem=new char[actualsize];
    bufsize=actualsize;
    bufused=0;
    refs=1;
    debugs("Buf@" <<(void *)this <<"::init(). req:"<< [...]

>> > - easier to support re-allocation of buffer memory
>> > - easier to provide a thread-safe implementation.
>>
>> On the other hand, char* are significantly more efficient for common
>> operations, consistent with the design goals.
>
> I do not think an offset would be significantly less efficient in this
> context. I bet 90+% of operations that require raw data access are far
> more expensive than adding an offset to a pointer.

The most common one is a NULL check, which is hard to express using
(offset/length).
Further extra work would be the need of more temporary storage to
rebuild the char* out of (memhandle->mem + offset).
Extremely small details, but I expect they'll be very common
operations (I bumped the size of the stat-counters to 64bits).

>> I'm not saying that I won't change them, I'd just like to be shown
>> scenarios where it makes a difference.
>
> I believe I provided more than enough reasons and you agreed with at
> least some of them. You have provided one so far ("significantly more
> efficient for common operations"). I think the burden of proof should be
> on you in this case.

What I can (and will) do is try and see if and how much these changes
would complicate the code.
In the meantime, I ask everyone who can spare 5 minutes to just grab
the code off Launchpad and see it live ("make" will compile and launch
a builtin testsuite/demo, "make dox" will generate the class reference
- the code is extensively documented).


-- 
 /kinkie


Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Alex Rousskov
On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:
> On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov
> <[EMAIL PROTECTED]> wrote:
> >
> > I looked at your StringNg wiki page and noticed that your string has
> > a "char *buf" pointer into the memory buffer (in addition to the buffer
> > pointer itself). I think it would be better to use an offset instead of
> > the pointer into internal buffer area:
> 
> Yes, I had a discussion with Adrian about the same issue earlier on IRC.
> You make excellent points - as Adrian did :)
> Here's my take
> 
> > - cleaner design: no peeking into other object's privates
> Yes. At the same time the KBuf::Buf class (the "other object") is a
> private member class of the KBuf class;
> it's actually little more than a glorified struct, and shouldn't be
> thought of as a first-class citizen on its own.

Private or not, it is still another object. 

And, FWIW, I doubt the memory buffer class will remain inside the string
class.

> > - easier to change memory buffer internals
> If you mean "change the buffer contents", that's an operation which
> should be quite rare.
> If you mean "change the code" that should be even rarer.

I meant "change the code". I do expect those changes in the foreseeable
future.

> > - easier to support several buffer types with different internals
> I didn't really think of different buffer types. Do you have in mind
> any scenario where it would be useful?

Yes, I do (e.g., small versus large, thread-safe versus not, and
contiguous versus chunked). 

In fact, you kind of documented different buffer implementations
yourself: "small Bufs (<8Kb) should be managed by MemPools. - Bufs
bigger than 8Kb should be allocated in sizes compatible with the system
page size"
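
A minimal sketch of how such a size-based split could be expressed, under assumptions: only the 8 KB threshold comes from the quoted text, while the type names, the treatment of a request of exactly 8 KB, and passing the page size as a parameter (instead of querying the system) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy constant; only the 8 KB threshold comes from the
// quoted documentation.
const std::size_t kPoolLimit = 8 * 1024;

enum AllocClass { PooledAlloc, PageAlloc };

struct AllocPlan {
    AllocClass kind;
    std::size_t size;
};

// Rounds n up to the next multiple of unit (unit must be non-zero).
std::size_t roundUp(std::size_t n, std::size_t unit) {
    assert(unit > 0);
    return ((n + unit - 1) / unit) * unit;
}

// Small requests go to a pool; larger ones are rounded up to whole pages.
// A request of exactly 8 KB is treated as large here (an assumption).
AllocPlan planAllocation(std::size_t wanted, std::size_t pageSize) {
    AllocPlan plan;
    if (wanted < kPoolLimit) {
        plan.kind = PooledAlloc;
        plan.size = wanted;
    } else {
        plan.kind = PageAlloc;
        plan.size = roundUp(wanted, pageSize);
    }
    return plan;
}
```

A string that stores an offset does not need to know which of the two branches produced its buffer.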

> > - easier to support re-allocation of buffer memory
> > - easier to provide a thread-safe implementation.
> 
> On the other hand, char* are significantly more efficient for common
> operations, consistent with the design goals.

I do not think an offset would be significantly less efficient in this
context. I bet 90+% of operations that require raw data access are far
more expensive than adding an offset to a pointer.

> I'm not saying that I won't change them, I'd just like to be shown
> scenarios where it makes a difference.

I believe I provided more than enough reasons and you agreed with at
least some of them. You have provided one so far ("significantly more
efficient for common operations"). I think the burden of proof should be
on you in this case.

> On an unrelated issue, since it was of interest to some of us, here's
> a sample of the caller code for tokenization functions (actual live
> code):
> 
> KBuf s1;
> cout << "tokenization: \n";
> {
>     s1="The quick brown fox jumped over the lazy dog";
>     char *needle=" ";
>     KBuf cs1(needle);
>     while (!s1.isNull()) {
>         cout << "token: " << s1.nextToken(cs1) << endl;
>     }
> }
> cout << endl;

FWIW, I still think that tokenization should be external to the buffer
or string and should not modify them. Please see my earlier posts for
details.

Thank you,

Alex.




Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Kinkie
On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov
<[EMAIL PROTECTED]> wrote:
> Hi Kinkie,
>
> I looked at your StringNg wiki page and noticed that your string has
> a "char *buf" pointer into the memory buffer (in addition to the buffer
> pointer itself). I think it would be better to use an offset instead of
> the pointer into internal buffer area:

Yes, I had a discussion with Adrian about the same issue earlier on IRC.
You make excellent points - as Adrian did :)
Here's my take:

> - cleaner design: no peeking into other object's privates
Yes. At the same time the KBuf::Buf class (the "other object") is a
private member class of the KBuf class;
it's actually little more than a glorified struct, and shouldn't be
thought of as a first-class citizen on its own.

> - easier to change memory buffer internals
If you mean "change the buffer contents", that's an operation which
should be quite rare.
If you mean "change the code" that should be even rarer.

> - easier to support several buffer types with different internals
I didn't really think of different buffer types. Do you have in mind
any scenario where it would be useful?

> - easier to support re-allocation of buffer memory
> - easier to provide a thread-safe implementation.

On the other hand, char* are significantly more efficient for common
operations, consistent with the design goals.

I'm not saying that I won't change them, I'd just like to be shown
scenarios where it makes a difference.

On an unrelated issue, since it was of interest to some of us, here's
a sample of the caller code for tokenization functions (actual live
code):

KBuf s1;
cout << "tokenization: \n";
{
    s1="The quick brown fox jumped over the lazy dog";
    char *needle=" ";
    KBuf cs1(needle);
    while (!s1.isNull()) {
        cout << "token: " << s1.nextToken(cs1) << endl;
    }
}
cout << endl;

And here's the output:
tokenization:
token: The
token: quick
token: brown
token: fox
token: jumped
token: over
token: the
token: lazy
token: dog




-- 
 /kinkie


Re: pseudo-specs for a String class: char *buf

2008-09-03 Thread Alex Rousskov
Hi Kinkie,

I looked at your StringNg wiki page and noticed that your string has
a "char *buf" pointer into the memory buffer (in addition to the buffer
pointer itself). I think it would be better to use an offset instead of
the pointer into internal buffer area:

- cleaner design: no peeking into other object's privates
- easier to change memory buffer internals
- easier to support several buffer types with different internals
- easier to support re-allocation of buffer memory
- easier to provide a thread-safe implementation.

HTH,

Alex.
http://wiki.squid-cache.org/Features/BetterStringBuffer/StringNg