Re: pseudo-specs for a String class: char *buf
On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:

> I didn't really think of different buffer types. Do you have in mind
> any scenario where it would be useful?

One example is if KBuf gets implemented using a memory allocator that may
reallocate the memory area to reduce fragmentation.

> On the other hand, char* are significantly more efficient for common
> operations, consistently with the design goals..

Agreed.

Regards
Henrik
Re: pseudo-specs for a String class: char *buf
On Thu, 2008-09-04 at 06:47 +0200, Kinkie wrote:
> On Thu, Sep 4, 2008 at 12:31 AM, Alex Rousskov
> <[EMAIL PROTECTED]> wrote:
> > On Thu, 2008-09-04 at 00:12 +0200, Kinkie wrote:
> >
> >> > I do not think an offset would be significantly less efficient in this
> >> > context. I bet 90+% of operations that require raw data access are far
> >> > more expensive than adding an offset to a pointer.
> >>
> >> The most common one is a NULL check, which is hard to express using
> >> (offset/length).
> >
> > As you know, I do not know what you mean by a NULL check (buffers or
> > strings are not pointers). If you mean an isEmpty() check, then it is
> > implemented as (!length) as the offset is irrelevant for an empty
> > string.
>
> There are uses for declaring an object as undefined, which is a
> different thing than a zero-length string

"NULL" and "undefined" are different things for many developers. If you
want to propose a special undefined String state, you can add an
isDefined or isSet method.

An isNull check is implemented as (!bigBuffer), and isDefined as its
negation, where bigBuffer is the reference counting pointer to the
primary buffer. Thus, the argument that isNULL or isDefined requires
storing a raw string pointer for efficiency reasons is invalid. Neither
check needs access to string contents.

FWIW, I would not recommend adding isDefined though because special
states are the primary cause of bugs (developers always forget about
them). Dereferencing NULL pointers in C is a well-known example of that.

> Take the tokenizer for example. a null (we may call it invalid,
> undefined, no-store) KBuf is a very convenient way to signal
> "end-of-stream", as opposed to "a token of zero length".

It is also a convenient way to send null, invalid, undefined, etc.
strings to code that does not expect them.

> Without this, an exception will have to be raised to signal
> the end-of-stream condition.

No exceptions are necessary or desired. Here is a simple tokenizer
API/usage sketch:

    for (Tokenizer tzer(string, delimiter); !tzer.atEnd(); ++tzer) {
        String token = *tzer;
        ...
    }

We can also look at std::iterator but that interface may be slightly
more complex than we need. We can use it if we want to be compatible
with std algorithms, but we probably do not care about those at this
point.

HTH,

Alex.
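[Editorial sketch] The tokenizer loop in the message above can be fleshed out roughly as follows. This is only an illustration under stated assumptions, not the design that was adopted: std::string_view stands in for the String class under discussion, the delimiter is a single character, and the Tokenizer class and the tokenize() helper are hypothetical. The point it shows is that the end of input is reported by atEnd() rather than by a special "null" token, so a zero-length token stays an ordinary value and no exception is needed.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical external tokenizer: it reads the input but never modifies it.
class Tokenizer {
public:
    Tokenizer(std::string_view input, char delimiter)
        : rest_(input), delim_(delimiter) { advance(); }

    // True once every token has been consumed.
    bool atEnd() const { return done_; }

    // Current token; may legitimately be zero-length.
    std::string_view operator*() const { return token_; }

    Tokenizer &operator++() { advance(); return *this; }

private:
    void advance() {
        if (exhausted_) {            // the last token was already handed out
            done_ = true;
            token_ = {};
            return;
        }
        const auto pos = rest_.find(delim_);
        if (pos == std::string_view::npos) {
            token_ = rest_;          // final token: everything that is left
            rest_ = {};
            exhausted_ = true;
        } else {
            token_ = rest_.substr(0, pos);
            rest_.remove_prefix(pos + 1);
        }
    }

    std::string_view rest_;   // input not yet tokenized
    std::string_view token_;  // current token
    char delim_;
    bool exhausted_ = false;  // no input left after the current token
    bool done_ = false;       // iteration finished
};

// Helper that collects all tokens using the loop shape from the message.
std::vector<std::string> tokenize(std::string_view input, char delimiter) {
    std::vector<std::string> out;
    for (Tokenizer tzer(input, delimiter); !tzer.atEnd(); ++tzer)
        out.emplace_back(*tzer);
    return out;
}
```

With this sketch, two adjacent delimiters produce an empty token in the middle of the sequence, and an empty input produces a single empty token (a choice made here for simplicity, similar to many split functions).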
Re: pseudo-specs for a String class: char *buf
On Thu, Sep 4, 2008 at 12:31 AM, Alex Rousskov <[EMAIL PROTECTED]> wrote:
> On Thu, 2008-09-04 at 00:12 +0200, Kinkie wrote:
>
>> > I do not think an offset would be significantly less efficient in this
>> > context. I bet 90+% of operations that require raw data access are far
>> > more expensive than adding an offset to a pointer.
>>
>> The most common one is a NULL check, which is hard to express using
>> (offset/length).
>
> As you know, I do not know what you mean by a NULL check (buffers or
> strings are not pointers). If you mean an isEmpty() check, then it is
> implemented as (!length) as the offset is irrelevant for an empty
> string.

There are uses for declaring an object as undefined, which is a
different thing than a zero-length string.
Take the tokenizer for example. A null (we may call it invalid,
undefined, no-store) KBuf is a very convenient way to signal
"end-of-stream", as opposed to "a token of zero length".
Without this, an exception will have to be raised to signal
the end-of-stream condition.

> If you mean the old MemBuf::isNull() check, then it will disappear when
> a proper buffer class is available (i.e., it will not be common at all).
> That check comes from C code and is irrelevant in this context.

Maybe you're right and I was too C-minded when designing the
class innards.

>> Further extra work would be the need of more temporary storage to
>> rebuild the char* out of (memhandle->mem + offset).
>> Extremely small details, but I expect they'll be very common
>> operations (I bumped the size of the stat-counters to 64bits).
>
> Again, I believe that common operations that require raw data access are
> far more expensive than adding an offset to a pointer.
>
>> In the meantime, I ask everyone who can spare 5 minutes to just grab
>> the code off Launchpad and see it live ("make" will compile and launch
>> a builtin testsuite/demo, "make dox" will generate the class reference
>> - the code is extensively documented).
>
> I trust you that the code compiles and runs, but so does the current
> ugly code, so I am not sure what is the point of "seeing it live",
> especially if we disagree on basic design principles. Do you expect
> something specific from that "5 minute" exercise?

A feel for the presence (or hopefully absence) of obvious warts.

Thanks.

--
/kinkie
Re: pseudo-specs for a String class: char *buf
> On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:
>> On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov
>> <[EMAIL PROTECTED]> wrote:
>> >
>> > I looked at your StringNg wiki page and noticed that your string has
>> > a "char *buf" pointer into the memory buffer (in addition to the buffer
>> > pointer itself). I think it would be better to use an offset instead of
>> > the pointer into internal buffer area:
>>
>> Yes, I had a discussion with Adrian about the same issue earlier on IRC.
>> You make excellent points - as Adrian did :)
>> Here's my take
>>
>> > - cleaner design: no peeking into other object's privates
>> Yes. At the same time the KBuf::Buf class (the "other object") is a
>> private member class of the KBuf class;
>> it's actually little more than a glorified struct, and shouldn't be
>> thought of as a first-level citizen on its own.
>
> Private or not, it is still another object.
>
> And, FWIW, I doubt the memory buffer class will remain inside the string
> class.
>
>> > - easier to change memory buffer internals
>> If you mean "change the buffer contents", that's an operation which
>> should be quite rare.
>> If you mean "change the code" that should be even rarer.
>
> I meant "change the code". I do expect those changes in the foreseeable
> future.
>
>> > - easier to support several buffer types with different internals
>> I didn't really think of different buffer types. Do you have in mind
>> any scenario where it would be useful?
>
> Yes, I do (e.g., small versus large, thread-safe versus not, and
> contiguous versus chunked).
>
> In fact, you kind of documented different buffer implementations
> yourself: "small Bufs (<8Kb) should be managed by MemPools. - Bufs
> bigger than 8Kb should be allocated in sizes compatible with the system
> page size"
>
>> > - easier to support re-allocation of buffer memory
>> > - easier to provide a thread-safe implementation.
>>
>> On the other hand, char* are significantly more efficient for common
>> operations, consistently with the design goals..
>
> I do not think an offset would be significantly less efficient in this
> context. I bet 90+% of operations that require raw data access are far
> more expensive than adding an offset to a pointer.
>
>> I'm not saying that I won't change them, I'd just like to be shown
>> scenarios where it makes a difference.
>
> I believe I provided more than enough reasons and you agreed with at
> least some of them. You have provided one so far ("significantly more
> efficient for common operations"). I think the burden of proof should be
> on you in this case.
>
>> On an unrelated issue, since it was of interest to some of us, here's
>> a sample of the caller code for tokenization functions (actual live
>> code):
>>
>> KBuf s1;
>> cout << "tokenization: \n";
>> {
>>     s1="The quick brown fox jumped over the lazy dog";
>>     char *needle=" ";
>>     KBuf cs1(needle);
>>     while (!s1.isNull()) {
>>         cout << "token: " << s1.nextToken(cs1) << endl;
>>     }
>> }
>> cout << endl;
>
> FWIW, I still think that tokenization should be external to the buffer
> or string and should not modify them. Please see my earlier posts for
> details.

Kinkie, while I like the single-object API design, I think you could get
around all these arguments and confusion by adding a sub-class of KBuf
called KBufTokeniser, which just provides the nextToken API on top of the
String API.

Alex, the basic buffer is not altered, only the position that the s1
offset points at. Kinkie is just not very good at describing that code
loop yet :-( .

From what he mentioned on IRC last night: making s1 a duplicate
reference to another KBuf (i.e., the actual input buffer) should show
that the base KBuf is unchanged, but the parsing with nextToken() will
only spew off a child sub-string and increment the s1 start offset one
token down the string.

I'm in favor: it can be tuned for very efficient parsing, and any
inefficient usage of it can be fixed easily.

Amos
Re: pseudo-specs for a String class: char *buf
On Thu, 2008-09-04 at 00:12 +0200, Kinkie wrote:

> > I do not think an offset would be significantly less efficient in this
> > context. I bet 90+% of operations that require raw data access are far
> > more expensive than adding an offset to a pointer.
>
> The most common one is a NULL check, which is hard to express using
> (offset/length).

As you know, I do not know what you mean by a NULL check (buffers or
strings are not pointers). If you mean an isEmpty() check, then it is
implemented as (!length) as the offset is irrelevant for an empty
string.

If you mean the old MemBuf::isNull() check, then it will disappear when
a proper buffer class is available (i.e., it will not be common at all).
That check comes from C code and is irrelevant in this context.

> Further extra work would be the need of more temporary storage to
> rebuild the char* out of (memhandle->mem + offset).
> Extremely small details, but I expect they'll be very common
> operations (I bumped the size of the stat-counters to 64bits).

Again, I believe that common operations that require raw data access are
far more expensive than adding an offset to a pointer.

> In the meantime, I ask everyone who can spare 5 minutes to just grab
> the code off Launchpad and see it live ("make" will compile and launch
> a builtin testsuite/demo, "make dox" will generate the class reference
> - the code is extensively documented).

I trust you that the code compiles and runs, but so does the current
ugly code, so I am not sure what is the point of "seeing it live",
especially if we disagree on basic design principles. Do you expect
something specific from that "5 minute" exercise?

Thank you,

Alex.
Re: pseudo-specs for a String class: char *buf
On Wed, Sep 3, 2008 at 5:49 PM, Alex Rousskov <[EMAIL PROTECTED]> wrote:
> On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:
>> On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov
>> <[EMAIL PROTECTED]> wrote:
>> >
>> > I looked at your StringNg wiki page and noticed that your string has
>> > a "char *buf" pointer into the memory buffer (in addition to the buffer
>> > pointer itself). I think it would be better to use an offset instead of
>> > the pointer into internal buffer area:
>>
>> Yes, I had a discussion with Adrian about the same issue earlier on IRC.
>> You make excellent points - as Adrian did :)
>> Here's my take
>>
>> > - cleaner design: no peeking into other object's privates
>> Yes. At the same time the KBuf::Buf class (the "other object") is a
>> private member class of the KBuf class;
>> it's actually little more than a glorified struct, and shouldn't be
>> thought of as a first-level citizen on its own.
>
> Private or not, it is still another object.
>
> And, FWIW, I doubt the memory buffer class will remain inside the string
> class.

I'm not planning to take it out, and if no one else does...

>> > - easier to change memory buffer internals
>> If you mean "change the buffer contents", that's an operation which
>> should be quite rare.
>> If you mean "change the code" that should be even rarer.
>
> I meant "change the code". I do expect those changes in the foreseeable
> future.
>
>> > - easier to support several buffer types with different internals
>> I didn't really think of different buffer types. Do you have in mind
>> any scenario where it would be useful?
>
> Yes, I do (e.g., small versus large, thread-safe versus not, and
> contiguous versus chunked).

Chunks are out-of-scope, they are better dealt with by a KBufList class.
Honestly the only argument I buy 100% is easier thread-safety.

> In fact, you kind of documented different buffer implementations
> yourself: "small Bufs (<8Kb) should be managed by MemPools. - Bufs
> bigger than 8Kb should be allocated in sizes compatible with the system
> page size"

The "magic" is in the allocation strategy function not for the Buf
itself, but for its underlying storage. We want to allocate a bit more
than strictly needed to be able to grow but not too much. Quoting:

    void init(size_t size) {
        pagesize=sysconf(_SC_PAGESIZE); //FIXME: make this autoconf-based
        size_t actualsize=(size*12)/10; //FIXME: make self-tuning using nreallocs
        // arbitrary allocation algorithm
        if (actualsize <= 64)
            actualsize=64+malloc_overhead;
        else if (actualsize <= 64*pagesize)
            actualsize=nearestPowerOf2(actualsize);
        else
            actualsize+=actualsize%pagesize; //increase to closest pagesize
        //shrink: we WANT to fit in a page
        actualsize-=malloc_overhead;
        mem=new char[actualsize];
        bufsize=actualsize;
        bufused=0;
        refs=1;
        debugs("Buf@" <<(void *)this <<"::init(). req:"< [...]

>> > - easier to support re-allocation of buffer memory
>> > - easier to provide a thread-safe implementation.
>>
>> On the other hand, char* are significantly more efficient for common
>> operations, consistently with the design goals..
>
> I do not think an offset would be significantly less efficient in this
> context. I bet 90+% of operations that require raw data access are far
> more expensive than adding an offset to a pointer.

The most common one is a NULL check, which is hard to express using
(offset/length).
Further extra work would be the need of more temporary storage to
rebuild the char* out of (memhandle->mem + offset).
Extremely small details, but I expect they'll be very common
operations (I bumped the size of the stat-counters to 64bits).

>> I'm not saying that I won't change them, I'd just like to be shown
>> scenarios where it makes a difference.
>
> I believe I provided more than enough reasons and you agreed with at
> least some of them. You have provided one so far ("significantly more
> efficient for common operations"). I think the burden of proof should be
> on you in this case.

What I can (and will) do is try and see if and how much these changes
would complicate the code.
In the meantime, I ask everyone who can spare 5 minutes to just grab
the code off Launchpad and see it live ("make" will compile and launch
a builtin testsuite/demo, "make dox" will generate the class reference
- the code is extensively documented).

--
/kinkie
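[Editorial sketch] The sizing policy quoted in the message above (20% headroom, a 64-byte floor, powers of two up to 64 pages, page multiples beyond that, minus the allocator overhead) can be restated as a pure function. This is a hedged restatement, not the project's code: allocationSize(), roundUpToMultiple() and roundUpToPowerOf2() are hypothetical names, the page size and malloc overhead are passed in as parameters instead of being queried, and nearestPowerOf2() is assumed to round up. One deliberate difference: adding (size % pagesize) to a size does not in general yield a multiple of the page size, so the sketch uses the usual round-up formula for that branch.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of unit (unit must be non-zero).
constexpr std::size_t roundUpToMultiple(std::size_t n, std::size_t unit) {
    return ((n + unit - 1) / unit) * unit;
}

// Round n up to the next power of two (assumed meaning of nearestPowerOf2).
constexpr std::size_t roundUpToPowerOf2(std::size_t n) {
    std::size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

// Hypothetical restatement of the quoted policy. Returns the number of
// usable bytes to request so that the block, including the allocator's
// own overhead, fits the chosen size class.
constexpr std::size_t allocationSize(std::size_t requested,
                                     std::size_t pageSize,
                                     std::size_t mallocOverhead) {
    std::size_t size = (requested * 12) / 10;      // 20% room to grow
    if (size <= 64)
        size = 64 + mallocOverhead;                // small floor
    else if (size <= 64 * pageSize)
        size = roundUpToPowerOf2(size);            // medium: power of two
    else
        size = roundUpToMultiple(size, pageSize);  // large: whole pages
    return size - mallocOverhead;                  // leave room for overhead
}
```

For example, with a 4096-byte page and 16 bytes of overhead (both assumed values), a 100-byte request becomes 120 bytes with headroom, is rounded to 128, and yields 112 usable bytes.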
Re: pseudo-specs for a String class: char *buf
On Wed, 2008-09-03 at 16:53 +0200, Kinkie wrote:
> On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov
> <[EMAIL PROTECTED]> wrote:
> >
> > I looked at your StringNg wiki page and noticed that your string has
> > a "char *buf" pointer into the memory buffer (in addition to the buffer
> > pointer itself). I think it would be better to use an offset instead of
> > the pointer into internal buffer area:
>
> Yes, I had a discussion with Adrian about the same issue earlier on IRC.
> You make excellent points - as Adrian did :)
> Here's my take
>
> > - cleaner design: no peeking into other object's privates
> Yes. At the same time the KBuf::Buf class (the "other object") is a
> private member class of the KBuf class;
> it's actually little more than a glorified struct, and shouldn't be
> thought of as a first-level citizen on its own.

Private or not, it is still another object.

And, FWIW, I doubt the memory buffer class will remain inside the string
class.

> > - easier to change memory buffer internals
> If you mean "change the buffer contents", that's an operation which
> should be quite rare.
> If you mean "change the code" that should be even rarer.

I meant "change the code". I do expect those changes in the foreseeable
future.

> > - easier to support several buffer types with different internals
> I didn't really think of different buffer types. Do you have in mind
> any scenario where it would be useful?

Yes, I do (e.g., small versus large, thread-safe versus not, and
contiguous versus chunked).

In fact, you kind of documented different buffer implementations
yourself: "small Bufs (<8Kb) should be managed by MemPools. - Bufs
bigger than 8Kb should be allocated in sizes compatible with the system
page size"

> > - easier to support re-allocation of buffer memory
> > - easier to provide a thread-safe implementation.
>
> On the other hand, char* are significantly more efficient for common
> operations, consistently with the design goals..

I do not think an offset would be significantly less efficient in this
context. I bet 90+% of operations that require raw data access are far
more expensive than adding an offset to a pointer.

> I'm not saying that I won't change them, I'd just like to be shown
> scenarios where it makes a difference.

I believe I provided more than enough reasons and you agreed with at
least some of them. You have provided one so far ("significantly more
efficient for common operations"). I think the burden of proof should be
on you in this case.

> On an unrelated issue, since it was of interest to some of us, here's
> a sample of the caller code for tokenization functions (actual live
> code):
>
> KBuf s1;
> cout << "tokenization: \n";
> {
>     s1="The quick brown fox jumped over the lazy dog";
>     char *needle=" ";
>     KBuf cs1(needle);
>     while (!s1.isNull()) {
>         cout << "token: " << s1.nextToken(cs1) << endl;
>     }
> }
> cout << endl;

FWIW, I still think that tokenization should be external to the buffer
or string and should not modify them. Please see my earlier posts for
details.

Thank you,

Alex.
Re: pseudo-specs for a String class: char *buf
On Wed, Sep 3, 2008 at 3:59 PM, Alex Rousskov <[EMAIL PROTECTED]> wrote:
> Hi Kinkie,
>
> I looked at your StringNg wiki page and noticed that your string has
> a "char *buf" pointer into the memory buffer (in addition to the buffer
> pointer itself). I think it would be better to use an offset instead of
> the pointer into internal buffer area:

Yes, I had a discussion with Adrian about the same issue earlier on IRC.
You make excellent points - as Adrian did :)
Here's my take

> - cleaner design: no peeking into other object's privates

Yes. At the same time the KBuf::Buf class (the "other object") is a
private member class of the KBuf class;
it's actually little more than a glorified struct, and shouldn't be
thought of as a first-level citizen on its own.

> - easier to change memory buffer internals

If you mean "change the buffer contents", that's an operation which
should be quite rare.
If you mean "change the code" that should be even rarer.

> - easier to support several buffer types with different internals

I didn't really think of different buffer types. Do you have in mind
any scenario where it would be useful?

> - easier to support re-allocation of buffer memory
> - easier to provide a thread-safe implementation.

On the other hand, char* are significantly more efficient for common
operations, consistently with the design goals..
I'm not saying that I won't change them, I'd just like to be shown
scenarios where it makes a difference.

On an unrelated issue, since it was of interest to some of us, here's
a sample of the caller code for tokenization functions (actual live
code):

    KBuf s1;
    cout << "tokenization: \n";
    {
        s1="The quick brown fox jumped over the lazy dog";
        char *needle=" ";
        KBuf cs1(needle);
        while (!s1.isNull()) {
            cout << "token: " << s1.nextToken(cs1) << endl;
        }
    }
    cout << endl;

And here's the output:

    tokenization:
    token: The
    token: quick
    token: brown
    token: fox
    token: jumped
    token: over
    token: the
    token: lazy
    token: dog

--
/kinkie
Re: pseudo-specs for a String class: char *buf
Hi Kinkie,

I looked at your StringNg wiki page and noticed that your string has
a "char *buf" pointer into the memory buffer (in addition to the buffer
pointer itself). I think it would be better to use an offset instead of
the pointer into internal buffer area:

- cleaner design: no peeking into other object's privates
- easier to change memory buffer internals
- easier to support several buffer types with different internals
- easier to support re-allocation of buffer memory
- easier to provide a thread-safe implementation.

HTH,

Alex.

http://wiki.squid-cache.org/Features/BetterStringBuffer/StringNg
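[Editorial sketch] The "easier to support re-allocation of buffer memory" point from the list above can be illustrated with a small example. It is a sketch under stated assumptions, not the StringNg design: Buf and OffsetString are hypothetical stand-ins, std::vector<char> plays the role of the memory buffer, and std::shared_ptr replaces the reference-counting pointer mentioned in the thread. The string stores only (offset, length) and computes the raw pointer on demand, so it stays valid when the buffer moves its storage, whereas a cached char* into the old storage would dangle.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shared memory buffer; its storage may move.
struct Buf {
    std::vector<char> mem;
};

// Hypothetical string that records (offset, length) instead of a char*.
class OffsetString {
public:
    OffsetString(std::shared_ptr<Buf> buf, std::size_t offset, std::size_t length)
        : buf_(std::move(buf)), offset_(offset), length_(length) {}

    // As noted in the thread, the offset is irrelevant for an empty string.
    bool isEmpty() const { return !length_; }

    // The raw pointer is rebuilt on demand: one addition per access.
    const char *rawContent() const { return buf_->mem.data() + offset_; }

    std::string toStdString() const { return std::string(rawContent(), length_); }

private:
    std::shared_ptr<Buf> buf_;  // shared ownership of the buffer
    std::size_t offset_;        // start of this string inside the buffer
    std::size_t length_;        // number of bytes in this string
};

// Returns true if a sub-string still reads correctly after the buffer
// has been forced to reallocate its storage.
bool survivesReallocation() {
    auto buf = std::make_shared<Buf>();
    const std::string text = "quick brown fox";
    buf->mem.assign(text.begin(), text.end());

    OffsetString word(buf, 6, 5);  // refers to "brown"

    // Requesting a larger capacity makes the vector move its elements.
    buf->mem.reserve(buf->mem.capacity() * 4 + 1024);

    return !word.isEmpty() && word.toStdString() == "brown";
}
```

The cost of this design is the extra addition in rawContent(), which is the overhead debated in the replies.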