RE: UTF-8 Syntax

Marco Cimarosti Mon, 11 Jun 2001 07:30:47 -0700
Toby Phipps wrote:
> This is incredibly inefficient, not only
> because significant amounts of temporary space needs to be 
> allocated and
> freed, but also because the entire result set of the query has to be
> processed and sorted before the first row is returned.  With 
> result sets
> involving several million rows, this is a very significant overhead,
> especially if the typical user only looks at the first couple 
> of hundred.

As a second thought, I would examine these statement by a more pragmatic
point of view.

I have no expertise in the internals of databases, so I am sure that Toby
will correct this haphazarded statement: converting UTF-16 (& UTF-8s) binary
order to UTF-8 (and UTF-32) binary order you don't need a complex and
resource-consuming algorithm as you would need for lexical collation.

I dare to say that such an algorithm just needs 2 pointers, 2 binary
searches, and a reorganization of the list.

Suppose that you have a query against a UTF-16 database, and you execute
using your usual super-efficient binary order. At the end, it yields this
result (binary sorted in UTF-16 code units):

        1:      0x0068 0x0069                   (U+0068 U+0069)
        2:      0x0068 0x006F                   (U+0068 U+0069)
        3:      0xD800 0xDF07 0xD800 0xDF09     (U-010307 U-010309)
        4:      0xD800 0xDF07 0xD800 0xDF0F     (U-010307 U-01030F)
        5:      0xFF48 0xFF49                   (U+FF48 U+FF49)
        6:      0xFF48 0xFF4F                   (U+FF48 U+FF49)

To deliver this in code-point order (i.e. in UTF-8/UTF-32 order), you just
need this algorithm (coded in an imaginary programming language):

        declare FirstSurrog as row index
        FirstSurrog = SEARCH_FIRST_ROW(
                                where first code unit >= 0xD800
                                and first code unit < 0xE000)
        if found then 
                declare FirstPostSurrog as row index
                FirstPostSurrog = SEARCH_FIRST_ROW(
                                                where first code unit >=
0xE000)
                if found then
                        SWAP_ROW_BLOCKS(
                                FirstSurrog .. FirstPostSurrog - 1,
                                FirstPostSurrog .. last)
                end if
        end if

In the case above, the result set would become:

        1:      0x0068 0x0069                   (U+0068 U+0069)
        2:      0x0068 0x006F                   (U+0068 U+0069)
        3:      0xFF48 0xFF49                   (U+FF48 U+FF49)
        4:      0xFF48 0xFF4F                   (U+FF48 U+FF49)
        5:      0xD800 0xDF07 0xD800 0xDF09     (U-010307 U-010309)
        6:      0xD800 0xDF07 0xD800 0xDF0F     (U-010307 U-01030F)

Is it correct to assume that the operation which I called SEARCH_FIRST_ROW
above should take a maximum of 16 comparisons on a b-tree?

And is it correct to assume that the operation which I called
SWAP_ROW_BLOCKS can be implemented virtually (i.e. by traversing the tree in
2 passes, visiting different branches), rather than by moving actual data in
memory?

So, why couldn't this be efficient enough also for a list with millions
rows?

_ Marco
RE: UTF-8 Syntax

Reply via email to