We use this structure in ICU (in UnicodeSet). For a high-level explanation,
see my site (www.macchiato.com), "Bits of Unicode".


As to the binary search, we have used in various contexts before a
"completely unrolled" binary loop, following John Bentley's formulation. It
is quite an interesting structure. Here is a Java version: it is about 40%
faster than the straightforward binary search (this is not a production
version).

    /**
     * @return greatest index such that data[index] <= searchValue
     * If there is no such index (e.g. searchValue < data[0]), then -1 is
returned
     */

 public int highestIndexLEQ(int searchValue) {

  if (!isValid) validate();
  int temp;

  // set up initial range to search. Each subrange is a power of two in
length
  int high = searchValue < data[topOfLow] ? topOfLow : topOfHigh;

  // Completely unrolled binary search, following "Programming Pearls"
  // Each case deliberately falls through to the next
  // Logically, data[-1] < all_search_values && data[count] >
all_search_values
  // although the values -1 and count are never actually touched.

  // The bounds at each point are low & high,
  // where low == high - delta*2
  // so high - delta is the midpoint

  // The invariant AFTER each line is that data[low] < searchValue <=
data[high]

  switch (power) {
  //case 31: if (searchValue < data[temp = high-0x40000000]) high = temp; //
no unsigned int in Java
  case 30: if (searchValue < data[temp = high-0x20000000]) high = temp;
  case 29: if (searchValue < data[temp = high-0x10000000]) high = temp;

  case 28: if (searchValue < data[temp = high- 0x8000000]) high = temp;
  case 27: if (searchValue < data[temp = high- 0x4000000]) high = temp;
  case 26: if (searchValue < data[temp = high- 0x2000000]) high = temp;
  case 25: if (searchValue < data[temp = high- 0x1000000]) high = temp;

  case 24: if (searchValue < data[temp = high-  0x800000]) high = temp;
  case 23: if (searchValue < data[temp = high-  0x400000]) high = temp;
  case 22: if (searchValue < data[temp = high-  0x200000]) high = temp;
  case 21: if (searchValue < data[temp = high-  0x100000]) high = temp;

  case 20: if (searchValue < data[temp = high-   0x80000]) high = temp;
  case 19: if (searchValue < data[temp = high-   0x40000]) high = temp;
  case 18: if (searchValue < data[temp = high-   0x20000]) high = temp;
  case 17: if (searchValue < data[temp = high-   0x10000]) high = temp;

  case 16: if (searchValue < data[temp = high-    0x8000]) high = temp;
  case 15: if (searchValue < data[temp = high-    0x4000]) high = temp;
  case 14: if (searchValue < data[temp = high-    0x2000]) high = temp;
  case 13: if (searchValue < data[temp = high-    0x1000]) high = temp;

  case 12: if (searchValue < data[temp = high-     0x800]) high = temp;
  case 11: if (searchValue < data[temp = high-     0x400]) high = temp;
  case 10: if (searchValue < data[temp = high-     0x200]) high = temp;
  case  9: if (searchValue < data[temp = high-     0x100]) high = temp;

  case  8: if (searchValue < data[temp = high-      0x80]) high = temp;
  case  7: if (searchValue < data[temp = high-      0x40]) high = temp;
  case  6: if (searchValue < data[temp = high-      0x20]) high = temp;
  case  5: if (searchValue < data[temp = high-      0x10]) high = temp;

  case  4: if (searchValue < data[temp = high-       0x8]) high = temp;
  case  3: if (searchValue < data[temp = high-       0x4]) high = temp;
  case  2: if (searchValue < data[temp = high-       0x2]) high = temp;
  case  1: if (searchValue < data[temp = high-       0x1]) high = temp;
  }
  if (high == topOfHigh && searchValue >= data[high]) return high;
  return high-1;
 }


 // NOTE: on some machines the above may not be optimal, if the size of the
function
 // forces code out of the cache. For that case, it would be better for
program in a loop, like the following

 public int highestIndexLEQ2(int searchValue) {

  if (!isValid) validate();
  int temp;
  int high = searchValue < data[topOfLow] ? topOfLow : topOfHigh;
  for (int delta = deltaStart; delta != 0; delta >>= 1) {
            if (searchValue < data[temp = high-delta]) high = temp;
        }
  if (high == topOfHigh && searchValue >= data[high]) return high;
  return high-1;
 }



 // ================ Privates ================

 // data

 int data[];
 int count;

    // validate internal parameters

 private void validate() {
     if (count <= 1) throw new IllegalArgumentException("Array must have at
least 2 elements");

  // find greatest power of 2 less than or equal to count
  for (power = exp2.length-1; power > 0 && exp2[power] > count; power--) {}

  // determine the starting points
  topOfLow = exp2[power] - 1;
  topOfHigh = count - 1;
  deltaStart = exp2[power-1];
  isValid = true;
 }

 private boolean isValid = false;
 private int topOfLow;
 private int topOfHigh;
 private int power;
 private int deltaStart;

 private static final int exp2[] = {
     0x1, 0x2, 0x4, 0x8,
     0x10, 0x20, 0x40, 0x80,
     0x100, 0x200, 0x400, 0x800,
     0x1000, 0x2000, 0x4000, 0x8000,
     0x10000, 0x20000, 0x40000, 0x80000,
     0x100000, 0x200000, 0x400000, 0x800000,
     0x1000000, 0x2000000, 0x4000000, 0x8000000,
     0x10000000, 0x20000000 // , 0x40000000 // no unsigned int in Java
 };


Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Elliotte Rusty Harold" <[EMAIL PROTECTED]>
To: "John Cowan" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, October 08, 2002 06:25
Subject: Re: ISO 8859-11 (Thai) cross-mapping table


> At 8:44 AM -0400 10/8/02, John Cowan wrote:
>
>
> >The underlying data structure here is called a "range table", and is
> >a list of ranges in codepoint order, expressed thus:
> >
> > start of first range
> > end of first range + 1
> > start of second range
> > end of second range + 1
> >
> >etc. etc.  What you are doing is equivalent to a linear search over
> >this range followed by loop unrolling.  However, you can do better,
> >especially in complex cases, with a *binary* search over the range
> >followed by loop unrolling.  The trick here is that if the binary
> >search returns an even value, it succeeds; an odd value fails.
> >
>
> Interesting. Do you have any references on that I can explore
> further? A quick google search didn't turn up anything relevant. I'm
> curious to see how the algorithm actually works.
> --
>
> +-----------------------+------------------------+-------------------+
> | Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
> +-----------------------+------------------------+-------------------+
> |          XML in a  Nutshell, 2nd Edition (O'Reilly, 2002)          |
> |              http://www.cafeconleche.org/books/xian2/              |
> |  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
> +----------------------------------+---------------------------------+
> |  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
> |  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
> +----------------------------------+---------------------------------+
>
>


Reply via email to