Hello,
str_str function searches for a substring into a bigger string.
str_str function in lib/cds/sstr.c file has a comment: /* FIXME:
reimplement using better algorithm */
Attached patch reimplements that function using
Boyer-Moore-Horspool-Raita algorithm, which is the one most text editors
use to search for substrings.
Boyer-Moore-Horspool-Raita algorithm is a simplification from
Boyer-Moore one, with less preprocessing overhead.
If n is the string length, and m is substring length, naive brute force
algorithm has a cost of O(n*m).
Raita algoritm has same cost in worst case but Ω(n/m) a better average cost.
As a drawback it adds a preprocessing cost of O(m + charset_length)
where charset_length is 256 for us, because we use char.
Another way to approach this substring search issue would be separate it
into two functions, a preprocessing one, and a searching one. Then in
module initializacion where the function is used, calculate the needed
preprocess and when the function is called do the search only.
This Raita algorithm could also be useful for str_search function in
ut.c. Its usefulness depends on n, and m lengths, the bigger they are
the better the algoritm works.
http://en.wikipedia.org/wiki/String_searching_algorithm
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm
http://www-igm.univ-mlv.fr/~lecroq/string/node22.html
Regards,
Vicente.
diff --git lib/cds/sstr.c lib/cds/sstr.c
index 4aabdbe..b94f267 100644
--- lib/cds/sstr.c
+++ lib/cds/sstr.c
@@ -223,29 +223,72 @@ char *str_strchr(const str_t *s, char c)
return NULL;
}
+/*
+ * Boyer-Moore-Horspool-Raita algorithm.
+ */
+#define CHARSET_LEN 256
+unsigned int shift[CHARSET_LEN];
char *str_str(const str_t *s, const str_t *search_for)
{
- int i, j;
- /* FIXME: reimplement using better algorithm */
+ unsigned int i, num_shift, max, middle;
+ unsigned char c, last;
+ unsigned char *p, *q, *top;
if (is_str_empty(search_for)) return s->s;
if (is_str_empty(s)) return NULL;
if (search_for->len > s->len) return NULL;
- j = 0;
- i = 0;
- while (i < s->len) {
- if (s->s[i] == search_for->s[j]) {
- j++;
- i++;
- if (j == search_for->len) return s->s + i - j;
+ /* Preprocess shift vector. */
+ max = search_for->len;
+ for (i=0; i<CHARSET_LEN; i++) {
+ shift[i] = max;
+ }
+
+ q = (unsigned char *)(search_for->s);
+ max--;
+ for (i=0; i<max; i++) {
+ shift[q[i]] = max - i;
+ }
+
+ /* Main loop */
+ p = (unsigned char *)(s->s);
+ top = p + s->len - max;
+ middle = max / 2;
+
+search_array:
+ while (p < top) {
+ last = p[max];
+ num_shift = shift[last];
+ if (last != q[max]) {
+ p += num_shift;
+ goto search_array;
+ }
+
+ c = *p;
+ if (c != *q) {
+ p += num_shift;
+ goto search_array;
+ }
+
+ c = p[middle];
+ if (c != q[middle]) {
+ p += num_shift;
+ goto search_array;
}
- else {
- i = i - j + 1;
- j = 0;
+
+ for(i=1; i<max; i++) {
+ if (p[i] != q[i]) {
+ p += num_shift;
+ goto search_array;
+ }
}
+
+ /* Success. */
+ return (char *)p;
}
+
+ /* Substring not found. */
return NULL;
}
_______________________________________________
sr-dev mailing list
[email protected]
http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev