Hello,

str_str function searches for a substring into a bigger string.

str_str function in lib/cds/sstr.c file has a comment: /* FIXME: reimplement using better algorithm */

Attached patch reimplements that function using Boyer-Moore-Horspool-Raita algorithm, which is the one most text editors use to search for substrings.

Boyer-Moore-Horspool-Raita algorithm is a simplification from Boyer-Moore one, with less preprocessing overhead.


If n is the string length, and m is substring length, naive brute force algorithm has a cost of O(n*m).

Raita algoritm has same cost in worst case but Ω(n/m) a better average cost.
As a drawback it adds a preprocessing cost of O(m + charset_length) where charset_length is 256 for us, because we use char.


Another way to approach this substring search issue would be separate it into two functions, a preprocessing one, and a searching one. Then in module initializacion where the function is used, calculate the needed preprocess and when the function is called do the search only.


This Raita algorithm could also be useful for str_search function in ut.c. Its usefulness depends on n, and m lengths, the bigger they are the better the algoritm works.

http://en.wikipedia.org/wiki/String_searching_algorithm
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm
http://www-igm.univ-mlv.fr/~lecroq/string/node22.html

Regards,
Vicente.




diff --git lib/cds/sstr.c lib/cds/sstr.c
index 4aabdbe..b94f267 100644
--- lib/cds/sstr.c
+++ lib/cds/sstr.c
@@ -223,29 +223,72 @@ char *str_strchr(const str_t *s, char c)
 	return NULL;
 }
 
+/*
+ * Boyer-Moore-Horspool-Raita algorithm.
+ */
+#define CHARSET_LEN	256
+unsigned int shift[CHARSET_LEN];
 char *str_str(const str_t *s, const str_t *search_for)
 {
-	int i, j;
-	/* FIXME: reimplement using better algorithm */
+	unsigned int i, num_shift, max, middle;
+	unsigned char c, last;
+	unsigned char *p, *q, *top;
 
 	if (is_str_empty(search_for)) return s->s;
 	if (is_str_empty(s)) return NULL;
 	
 	if (search_for->len > s->len) return NULL;
 
-	j = 0;
-	i = 0;
-	while (i < s->len) {
-		if (s->s[i] == search_for->s[j]) {
-			j++;
-			i++;
-			if (j == search_for->len) return s->s + i - j;
+	/* Preprocess shift vector. */
+	max = search_for->len;
+	for (i=0; i<CHARSET_LEN; i++) {
+		shift[i] = max;
+	}
+
+	q = (unsigned char *)(search_for->s);
+	max--;
+	for (i=0; i<max; i++) {
+		shift[q[i]] = max - i;
+	}
+
+	/* Main loop */
+	p = (unsigned char *)(s->s);
+	top = p + s->len - max;
+	middle = max / 2;
+
+search_array:
+	while (p < top) {
+		last = p[max];
+		num_shift = shift[last];
+		if (last != q[max]) {
+			p += num_shift;
+			goto search_array;
+		}
+
+		c = *p;
+		if (c != *q) {
+			p += num_shift;
+			goto search_array;
+		}
+
+		c = p[middle];
+		if (c != q[middle]) {
+			p += num_shift;
+			goto search_array;
 		}
-		else {
-			i = i - j + 1;
-			j = 0;
+
+		for(i=1; i<max; i++) {
+			if (p[i] != q[i]) {
+				p += num_shift;
+				goto search_array;
+			}
 		}
+
+		/* Success. */
+		return (char *)p;
 	}
+
+	/* Substring not found. */
 	return NULL;
 }
 
_______________________________________________
sr-dev mailing list
[email protected]
http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

Reply via email to