Problems with Wildcard Queries / Own Filter

Björn Keil Tue, 15 Oct 2019 06:45:46 -0700

Hello,

I am having a bit of a problem with wildcard queries and I don't know how
to pin it down yet. I have a suspect, one of the filters on the search
field in question, but I can't find an error in it.


The problem is that when I do a wildcard query:
title:todesmä*
it does return a result, but it also returns results that would match
title:todesma*. It is not supposed to do that because, due to the filter,
the query should be equivalent to title:todesmae*
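
To make the expectation concrete, here is a minimal sketch in plain Java (no
Lucene involved). The class UmlautPrefixSketch and the helpers normalize() and
prefixMatches() are made-up names for illustration only; they just apply the
same substitutions as the filter to both the wildcard prefix and the candidate
term and then compare prefixes. The umlauts are written as Unicode escapes so
the file compiles regardless of source encoding.

```java
class UmlautPrefixSketch {

    // Hypothetical helper: the filter's substitutions applied to a plain String.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00e4': sb.append("ae"); break; // a umlaut
                case '\u00f6': sb.append("oe"); break; // o umlaut
                case '\u00fc': sb.append("ue"); break; // u umlaut
                case '\u00c4': sb.append("AE"); break; // A umlaut
                case '\u00d6': sb.append("OE"); break; // O umlaut
                case '\u00dc': sb.append("UE"); break; // U umlaut
                case '\u00df': sb.append("ss"); break; // sharp s
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: prefix match after normalizing both sides,
    // which is the behaviour expected from the wildcard query.
    static boolean prefixMatches(String prefix, String term) {
        return normalize(term).startsWith(normalize(prefix));
    }

    public static void main(String[] args) {
        System.out.println(normalize("todesm\u00e4"));                          // todesmae
        System.out.println(prefixMatches("todesm\u00e4", "todesm\u00e4rsche")); // true
        System.out.println(prefixMatches("todesm\u00e4", "todesmarsch"));       // false
    }
}
```

This only describes the intended result; it says nothing about how the query
parser actually treats the wildcard term.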

The real problem is that if I search for title:todesmär* it no longer finds
anything at all. There are titles in the index containing "todesmärsche"
and "todesmärchen", which should match.

I have stepped through the filter in a debugger, but I could not find
anything wrong with it. It's supposed to replace the "ä" with "ae", which it
does; it calls termAtt.resizeBuffer() before copying and termAtt.setLength()
afterwards. The result seems perfectly alright. What it does not change is
the endOffset attribute of the CharTermAttribute object. That's probably
because it's counting bytes, not characters: I replaced a single two-byte
char with two one-byte chars, so the endOffset stays the same.
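
To illustrate the counting I have in mind, here is a small plain-Java sketch.
It assumes UTF-8 as the byte encoding, and the class and method names are made
up for this example. Whether the offsets in Lucene are really byte-based is
only my guess; the sketch shows nothing more than how the two ways of counting
differ for "ä" versus "ae".

```java
import java.nio.charset.StandardCharsets;

class CharVsByteSketch {

    // Number of Java chars (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes when the string is encoded as UTF-8 (assumed encoding).
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // a umlaut, written as an escape
        String replaced = "ae";

        System.out.println(charCount(umlaut));       // 1
        System.out.println(utf8ByteCount(umlaut));   // 2
        System.out.println(charCount(replaced));     // 2
        System.out.println(utf8ByteCount(replaced)); // 2
    }
}
```

So the byte count is unchanged by the replacement, while the char count grows
by one for every replaced character.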

Could anybody tell me whether there is anything wrong with the filter in
the attachment?

package de.example.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * This TokenFilter replaces German umlauts and the character ß with a normalized form in ASCII characters.
 * 
 * <ul><li>ü => ue</li>
 * <li>ß => ss</li>
 * <li>etc.</li></ul>
 * 
 * This enables a sort order according to DIN 5007, variant 2, the so-called "phone book" sort order.
 * 
 * @see org.apache.lucene.analysis.TokenStream
 *
 */
public class GermanUmaultFilter extends TokenFilter {
	
	private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

	/**
	 * @see org.apache.lucene.analysis.TokenFilter#TokenFilter(TokenStream)
	 * @param input TokenStream with the tokens to filter
	 */
	public GermanUmaultFilter(TokenStream input) {
		super(input);
	}

	/**
	 * Performs the actual filtering when the consumer requests the next token.
	 * 
	 * @see org.apache.lucene.analysis.TokenStream#incrementToken()
	 * @return true if a token is available, false once the stream is exhausted
	 */
	@Override
	public final boolean incrementToken() throws IOException {
		if (input.incrementToken()) {
			int countReplacements = 0;
			char[] origBuffer = termAtt.buffer();
			int origLength = termAtt.length();
			// Figure out how many replacements we need to get the size of the new buffer
			for (int i = 0; i < origLength; i++) {
				if (origBuffer[i] == 'ü'
					|| origBuffer[i] == 'ä'
					|| origBuffer[i] == 'ö'
					|| origBuffer[i] == 'ß'
					|| origBuffer[i] == 'Ä'
					|| origBuffer[i] == 'Ö'
					|| origBuffer[i] == 'Ü'
				) {
					countReplacements++;
				}
			}
			
			// If there is a replacement create a new buffer of the appropriate length...
			if (countReplacements != 0) {
				int newLength = origLength + countReplacements;
				char[] target = new char[newLength];
				int j = 0;
				// ... perform the replacement ...
				for (int i = 0; i < origLength; i++) {
					switch (origBuffer[i]) {
					case 'ä':
						target[j++] = 'a';
						target[j++] = 'e';
						break;
					case 'ö':
						target[j++] = 'o';
						target[j++] = 'e';
						break;
					case 'ü':
						target[j++] = 'u';
						target[j++] = 'e';
						break;
					case 'Ä':
						target[j++] = 'A';
						target[j++] = 'E';
						break;
					case 'Ö':
						target[j++] = 'O';
						target[j++] = 'E';
						break;
					case 'Ü':
						target[j++] = 'U';
						target[j++] = 'E';
						break;
					case 'ß':
						target[j++] = 's';
						target[j++] = 's';
						break;
					default:
						target[j++] = origBuffer[i];
					}
				}
				// ... make sure the attribute's buffer is large enough, copy the new buffer
				// and set the length ...
				termAtt.resizeBuffer(newLength);
				termAtt.copyBuffer(target, 0, newLength);
				termAtt.setLength(newLength);
			}
			return true;
		} else {
			return false;
		}
	}

}
