wrong tokens output from WordDelimiterFilter when english possessives are in
the text
-------------------------------------------------------------------------------------
Key: SOLR-1706
URL: https://issues.apache.org/jira/browse/SOLR-1706
Project: Solr
Issue Type: Bug
Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Robert Muir
The WordDelimiterFilter English possessive stemming ("'s" removal, on by
default) unfortunately causes strange behavior.
Below, I have requested only numeric concatenations (not words), yet the
English possessive stems are still sometimes output. This ignores the options
I have provided, and it does so inconsistently: the stem appears in the first
two cases but not in the last two.
{code}
// Flags: only catenateNumbers and stemEnglishPossessive are enabled.
// Arrays: expected terms, start offsets, end offsets, types, position increments.
// types is passed as null to match the assertWdf signature; this assumes
// assertTokenStreamContents skips the type check when types is null.
assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder" },
    new int[] { 18, 21 },
    new int[] { 20, 30 },
    null,
    new int[] { 1, 1 });
assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder", "56" },
    new int[] { 18, 21, 33 },
    new int[] { 20, 30, 35 },
    null,
    new int[] { 1, 1, 1 });
assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] { },
    new int[] { },
    new int[] { },
    null,
    new int[] { });
assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42" },
    new int[] { 18 },
    new int[] { 20 },
    null,
    new int[] { 1 });
{code}
where assertWdf is
{code}
void assertWdf(String text, int generateWordParts, int generateNumberParts,
    int catenateWords, int catenateNumbers, int catenateAll,
    int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
    int stemEnglishPossessive, CharArraySet protWords, String expected[],
    int startOffsets[], int endOffsets[], String types[], int posIncs[])
    throws IOException {
  TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
  WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
      generateNumberParts, catenateWords, catenateNumbers, catenateAll,
      splitOnCaseChange, preserveOriginal, splitOnNumerics,
      stemEnglishPossessive, protWords);
  assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, types,
      posIncs);
}
{code}
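For comparison, here is a rough, self-contained sketch of what "only numeric concatenations" might be expected to produce for the four inputs above. It is a toy model and not WordDelimiterFilter itself: the class `NumericCatenationSketch` and the method `numericRuns` are hypothetical names, it splits only on '-', and it treats mixed parts such as "XL500" as non-numeric. Under those assumptions the possessive stem "AutoCoder" would never be emitted, and the second input would yield "42" and "56".

```java
import java.util.ArrayList;
import java.util.List;

public class NumericCatenationSketch {

    // Hypothetical helper (not part of Solr or Lucene): returns the runs of
    // adjacent all-digit parts, concatenated, and drops every other part.
    static List<String> numericRuns(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String part : text.split("-")) {
            // Assumption: a trailing "'s" is stripped before classifying the part.
            if (part.endsWith("'s")) {
                part = part.substring(0, part.length() - 2);
            }
            if (part.isEmpty()) {
                continue; // consecutive delimiters do not break a run
            }
            if (part.chars().allMatch(Character::isDigit)) {
                run.append(part); // extend the current numeric run
            } else if (run.length() > 0) {
                out.add(run.toString()); // a non-numeric part ends the run
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "Super-Duper-XL500-42-AutoCoder's",
            "Super-Duper-XL500-42-AutoCoder's-56",
            "Super-Duper-XL500-AB-AutoCoder's",
            "Super-Duper-XL500-42-AutoCoder's-BC"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + numericRuns(input));
        }
    }
}
```

The sketch ignores offsets and position increments; it only illustrates which terms a words-off, numbers-concatenated configuration would plausibly keep.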