[ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797466#action_12797466 ]
Robert Muir commented on SOLR-1706:
-----------------------------------

OK, I narrowed this one down some. It appears to be completely unrelated to possessives and looks like some other off-by-one bug:

{code}
public void test0() throws Exception {
  assertWdf("1-a-2 3-b-c-4 5-d-e 6-f", 0,0,0,0,0,0,0,0,0, null,
      new String[] { },
      new int[] { },
      new int[] { },
      new int[] { });
}

public void test32() throws Exception {
  assertWdf("1-a-2 3-b-c-4 5-d-e 6-f", 0,0,0,1,0,0,0,0,0, null,
      new String[] { "1", "a", "2", "3", "4", "5", "6", "f" },
      new int[] { 0, 2, 4, 6, 12, 14, 20, 22 },
      new int[] { 1, 3, 5, 7, 13, 15, 21, 23 },
      new int[] { 1, 1, 1, 1, 1, 1, 1, 1 });
}
{code}

> wrong tokens output from WordDelimiterFilter when english possessives are in the text
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-1706
>                 URL: https://issues.apache.org/jira/browse/SOLR-1706
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>
> The WordDelimiterFilter English possessive stemming ("'s" removal, on by
> default) unfortunately causes strange behavior.
> Below you can see that when I request only numeric concatenations (not
> words), these English possessive stems are still sometimes output, ignoring
> the options I provided, and even then in a very inconsistent way.
> {code}
> assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { "42", "AutoCoder" },
>     new int[] { 18, 21 },
>     new int[] { 20, 30 },
>     new int[] { 1, 1 });
> assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { "42", "AutoCoder", "56" },
>     new int[] { 18, 21, 33 },
>     new int[] { 20, 30, 35 },
>     new int[] { 1, 1, 1 });
> assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { },
>     new int[] { },
>     new int[] { },
>     new int[] { });
> assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { "42" },
>     new int[] { 18 },
>     new int[] { 20 },
>     new int[] { 1 });
> {code}
> where assertWdf is
> {code}
> void assertWdf(String text, int generateWordParts, int generateNumberParts,
>     int catenateWords, int catenateNumbers, int catenateAll,
>     int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
>     int stemEnglishPossessive, CharArraySet protWords, String expected[],
>     int startOffsets[], int endOffsets[], String types[], int posIncs[])
>     throws IOException {
>   TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
>   WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
>       generateNumberParts, catenateWords, catenateNumbers, catenateAll,
>       splitOnCaseChange, preserveOriginal, splitOnNumerics,
>       stemEnglishPossessive, protWords);
>   assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, types,
>       posIncs);
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
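For anyone checking the offsets in test32 by hand, here is a rough standalone sketch (plain Java, no Lucene/Solr classes) that walks the same input string and prints each delimiter-separated part with its start/end offset and whether it is all digits. It is not how WordDelimiterFilter works internally; the class name OffsetSketch, the method parts(), and the NUM/WORD labels are made up for illustration, and it simply assumes that anything other than a letter or digit is a delimiter.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {

    /** Formats each part as "text@start-end:TYPE" (hypothetical helper, not Solr code). */
    public static List<String> parts(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Assumption: any character that is not a letter or digit is a delimiter.
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String part = text.substring(start, i);
            // A part counts as numeric only if every character is a digit.
            boolean numeric = part.chars().allMatch(Character::isDigit);
            out.add(part + "@" + start + "-" + i + ":" + (numeric ? "NUM" : "WORD"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String p : parts("1-a-2 3-b-c-4 5-d-e 6-f")) {
            System.out.println(p);
        }
    }
}
```

Read against test32, the parts at offsets 2-3 ("a") and 22-23 ("f") come out as WORD, while the other six tokens listed in that test are NUM. The fourth flag passed in test32 is catenateNumbers per the assertWdf signature, so the two alphabetic tokens are the surprising ones.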