Have you tried adding autoGeneratePhraseQueries="true" to the fieldType, without changing the index analysis behavior?
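For reference, it's just an attribute on the fieldType element in schema.xml. A minimal sketch, assuming a text field built on the StandardTokenizer (the type name and the rest of the analysis chain are placeholders for whatever you already have):

  <fieldType name="text_legal" class="solr.TextField"
             positionIncrementGap="100"
             autoGeneratePhraseQueries="true">
    <analyzer>
      <!-- keep your existing index/query analysis unchanged -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The attribute only changes how the query parser combines the tokens produced from a single query term; nothing about indexing changes.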
This works at query time only, and will convert 12-34 to "12 34", as if the user had entered the query as a phrase. This gives the expected behavior as long as the tokenization is the same at index time and query time. It will work for the 80-IA structure, and I think it will also work for the 9(1)(vii) example (converting it to "9 1 vii"), but I haven't tested it. Also, I would think the 12AA example should already be working as you expect, unless maybe you're already using the WordDelimiterFilterFactory. When I test the StandardTokenizer on 12AA it preserves the string, resulting in just one token, 12aa.

autoGeneratePhraseQueries is at least worth a quick try - it doesn't require reindexing. Two things to note:

1) Don't use autoGeneratePhraseQueries if you have CJK languages; that probably applies to any language that isn't whitespace delimited. You mentioned Indian documents - I presume Hindi - which I don't think will be an issue.

2) In very rare cases you may see a few odd results if the non-alphanumeric characters differ but generate the same phrase query. E.g. 9(1)(vii) would produce the same phrase as 9-1(vii), but this doesn't seem worth considering until you know it's a problem.

On Sun, Dec 8, 2013 at 10:29 AM, Upayavira <u...@odoko.co.uk> wrote:

> If you want to just split on whitespace, then the WhitespaceTokenizer
> will do the job.
>
> However, this will mean that these two tokens aren't the same, and won't
> match each other:
>
> cat
> cat.
>
> A simple regex filter could handle those cases, removing a comma or dot
> at the end of a word. Although there are other similar situations
> (quotes, colons, etc.) that you may want to handle eventually.
>
> Upayavira
>
> On Sun, Dec 8, 2013, at 11:51 AM, Vulcanoid Developer wrote:
> > Thanks for your email.
> >
> > Great, I will look at the WordDelimiterFilterFactory. Just to be clear,
> > I DON'T want any other tokenizing done on digits, special characters,
> > punctuation etc. other than word delimiting on whitespace.
> >
> > All I want for my first version is NO removal of punctuation/special
> > characters at indexing time or at search time, i.e., input as-is and
> > search as-is (like a simple SQL db?). I was assuming this would be a
> > trivial case with Solr and am not sure what I am missing here.
> >
> > thanks
> > Vulcanoid
> >
> > On Sun, Dec 8, 2013 at 4:33 AM, Upayavira <u...@odoko.co.uk> wrote:
> >
> > > Have you tried a WhitespaceTokenizerFactory followed by the
> > > WordDelimiterFilterFactory? The latter is perhaps more configurable in
> > > what it does. Alternatively, you could use a RegexFilterFactory to
> > > remove extraneous punctuation that wasn't removed by the Whitespace
> > > Tokenizer.
> > >
> > > Upayavira
> > >
> > > On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > > > Hi,
> > > >
> > > > I am new to Solr and I guess this is a basic tokenizer question, so
> > > > please bear with me.
> > > >
> > > > I am trying to use Solr to index a few (Indian) legal judgments in
> > > > text form and search against them. One of the key points with these
> > > > documents is that the sections/provisions of law usually have
> > > > punctuation/special characters in them. For example, search queries
> > > > will TYPICALLY be section 12AA, section 80-IA, section 9(1)(vii),
> > > > and the text of the judgments themselves contains this sort of text,
> > > > with section references all over the place.
> > > >
> > > > Now, using a default schema setup with the StandardTokenizer, which
> > > > seems to delimit on whitespace AND punctuation, I get really bad
> > > > results, because it looks like 12AA is split and results containing
> > > > 12 and AA turn up. It becomes worse with 9(1)(vii), with results
> > > > containing 9 and 1 etc. being turned up.
> > > >
> > > > What is the best solution here? I really just want to index the
> > > > document as-is and do whitespace tokenizing on the search, nothing
> > > > more.
> > > >
> > > > So in other words:
> > > > a) I would like the text document to be indexed as-is, with say 12AA
> > > > and 9(1)(vii) stored in the document as they are mentioned.
> > > > b) I would like to be able to search for 12AA and for 9(1)(vii) and
> > > > get proper full matches on them without any splitting up/munging etc.
> > > >
> > > > Any suggestions are appreciated. Thank you for your time.
> > > >
> > > > Thanks
> > > > Vulcanoid
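
For completeness, the whitespace-tokenizer approach Upayavira describes above might look something like this in schema.xml. This is an untested sketch: the type name is a placeholder, I'm assuming PatternReplaceFilterFactory is what's meant by a regex filter for stripping trailing punctuation, and you can drop the LowerCaseFilterFactory if you really want case-sensitive, as-is matching:

  <fieldType name="text_asis" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split on whitespace only: 12AA, 80-IA and 9(1)(vii) stay whole -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- strip a trailing comma or full stop so "cat." matches "cat" -->
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="[.,]+$" replacement="" replace="all"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With that chain applied at both index and query time, 12AA, 80-IA and 9(1)(vii) each come through as single tokens, so searches on them match without any splitting or munging.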