The analysis tab does not support multi-valued fields. It only analyses a single field value.
On Wed, Aug 26, 2015, at 05:05 PM, Erick Erickson wrote:
> bq: my dog
> has fleas
> I wouldn't want some variant of "og ha" to match
>
> Here's where the mysterious "positionIncrementGap" comes in. If you
> make this field "multiValued" and index it like this:
>
> <doc>
>   <field name="blah">my dog</field>
>   <field name="blah">has fleas</field>
> </doc>
>
> or, equivalently, in SolrJ just
>
> doc.addField("blah", "my dog");
> doc.addField("blah", "has fleas");
>
> then the position of "dog" will be 2 and the position of "has" will be
> 102, assuming the positionIncrementGap is the default 100. N.B. I'm not
> sure whether you'll see this in the admin/analysis page or not.
>
> Anyway, now your example won't match across the two parts unless
> you specify a "slop" up in the 101 range.
>
> Best,
> Erick
>
> On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer <r...@networkz.ch> wrote:
> > On 26/08/15 00:24, Erick Erickson wrote:
> >> Hmmm, this sounds like a nonsensical question, but what do you mean
> >> by "arbitrary substring"?
> >>
> >> Because if your substrings consist of whole _tokens_, then ngramming
> >> is totally unnecessary (and gets in the way). Phrase queries with no
> >> slop fulfill this requirement.
> >>
> >> But let's assume you need to match within tokens, i.e. if the doc
> >> contains "my dog has fleas", you need to match input like "as fle";
> >> in this case ngramming is an option.
> >
> > Yeah, the "as fle" thing is exactly what I want to achieve.
> >
> >> You have substantially different index and query time chains. The
> >> result is that the positions for all the grams at index time are the
> >> same in the quick experiment I tried; all were 1. But at query time,
> >> each gram had an incremented position.
> >>
> >> I'd start by using the query time analysis chain for indexing also.
> >> Next, I'd try enclosing multiple words in double quotes at query time
> >> and go from there. What you have now is an anti-pattern in that having
> >> substantially different index and query time analysis chains is not
> >> something that's likely to be very predictable unless you know
> >> _exactly_ what the consequences are.
> >>
> >> The admin/analysis page is your friend; in this case check the
> >> "verbose" checkbox to see what I mean.
> >
> > Hmm, interesting. I had the additional \R tokenizer in the index chain
> > because the document can be multiple lines (but the search text is
> > always a single line), and if the document was
> >
> > my dog
> > has fleas
> >
> > I wouldn't want some variant of "og ha" to match, but I didn't realize
> > it didn't give me any positions like you noticed.
> >
> > I'll try to experiment some more, thanks for the hints!
> >
> > Chris
> >
> >> Best,
> >> Erick
> >>
> >> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
> >>> Hi
> >>>
> >>> I'm trying to build an index for technical documents that basically
> >>> works like "grep", i.e. the user gives an arbitrary substring that
> >>> occurs somewhere in a line of a document, and the exact matches will
> >>> be returned. I specifically want no stemming etc. and want to keep
> >>> all whitespace, parentheses etc. because they might be significant.
> >>> The only normalization is that the search should be case-insensitive.
> >>>
> >>> I tried to achieve this by tokenizing on line breaks, and then
> >>> building trigrams of the individual lines:
> >>>
> >>> <fieldType name="configtext_trigram" class="solr.TextField">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.PatternTokenizerFactory"
> >>>                pattern="\R" group="-1"/>
> >>>     <filter class="solr.NGramFilterFactory"
> >>>             minGramSize="3" maxGramSize="3"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.NGramTokenizerFactory"
> >>>                minGramSize="3" maxGramSize="3"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> Then in the search, I use the edismax parser with mm=100%, so given
> >>> the documents
> >>>
> >>> {"id":"test1","content":"
> >>> encryption
> >>> 10.0.100.22
> >>> description
> >>> "}
> >>>
> >>> {"id":"test2","content":"
> >>> 10.100.0.22
> >>> description
> >>> "}
> >>>
> >>> and the query content:encryption, this will turn into
> >>>
> >>> "parsedquery_toString":
> >>> "+((content:enc content:ncr content:cry content:ryp
> >>> content:ypt content:pti content:tio content:ion)~8)",
> >>>
> >>> and return only the first document. All fine and dandy. But I have a
> >>> problem with possible false positives. If the search is e.g.
> >>>
> >>> content:.100.22
> >>>
> >>> then the generated query will be
> >>>
> >>> "parsedquery_toString":
> >>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
> >>>
> >>> and because all of these tokens are also generated for document test2
> >>> within a proximity of 5, both documents will wrongly be returned.
> >>>
> >>> So somehow I'd need to express the query "content:.10 content:100
> >>> content:00. content:0.2 content:.22" with *the tokens exactly in this
> >>> order and nothing in between*. Is this somehow possible, maybe by
> >>> using the termvectors/termpositions stuff? Or am I trying to do
> >>> something that's fundamentally impossible? Any other good ideas on
> >>> how to achieve this kind of behaviour?
> >>>
> >>> Thanks
> >>> Christian
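
[Editor's note] Putting Erick's suggestion into concrete form: below is a
minimal, untested sketch of the fieldType with the same ngram chain at
index and query time, so every gram gets its own incremented position on
both sides. A single <analyzer> element without a type attribute applies
to both indexing and querying. Note that this sketch drops the \R
PatternTokenizer; a gram that spans a line break then contains the
newline character (e.g. "g\nh" from "dog\nhas"), so it should differ
from the corresponding single-line query gram ("g h") anyway.

<fieldType name="configtext_trigram" class="solr.TextField">
  <!-- no type attribute: same chain for index and query -->
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"
               minGramSize="3" maxGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>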
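
With positions incrementing at index time, Christian's "exactly in this
order and nothing in between" requirement should map onto an ordinary
phrase query with zero slop, which is what Erick's "enclose multiple
words in double quotes" hint amounts to. A sketch, assuming the unified
fieldType above:

q=content:".100.22"&defType=edismax&debugQuery=true

Quoting the input should make edismax build a phrase over the grams
(.10 100 00. 0.2 .22 in adjacent positions) instead of the ~5 proximity
group, so test2 should no longer match; debugQuery=true shows the parsed
form for verification.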
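
For completeness, the gap Erick describes is configured as an attribute
on the fieldType in the schema. A hypothetical declaration matching his
"blah" example (the type name and tokenizer choice here are invented for
illustration; positionIncrementGap itself is standard Solr schema
syntax):

<fieldType name="text_gap" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="blah" type="text_gap" indexed="true" stored="true"
       multiValued="true"/>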