The analysis tab does not support multi-valued fields. It only analyses a single field value.
On Wed, Aug 26, 2015, at 05:05 PM, Erick Erickson wrote:
> bq: my dog
> has fleas
> I wouldn't want some variant of "og ha" to match
>
> Here's where the mysterious "positionIncrementGap" comes in. If you
> make this field "multiValued" and index it like this:
>
> <doc>
>   <field name="blah">my dog</field>
>   <field name="blah">has fleas</field>
> </doc>
>
> or, equivalently, in SolrJ just
>
> doc.addField("blah", "my dog");
> doc.addField("blah", "has fleas");
>
> then the position of "dog" will be 2 and the position of "has" will be
> 102, assuming the positionIncrementGap is the default 100. N.B. I'm not
> sure whether you'll see this in the admin/analysis page or not.
>
> Anyway, now your example won't match across the two parts unless
> you specify a "slop" up in the 101 range.
>
> Best,
> Erick
>
> On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer <r...@networkz.ch> wrote:
> > On 26/08/15 00:24, Erick Erickson wrote:
> >> Hmmm, this sounds like a nonsensical question, but what do you mean
> >> by "arbitrary substring"?
> >>
> >> Because if your substrings consist of whole _tokens_, then ngramming
> >> is totally unnecessary (and gets in the way). Phrase queries with no
> >> slop fulfill this requirement.
> >>
> >> But let's assume you need to match within tokens, i.e. if the doc
> >> contains "my dog has fleas", you need to match input like "as fle";
> >> in this case ngramming is an option.
> >
> > Yeah, the "as fle" thing is exactly what I want to achieve.
> >
> >> You have substantially different index and query time chains. The
> >> result is that the positions for all the grams at index time are the
> >> same in the quick experiment I tried; all were 1. But at query time,
> >> each gram had an incremented position.
> >>
> >> I'd start by using the query time analysis chain for indexing also.
> >> Next, I'd try enclosing multiple words in double quotes at query time
> >> and go from there. What you have now is an anti-pattern in that having
> >> substantially different index and query time analysis chains is not
> >> something that's likely to be very predictable unless you know
> >> _exactly_ what the consequences are.
> >>
> >> The admin/analysis page is your friend; in this case check the
> >> "verbose" checkbox to see what I mean.
> >
> > Hmm, interesting. I had the additional \R tokenizer in the index chain
> > because the document can be multiple lines (but the search text is
> > always a single line), and if the document was
> >
> > my dog
> > has fleas
> >
> > I wouldn't want some variant of "og ha" to match, but I didn't realize
> > it didn't give me any positions like you noticed.
> >
> > I'll try to experiment some more, thanks for the hints!
> >
> > Chris
> >
> >> Best,
> >> Erick
> >>
> >> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
> >>> Hi
> >>>
> >>> I'm trying to build an index for technical documents that basically
> >>> works like "grep", i.e. the user gives an arbitrary substring that
> >>> occurs somewhere in a line of a document, and the exact matches will
> >>> be returned. I specifically want no stemming etc. and want to keep
> >>> all whitespace, parentheses etc. because they might be significant.
> >>> The only normalization is that the search should be case-insensitive.
> >>>
> >>> I tried to achieve this by tokenizing on line breaks, and then
> >>> building trigrams of the individual lines:
> >>>
> >>> <fieldType name="configtext_trigram" class="solr.TextField">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.PatternTokenizerFactory"
> >>>                pattern="\R" group="-1"/>
> >>>     <filter class="solr.NGramFilterFactory"
> >>>             minGramSize="3" maxGramSize="3"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.NGramTokenizerFactory"
> >>>                minGramSize="3" maxGramSize="3"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> Then in the search, I use the edismax parser with mm=100%, so given
> >>> the documents
> >>>
> >>> {"id":"test1","content":"
> >>> encryption
> >>> 10.0.100.22
> >>> description
> >>> "}
> >>>
> >>> {"id":"test2","content":"
> >>> 10.100.0.22
> >>> description
> >>> "}
> >>>
> >>> and the query content:encryption, this will turn into
> >>>
> >>> "parsedquery_toString":
> >>> "+((content:enc content:ncr content:cry content:ryp
> >>> content:ypt content:pti content:tio content:ion)~8)",
> >>>
> >>> and return only the first document. All fine and dandy. But I have a
> >>> problem with possible false positives. If the search is e.g.
> >>>
> >>> content:.100.22
> >>>
> >>> then the generated query will be
> >>>
> >>> "parsedquery_toString":
> >>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
> >>>
> >>> and because all of these tokens are also generated for document test2
> >>> within a proximity of 5, both documents will wrongly be returned.
> >>>
> >>> So somehow I'd need to express the query "content:.10 content:100
> >>> content:00. content:0.2 content:.22" with *the tokens exactly in this
> >>> order and nothing in between*. Is this somehow possible, maybe by
> >>> using the termvectors/termpositions stuff? Or am I trying to do
> >>> something that's fundamentally impossible? Any other good ideas on
> >>> how to achieve this kind of behaviour?
> >>>
> >>> Thanks
> >>> Christian
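
[Editor's note] Putting Erick's suggestion into concrete form: below is a
minimal, untested sketch of the fieldType with the same ngram chain at
index and query time, so every gram gets its own incremented position on
both sides. A single <analyzer> element without a type attribute applies
to both indexing and querying. Note that this sketch drops the \R
PatternTokenizer; a gram that spans a line break then contains the
newline character (e.g. "g\nh" from "dog\nhas"), so it should differ
from the corresponding single-line query gram ("g h") anyway.

<fieldType name="configtext_trigram" class="solr.TextField">
  <!-- no type attribute: same chain for index and query -->
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"
               minGramSize="3" maxGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>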
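
With positions incrementing at index time, Christian's "exactly in this
order and nothing in between" requirement should map onto an ordinary
phrase query with zero slop, which is what Erick's "enclose multiple
words in double quotes" hint amounts to. A sketch, assuming the unified
fieldType above:

q=content:".100.22"&defType=edismax&debugQuery=true

Quoting the input should make edismax build a phrase over the grams
(.10 100 00. 0.2 .22 in adjacent positions) instead of the ~5 proximity
group, so test2 should no longer match; debugQuery=true shows the parsed
form for verification.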
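
For completeness, the gap Erick describes is configured as an attribute
on the fieldType in the schema. A hypothetical declaration matching his
"blah" example (the type name and tokenizer choice here are invented for
illustration; positionIncrementGap itself is standard Solr schema
syntax):

<fieldType name="text_gap" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="blah" type="text_gap" indexed="true" stored="true"
       multiValued="true"/>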