Re: Are analysers applied to each value in a multi-valued field separately?

Jack Krupansky Tue, 16 Jul 2013 09:52:13 -0700

Actually, I appear to be wrong on the position limit filter - it appears to be relative to the string being analyzed and not the full sequence of values analyzed for the field.


Given this field and type:

<fieldType name="text_limit_position4" class="solr.TextField"
           positionIncrementGap="10">
 <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LimitTokenPositionFilterFactory"
           maxTokenPosition="23"/>
 </analyzer>
</fieldType>

<field name="text_limit4" type="text_limit_position4"
      indexed="true" stored="true" multiValued="true" />

And this document:

curl "http://localhost:8983/solr/update?commit=true" \
-H 'Content-type:application/json' -d '
[{"id": "doc-1",
 "title": "Hello World",
 "text_limit4": ["a1 a2 a3 a4", "b1 b2 b3 b4", "c1 c2 c3 c4",
                 "d1 d2 d3 d4", "e1 e2 e3 e4", "f1 f2 f3 f4"]}]'

The hope was that the indexed sequence of terms would stop at c4, but the full values are indexed. These queries succeed:


curl "http://localhost:8983/solr/select/?q=text_limit4:d1"

curl "http://localhost:8983/solr/select/?q=text_limit4:f4"

And this query fails:

curl "http://localhost:8983/solr/select/?q=text_limit4:%22a4+f1%22~65"

While this query succeeds:

curl "http://localhost:8983/solr/select/?q=text_limit4:%22a4+f1%22~66"

Indicating that the position gaps of 10 are there between each value, but the token position limit filter doesn't trigger.
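The slop threshold of 66 can be checked with a little arithmetic. The following is only a rough Python sketch, not Lucene code: the function name `token_positions` is made up for illustration, and it assumes that the first token of the field sits at position 0, that tokens within a value are one position apart, and that the gap of 10 is added once between consecutive values.

```python
def token_positions(values, gap):
    """Assign an absolute position to every token of a multi-valued field.

    Simplified model (an assumption, not Lucene's implementation):
    the first token is at position 0, each further token in a value
    advances by 1, and `gap` is added once between consecutive values.
    """
    positions = {}
    next_pos = 0
    for i, value in enumerate(values):
        if i > 0:
            next_pos += gap  # the positionIncrementGap between values
        for token in value.split():
            positions[token] = next_pos
            next_pos += 1
    return positions


values = ["a1 a2 a3 a4", "b1 b2 b3 b4", "c1 c2 c3 c4",
          "d1 d2 d3 d4", "e1 e2 e3 e4", "f1 f2 f3 f4"]
pos = token_positions(values, gap=10)

# An in-order two-term phrase needs a slop of (distance - 1) to match.
min_slop = pos["f1"] - pos["a4"] - 1
print(pos["a4"], pos["f1"], min_slop)  # 3 70 66
```

Under those assumptions a4 lands at position 3 and f1 at position 70, so the phrase "a4 f1" needs a slop of at least 66, which agrees with ~65 failing and ~66 succeeding.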


This document:

curl "http://localhost:8983/solr/update?commit=true" \
-H 'Content-type:application/json' -d '
[{"id": "doc-1",
 "title": "Hello World",
 "text_limit4": "a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26"}]'


Fails on this query:

curl "http://localhost:8983/solr/select/?q=text_limit4:a24"

But succeeds on this query:

curl "http://localhost:8983/solr/select/?q=text_limit4:a23"

Indicating that the token position limit filter does work, but only for the relative position, making it not much more useful than the token count limit filter.
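The two experiments are consistent with a limit that restarts its count for every value. Here is a minimal Python sketch of that behaviour. It is a simplified model of what was observed, not the filter's actual code: `limit_per_value` is a hypothetical name, and the model assumes the filter counts positions from 1 within each analyzed value and keeps tokens up to and including the maximum.

```python
def limit_per_value(values, max_token_position):
    """Keep tokens whose position *within their own value* is within the limit.

    Assumption: counting restarts at 1 for each value, which is what the
    observed query results suggest.
    """
    kept = []
    for value in values:
        for rel_pos, token in enumerate(value.split(), start=1):
            if rel_pos <= max_token_position:
                kept.append(token)
    return kept


# First document: six values of four tokens each.
multi = ["a1 a2 a3 a4", "b1 b2 b3 b4", "c1 c2 c3 c4",
         "d1 d2 d3 d4", "e1 e2 e3 e4", "f1 f2 f3 f4"]
kept_multi = limit_per_value(multi, 23)

# Second document: a single value of 26 tokens, a1 through a26.
single = [" ".join(f"a{i}" for i in range(1, 27))]
kept_single = limit_per_value(single, 23)

print(len(kept_multi), kept_multi[-1])    # 24 f4
print(len(kept_single), kept_single[-1])  # 23 a23
```

In this model no value of the first document ever reaches position 23, so nothing is dropped and d1 and f4 stay searchable, while the single long value is cut after a23.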


Oh well.

-- Jack Krupansky

-----Original Message----- From: Daniel Collins
Sent: Tuesday, July 16, 2013 12:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Are analysers applied to each value in a multi-valued field
separately?


Self-correction, we'd need to set LimitTokenPositionFilterFactory to "PI
+ N" to give the results above because of the increment gap between values.


On 16 July 2013 17:16, Daniel Collins <danwcoll...@gmail.com> wrote:

Thanks Jack.

There seems to be a never-ending set of FilterFactories, I keep hearing
about new ones all the time :)

Ok, I get it, so our existing code is the first N tokens of each value,
and using LimitTokenPositionFilterFactory with the same number would
give us the first N of the combined set of tokens, that's good to know.



On 16 July 2013 14:15, Jack Krupansky <j...@basetechnology.com> wrote:

Yes, each input value is analyzed separately. Solr passes each input
value to Lucene and then Lucene analyzes each.

You could use LimitTokenPositionFilterFactory which uses the absolute
token position - each successive analyzed value would have an incremented
position, plus the positionIncrementGap (typically 100 for text.)

-- Jack Krupansky

-----Original Message----- From: Daniel Collins
Sent: Tuesday, July 16, 2013 8:46 AM
To: solr-user@lucene.apache.org
Subject: Are analysers applied to each value in a multi-valued field
separately?


I'm guessing the answer is yes, but here's the background.

We index 2 separate fields, headline and body text for a document, and
then we want to identify the "top" of the story which is the headline + N
words of the body (we want to weight that in scoring).

So to do that:

<copyField src="headline" dest="top"/>
<copyField src="body" dest="top"/>

And the "top" field has a LimitTokenCountFilterFactory appended to it to
do the limiting.

       <filter class="solr.LimitTokenCountFilterFactory"
               maxTokenCount="N"/>

I realised that top needs to be multi-valued, which got me thinking: is
that N tokens PER VALUE of top or N tokens in total within the top
field...

The field is indexed but not stored, so it's hard to determine exactly
which is being done.

Logically, I presume each value in the field is independent (and Solr then
just matches searches against each one), so that would suggest N is per
value?

Cheers, Daniel
