Re: sort by field length

Erick Erickson Wed, 26 May 2010 07:24:09 -0700

Take a look at the scoring algorithm on the Wiki; it already takes
field length into account, albeit modified by how many times the term
is mentioned in the field. So a field with 5 terms and one match
will score higher than one with 10 terms and one match. Where
it lands with 10 terms and 2 matches I leave as an exercise for
the reader.
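
As a rough illustration of the point above, here is a minimal Python sketch
of the two per-field factors involved, assuming the default similarity's
formulas tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms). The function
names are made up for this example and are not Lucene's API; idf, boosts,
queryNorm and the lossy one-byte norm encoding are all left out, so read
the numbers as relative weights, not as real scores.

```python
import math


def tf(freq):
    # Assumed default term-frequency factor: square root of the match count.
    return math.sqrt(freq)


def length_norm(num_terms):
    # Assumed default length normalization: shorter fields weigh more.
    return 1.0 / math.sqrt(num_terms)


def field_factor(freq, num_terms):
    # Hypothetical helper: the part of the score that depends on field
    # length and term frequency only (everything else held equal).
    return tf(freq) * length_norm(num_terms)


if __name__ == "__main__":
    # (matches, terms in field) for the three cases discussed above.
    for freq, num_terms in [(1, 5), (1, 10), (2, 10)]:
        factor = field_factor(freq, num_terms)
        print(f"{freq} match(es) in {num_terms} terms: {factor:.3f}")
```

With these simplified formulas the 10-term, 2-match case comes out level
with the 5-term, 1-match case; in a real index the one-byte norm encoding
may nudge that one way or the other.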


I really think you're reinventing the wheel here and looking at the
default scoring mechanism would be a good use of your time.

Best
Erick

On Wed, May 26, 2010 at 4:04 AM, Sascha Szott <sz...@zib.de> wrote:

> Hi Erick,
>
> Erick Erickson wrote:
>
>> Ah, I may have misunderstood, I somehow got it in my mind
>> you were talking about the length of each term (as in string length).
>>
>> But if you're looking at the field length as the count of terms, that's
>> another question, sorry for the confusion...
>>
>> I have to ask, though, why you want to sort this way? The relevance
>> calculations already factor in both term frequency and field length.
>> What's the use-case for sorting by field length given the above?
>>
> It's not a real world use-case -- I just want to get a better understanding
> of the data I'm indexing (so performance is a negligible concern). In my
> current use case, you can think of the field length as an indicator of data
> quality (i.e., the longer the field content, the worse the quality is).
> Being able to sort the field data in order of decreasing length would allow
> me to investigate "exceptional" data items that are not appropriately
> handled by my curation process.
>
> Best,
> Sascha
>
>
>
>> Best
>> Erick
>>
>> On Tue, May 25, 2010 at 3:40 AM, Sascha Szott <sz...@zib.de> wrote:
>>
>>> Hi Erick,
>>>
>>>
>>> Erick Erickson wrote:
>>>
>>>> Are you sure you want to recompute the length when sorting?
>>>> It's the classic time/space tradeoff, but I'd suggest that when
>>>> your index is big enough to make taking up some more space
>>>> a problem, it's far too big to spend the cycles calculating each
>>>> term length for sorting purposes considering you may be
>>>> sorting all the terms in your index worst-case.
>>>>
>>> Good point, thank you for the clarification. I "thought" that Lucene
>>> internally stores the field length (e.g., in order to compute the
>>> relevance) and getting this information at query time requires only a
>>> simple lookup.
>>>
>>> -Sascha
>>>
>>>
>>>
>>>> But you could consider payloads for storing the length, although
>>>> that would still be redundant...
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Mon, May 24, 2010 at 8:30 AM, Sascha Szott <sz...@zib.de> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> is it possible to sort by field length without having to (redundantly)
>>>>> save the length information in a separate index field? At first, I
>>>>> thought I could accomplish this using a function query, but I couldn't
>>>>> find an appropriate one.
>>>>>
>>>>> Thanks in advance,
>>>>> Sascha
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>
