Hi Naomi,

Try fixing your data. :-)

No, really:

1 - Sort all of your call numbers using whatever sort makes sense to you.

2 - Assign them - in your sort order - sort keys that are floats, starting:
    0.01
    0.02
    ...
    1.01
    1.02
    ...
    79999.98
    79999.99
This should approx. cover all of your 8M items.

3 - To guarantee at least 10 items, query an inclusive range that
spans 10 consecutive sort keys:
     sortKey:[234.87 TO 234.96]

4 - OK, as your library is not static, and new items are not added to
the end of the call numbers but interleaved, you need to be able to
deal with this. Here is how.
   a - Add the new item to your call numbers.
   b - Sort by call numbers, as in #1 above.
   c - Find the sort keys of the two items on either side of your new
item. Let's say these are: 3288.78 and 3288.79. You assign the new
item a sort key with one more decimal place than the other two,
halfway between them: 3288.785. This places it between the other 2
items, sort-wise.
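
Steps 1 to 4 could be sketched roughly as follows in Python. This is
only an illustration under some assumptions: the function names
(assign_keys, window_query, key_between) are made up here,
decimal.Decimal stands in for the float field so that the midpoint
arithmetic stays exact, and the sample call numbers are borrowed from
the quoted message below and assumed to be in shelf order already.

```python
from decimal import Decimal

STEP = Decimal("0.01")  # spacing between neighbouring sort keys (step 2)


def assign_keys(sorted_call_numbers):
    """Steps 1-2: give each call number, already in shelf order, a sort key."""
    return {cn: STEP * (i + 1) for i, cn in enumerate(sorted_call_numbers)}


def window_query(start_key, count=10, field="sortKey"):
    """Step 3: inclusive range query spanning `count` originally assigned keys."""
    end_key = start_key + STEP * (count - 1)
    return f"{field}:[{start_key} TO {end_key}]"


def key_between(lower, upper):
    """Step 4c: a key strictly between two neighbouring keys (their midpoint)."""
    if not lower < upper:
        raise ValueError("lower must sort before upper")
    return (lower + upper) / 2


keys = assign_keys(["A123 B34 1970", "A123 B34 1980", "A123 B44"])
print(keys["A123 B44"])
print(window_query(Decimal("234.87")))
print(key_between(Decimal("3288.78"), Decimal("3288.79")))
```

With plain binary floats the midpoint may not be stored exactly, which
is one reason the sketch uses Decimal; a real float field in the index
would need its precision checked before relying on repeated halving.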

Of course, the distribution of additions is not even, so over time
some query ranges will start returning many more than 10 items. When
this starts to be a problem, re-sort everything and re-assign the sort
keys.
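
One possible way to decide when "this starts to be a problem" is to
count how many items fall inside a nominal 10-key window and re-number
once some window grows past a limit. The sketch below is hypothetical:
the names window_size, needs_reassignment and reassign, and the limit
of 50, are assumptions for illustration, and the keys are held in a
sorted in-memory list rather than in the index.

```python
from bisect import bisect_left, bisect_right
from decimal import Decimal

STEP = Decimal("0.01")  # original spacing between sort keys


def window_size(sorted_keys, start_key, span=10):
    """Count items whose key lies in [start_key, start_key + (span - 1) * STEP]."""
    end_key = start_key + STEP * (span - 1)
    return bisect_right(sorted_keys, end_key) - bisect_left(sorted_keys, start_key)


def needs_reassignment(sorted_keys, span=10, limit=50):
    """True if any window starting at an existing key holds more than `limit` items."""
    return any(window_size(sorted_keys, k, span) > limit for k in sorted_keys)


def reassign(sorted_keys):
    """Re-number all keys evenly; returns a mapping of old key -> new key."""
    return {old: STEP * (i + 1) for i, old in enumerate(sorted_keys)}


# Twenty evenly spaced keys plus three items interleaved after the first one.
keys = sorted([STEP * i for i in range(1, 21)]
              + [Decimal("0.0125"), Decimal("0.015"), Decimal("0.0175")])
print(window_size(keys, Decimal("0.01")))
print(needs_reassignment(keys, limit=12))
```

Scanning every key like this is only reasonable as an occasional
offline check; the limit is a tuning choice, not something the thread
specifies.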

Oh, you will need a mapping from call numbers to the sort keys (and
perhaps vice versa). Just use a Lucene index as a hash lookup.
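
To make the lookup idea concrete, here is a small in-memory stand-in
for the two-way mapping. It is not Lucene code: the KeyLookup class and
its method names are hypothetical, and a pair of dicts plays the role
that the Lucene index would play as a hash lookup.

```python
from decimal import Decimal


class KeyLookup:
    """Two-way mapping between call numbers and sort keys (in-memory stand-in)."""

    def __init__(self):
        self._key_by_call_number = {}
        self._call_number_by_key = {}

    def put(self, call_number, sort_key):
        """Store or update the sort key for a call number."""
        old_key = self._key_by_call_number.get(call_number)
        if old_key is not None:
            # Drop the stale reverse entry when a key is re-assigned.
            del self._call_number_by_key[old_key]
        self._key_by_call_number[call_number] = sort_key
        self._call_number_by_key[sort_key] = call_number

    def key_for(self, call_number):
        return self._key_by_call_number[call_number]

    def call_number_for(self, sort_key):
        return self._call_number_by_key[sort_key]


lookup = KeyLookup()
lookup.put("A123 B34 1970", Decimal("3288.78"))
print(lookup.key_for("A123 B34 1970"))
print(lookup.call_number_for(Decimal("3288.78")))
```

The forward direction turns the call number a user starts from into a
sort key for the range query; the reverse direction is only needed if
results carry sort keys without their call numbers.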

Is this what you needed?

-glen  :-)

2008/11/28 Naomi Dushay <[EMAIL PROTECTED]>:
> The point isn't really how the exact sort works - it's the performance
> issues, coupled with an unpredictable distribution along the entire possible
> sort space.
>
> the sort works
> the range queries work
> the performance sucks
>
> and I haven't thought of a clever work around.
>
> - Naomi
>
> On Nov 27, 2008, at 9:41 AM, Alexander Ramos Jardim wrote:
>
>> I did not even understand what you are considering to be the order on your
>> call numbers.
>>
>> 2008/11/26 Naomi Dushay <[EMAIL PROTECTED]>
>>
>>> I have a performance problem and I haven't thought of a clever way around
>>> it.
>>>
>>> I work at the Stanford University Libraries.  We have a collection of
>>> over
>>> 8 million items.  Each item has a call number.  I have been asked to
>>> provide
>>> a way to browse forward and backward from an arbitrary call number.
>>>
>>> I have managed to create fields that present the call numbers in
>>> appropriate sorts, both forward and reverse.  (This is necessary because
>>> raw
>>> call numbers don't sort properly:   A123 AZ27 B99 B999 BBB111111).
>>>
>>> We can ignore the reverse sorted range query problem;  it's the same as
>>> the
>>> forward sorted range query.
>>>
>>> So I use a query like this:
>>>
>>> sortCallNum:["A123 B34 1970" TO *]&rows=10.
>>>
>>>
>>> Call numbers are squirrelly, so we can't predict the string that will
>>> appropriately grab at least 10 subsequent documents.  They are certainly
>>> not
>>> consecutive!
>>>
>>> so from
>>> A123 B34 1970
>>>
>>> we're unable to predict if any of these will return at least 10 results:
>>>
>>> A123 B34 1980  or
>>> A123 B34 V.8  or
>>> A123 B44 or
>>> A123 B67 or
>>> A123 C27 or
>>> A124* or
>>> A22* or
>>> AA* or
>>>
>>> You get the idea.
>>>
>>> I have not figured out a way to efficiently query for "the next 10 call
>>> numbers in sort order".  I have also mucked about with the cache
>>> initialization, but that's not working either:
>>>
>>>  <listener event="firstSearcher" class="solr.QuerySenderListener">
>>>    <arr name="queries">
>>>      <!-- populate query result cache for sorted queries -->
>>>      <lst>
>>>              <str name="q">shelfkey:[0 TO *]</str>
>>>              <str name="sort">shelfkey asc</str>
>>>      </lst>
>>>    </arr>
>>>  </listener>
>>>
>>> Can anyone help me with this?
>>>
>>> - Naomi
>>>
>>>
>>
>>
>> --
>> Alexander Ramos Jardim
>
> Naomi Dushay
> [EMAIL PROTECTED]
>
>
>
>


