Get Recently Added/Updated Documents

Lyuba Romanchuk Tue, 15 Mar 2016 11:45:30 -0700

Hi,

I have the following scenario:


   - there are 2 machines running Solr 4.8.1
   - the two machines are in different time zones
   - the clocks on the two machines are not synchronized

An auto-refresh query that runs every X-2 seconds should return the
documents from the last X seconds, and its performance impact should be as
low as possible (ideally, it should take less than a second).

First of all, I added a first-component that overrides the NOW param set by
the main shard, so that each Solr machine uses its own local NOW time.
I also added a new custom function:
recent_docs(ms_since_now(_version_),X) =
recip(ms(NOW, _version_ converted to milliseconds), 0.01/X, 1, 1).
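
For reference, here is a minimal Python sketch of how this score behaves. It
assumes recip(x,m,a,b) = a/(m*x+b), as in Solr's built-in recip, and that
ms_since_now is simply NOW minus a millisecond timestamp. version_to_ms is a
hypothetical helper: it assumes that _version_ holds the millisecond clock
shifted left by 20 bits, which should be checked against the Solr source in
use. The timestamp and _version_ values are made up for illustration.

```python
def recip(x, m, a, b):
    # Reciprocal in the shape of Solr's recip(x,m,a,b): a / (m*x + b).
    return a / (m * x + b)


def version_to_ms(version):
    # ASSUMPTION: the high bits of _version_ are epoch milliseconds shifted
    # left by 20 bits. Hypothetical helper; verify against your Solr version.
    return version >> 20


def recent_docs(ms_since_now, x):
    # Score is 1.0 for a document updated right now and decays with age.
    return recip(ms_since_now, 0.01 / x, 1, 1)


def max_age_ms(x, min_score=0.1):
    # Age at which the score drops to min_score:
    # solve 1 / (0.01 * age / x + 1) = min_score for age.
    return (1.0 / min_score - 1.0) * x / 0.01


if __name__ == "__main__":
    now_ms = 1458067530000             # example NOW in epoch milliseconds
    version = (now_ms - 5000) << 20    # made-up _version_, 5 s old under the assumption
    age = now_ms - version_to_ms(version)
    print(age, recent_docs(age, 10), max_age_ms(10))
```

With the formula exactly as written, the score falls to 0.1 at an age of
900*X milliseconds, so if X is given in seconds, the window [0.1, 1] covers
roughly the last 0.9*X seconds rather than exactly X.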

Then I thought of 2 possible solutions, but each one has a disadvantage,
and now I am trying to decide which one is best.
And maybe there are other solutions that I didn't think of.

   1. *Solution 1*: boost on the _version_ field like this:
   q={!boost b=recent_docs(ms_since_now(_version_),X)}*:*
      1. I use _version_ because I need to receive the recently updated
      documents, and the time of the document itself shouldn't be changed.
      I also saw in the code that _version_ is calculated based on the
      time.
      2. This is good for sorting, because all documents are sorted by
      score. But in this case all documents match, and I need to return
      only documents with a score in [0.1, 1]. I could filter by the
      _version_ field, but I prefer not to for performance reasons.
      3. *Questions*:
         1. What is the performance impact of such scoring?
         2. *How can I return only documents with a score from 0.1 to 1*?
   2. *Solution 2*: use a function query as a filter like this:
   fq={!frange l=0.1 u=1}recent_docs(ms_since_now(_version_),X)
      1. In this case only the relevant documents are returned, but they
      are not sorted, and sorting by _version_ or adding scoring does not
      seem efficient, because the same function would then be calculated
      twice.
      2. It seems that this function query has a very high performance
      impact on large cores with hundreds of millions of documents.
      3. *Questions*:
         1. *What is the most efficient way to sort the returned documents
         without calculating the same function twice*?
         2. What is the performance impact of such a filter query, and is
         the FieldCache used?
         3. Could it drastically increase Solr's memory consumption on
         heavily updated cores with millions of documents?
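
To make the two request shapes above concrete, here is a rough Python sketch
that only builds the query parameters. recent_docs and ms_since_now are the
custom functions from this post, not built-in Solr functions. The helper
names boost_params and frange_params are hypothetical, and using q=*:* as
the main query in Solution 2 is an assumption.

```python
from urllib.parse import urlencode

# The custom function call from this post; {x} is the window parameter X.
FUNC = "recent_docs(ms_since_now(_version_),{x})"


def boost_params(x):
    # Solution 1: every document matches; the score carries the recency.
    return {"q": "{!boost b=" + FUNC.format(x=x) + "}*:*"}


def frange_params(x, lower=0.1, upper=1):
    # Solution 2: keep only documents whose function value lies in
    # [lower, upper]; the results are not ordered by recency.
    fq = "{!frange l=%s u=%s}%s" % (lower, upper, FUNC.format(x=x))
    return {"q": "*:*", "fq": fq}  # ASSUMPTION: match-all main query


if __name__ == "__main__":
    # URL-encoded query strings, ready to append to a select handler URL.
    print(urlencode(boost_params(10)))
    print(urlencode(frange_params(10)))
```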


Any assistance, suggestions, or comments would be greatly appreciated.

Thank you.

Best regards,
Lyuba
