I don't have a very specific suggestion, but I wonder if you could keep a data 
structure outside the main index that holds only these dates.  
Presumably this smaller structure would be simpler and faster to update, and 
you'd just have to keep it in sync with the main index (a document-to-document 
mapping).  I think ParallelReader in Lucene takes a similar approach, as does 
Solr's ExternalFileField.
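
To make the idea a bit more concrete, here is a rough sketch of such a side 
structure in Python. It is only an illustration, not how ParallelReader or 
ExternalFileField work internally. The class and method names 
(LastViewedStore, touch, filter_and_sort, to_key_value_lines) are made up for 
this example, and the "id=value" line format is my assumption about what an 
external field file looks like, so check the Solr documentation before relying 
on it.

```python
from typing import Dict, Iterable, List, Optional


class LastViewedStore:
    """Hypothetical side structure: document id -> last-viewed epoch seconds.

    It lives outside the main index, so recording a view never requires
    reloading or re-indexing the full document.
    """

    def __init__(self) -> None:
        self._last_viewed: Dict[str, int] = {}

    def touch(self, doc_id: str, viewed_at: int) -> None:
        # Keep only the newest timestamp seen for each document.
        current = self._last_viewed.get(doc_id)
        if current is None or viewed_at > current:
            self._last_viewed[doc_id] = viewed_at

    def remove(self, doc_id: str) -> None:
        # Call this when a document is deleted from the main index,
        # so the two structures stay in sync.
        self._last_viewed.pop(doc_id, None)

    def filter_and_sort(
        self, doc_ids: Iterable[str], since: Optional[int] = None
    ) -> List[str]:
        # doc_ids are the hits returned by the main index for the other
        # search criteria. Documents never viewed are treated as time 0.
        hits = [(self._last_viewed.get(d, 0), d) for d in doc_ids]
        if since is not None:
            hits = [h for h in hits if h[0] >= since]
        hits.sort(key=lambda h: h[0], reverse=True)  # most recent first
        return [d for _, d in hits]

    def to_key_value_lines(self) -> List[str]:
        # Assumed export format: one "id=value" line per document.
        return [f"{d}={ts}" for d, ts in sorted(self._last_viewed.items())]


store = LastViewedStore()
store.touch("a", 100)
store.touch("b", 300)
store.touch("c", 200)
store.touch("b", 50)  # older than the stored value, so it is ignored

recent_first = store.filter_and_sort(["a", "b", "c"])
viewed_since_150 = store.filter_and_sort(["a", "b", "c", "d"], since=150)
```

Each view then costs one small map update instead of a full document 
re-index. The price is that filtering and sorting on the date happen as a 
separate step over the hits from the main index, which may or may not be 
acceptable for your scoring needs.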

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Development Team <dev.and...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 3, 2009 4:46:37 PM
> Subject: Suggestions needed: Lots of updates for tiny changes
> 
> Hi everybody,
>      Let's say I had an index with 10M large-ish documents, and as people
> logged into a website and viewed them the "last viewed date" was updated to
> the current time. We index a document's last-viewed-date because we allow
> users to a) search on this last-viewed-date alongside all other searchable
> criteria, and b) we can order results of any search by the last-viewed-date.
>      The problem is that in a given 5-minute period, we may have many
> thousands of updated documents (due to this simple last-viewed-date). We
> have a task that looks for changed documents, loads the full documents, and
> then feeds them into Solr to update the index, but unfortunately reading
> these changed documents and continually feeding them to Solr is generating
> *far* more load on our system (both Solr and the database) than any of the
> searches. In a given day, *we may have more updates to documents than we
> have total documents indexed*. (Databases don't handle this well either, the
> contention on rows for updates slows the database down significantly.)
>      How should we approach this problem? It seems like such a waste of
> resources to be doing so much work in applications/database/solr only for
> last-viewed-dates.
> 
>      Solutions we've looked at include:
>      1) Update only partial document. --Apparently this isn't supported in
> Solr yet (we're using nightly Solr 1.4 builds currently).
>      2) Use "near-real-time updates". --Not supported yet. Also, the
> "freshness" of the data isn't as much of a concern as the sheer volume of
> changes that we have to make here. For example, we could update Solr
> less frequently, but then we'd just have many more documents to update. The
> data only has to be, say, fresh to within 30 minutes.
>      3) Use a separate index for the last-viewed-date. --This won't work
> because we need to search on the last-viewed-date alongside other criteria,
> and we use it as scoring criteria for all our searches.
> 
>      Any suggestions?
> 
> Sincerely,
> 
>      Daryl.
