I don't have a very specific suggestion, but I wonder if you could have a data structure that lives outside of the main index and keeps only these dates. Presumably this smaller data structure would be simpler and faster to update, and you'd just have to keep it in sync with the main index (a document-to-document mapping). I think ParallelReader in Lucene takes a similar approach, as does Solr's ExternalFileField.
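
To make the idea a bit more concrete, here is a rough sketch in plain Python (not Solr or Lucene code). The class name LastViewedStore and its methods are hypothetical, made up for illustration; the only assumption is that the side structure and the main index share the same document IDs.

```python
import time
from typing import Dict, Iterable, List, Optional


class LastViewedStore:
    """Hypothetical side structure that holds only last-viewed dates.

    It lives outside the main index and is keyed by the same document ID,
    so a "view" touches one small entry instead of re-indexing a large document.
    """

    def __init__(self) -> None:
        # doc_id -> last-viewed time as seconds since the epoch
        self._last_viewed: Dict[str, float] = {}

    def touch(self, doc_id: str, when: Optional[float] = None) -> None:
        """Record a view; defaults to the current time."""
        self._last_viewed[doc_id] = time.time() if when is None else when

    def remove(self, doc_id: str) -> None:
        """Keep in sync with the main index when a document is deleted."""
        self._last_viewed.pop(doc_id, None)

    def filter_since(self, doc_ids: Iterable[str], since: float) -> List[str]:
        """Keep only the IDs viewed at or after `since`, preserving input order."""
        return [d for d in doc_ids if self._last_viewed.get(d, 0.0) >= since]

    def sort_by_last_viewed(self, doc_ids: Iterable[str]) -> List[str]:
        """Order IDs most recently viewed first; never-viewed IDs sort last."""
        return sorted(
            doc_ids,
            key=lambda d: self._last_viewed.get(d, 0.0),
            reverse=True,
        )
```

The IDs returned by a search on the other criteria would be passed through filter_since or sort_by_last_viewed. This only illustrates the bookkeeping; it does not show how to plug the dates into Solr's scoring, which is what something like ExternalFileField would be for.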

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Development Team <dev.and...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 3, 2009 4:46:37 PM
> Subject: Suggestions needed: Lots of updates for tiny changes
>
> Hi everybody,
> Let's say I had an index with 10M large-ish documents, and as people
> logged into a website and viewed them the "last viewed date" was updated to
> the current time. We index a document's last-viewed-date because we allow
> users to a) search on this last-viewed-date alongside all other searchable
> criteria, and b) we can order results of any search by the last-viewed-date.
> The problem is that in a given 5-minute period, we may have many
> thousands of updated documents (due to this simple last-viewed-date). We
> have a task that looks for changed documents, loads the full documents, and
> then feeds them into Solr to update the index, but unfortunately reading
> these changed documents and continually feeding them to Solr is generating
> *far* more load on our system (both Solr and the database) than any of the
> searches. In a given day, *we may have more updates to documents than we
> have total documents indexed*. (Databases don't handle this well either, the
> contention on rows for updates slows the database down significantly.)
> How should we approach this problem? It seems like such a waste of
> resources to be doing so much work in applications/database/solr only for
> last-viewed-dates.
>
> Solutions we've looked at include:
> 1) Update only partial document. --Apparently this isn't supported in
> Solr yet (we're using nightly Solr 1.4 builds currently).
> 2) Use "near-real-time updates". --Not supported yet. Also, the
> "freshness" of the data isn't as much of a concern as the sheer volume of
> changes that we have to make here. For example, we could update Solr
> less frequently, but then we'd just have many more documents to update. The
> data only has to be, say, fresh to within 30 minutes.
> 3) Use a separate index for the last-viewed-date. --This won't work
> because we need to search on the last-viewed-date alongside other criteria,
> and we use it as scoring criteria for all our searches.
>
> Any suggestions?
>
> Sincerely,
>
> Daryl.