Hi Otis,
     Thanks for your reply and for giving it some thought.
     Actually, we have considered using something that lives outside of the
main index. We looked into ExternalFileField, but abandoned it once it became
clear that its values can only be used through a function query, which
limited how we could use the field in our searches.
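
     For reference, here is a rough sketch of how the external file could be
generated. As far as I understand it, Solr reads ExternalFileField values
from a file named external_<fieldname> in the index data directory, one
key=value line per document, with the values parsed as floats. The field
name "lastViewed", the document keys, and the helper itself are made up for
illustration; this is not code we actually run.

```python
import os
import tempfile


def write_external_file(data_dir, field_name, values):
    """Write key=value lines for an external field (hypothetical helper).

    values maps each document's unique key to a number, here epoch seconds.
    """
    path = os.path.join(data_dir, "external_" + field_name)
    # Write to a temp file first, then rename it into place, so a reader
    # never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir)
    with os.fdopen(fd, "w") as out:
        for doc_key in sorted(values):
            # Values are written as floats because that is how the
            # external file is assumed to be parsed.
            out.write("%s=%s\n" % (doc_key, float(values[doc_key])))
    os.replace(tmp_path, path)
    return path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    print(write_external_file(demo_dir, "lastViewed", {"doc1": 1246650000}))
```

Rewriting this file is far cheaper than reindexing full documents, but the
values are then only reachable from function queries, which is the limitation
described above.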
     For another, more real-time data problem we're having, we've considered
writing a search handler and search component to handle it as a filter
query. This is equivalent to the "data structure outside of the main index"
that you proposed. The problem with it is that getting it to be
*part of the index* is difficult.
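
     To illustrate the idea (not Solr code, just a minimal sketch in plain
Python with made-up names such as LastViewedIndex and apply_filter): the
dates live in a separate in-memory map that is cheap to update, and a search
only consults it to filter hits that the main index has already ranked.

```python
class LastViewedIndex:
    """Doc id -> last-viewed time, kept outside the main index."""

    def __init__(self):
        self._by_doc = {}

    def record_view(self, doc_id, viewed_at):
        # Cheap update: no document reload and no reindexing.
        self._by_doc[doc_id] = viewed_at

    def docs_viewed_between(self, start, end):
        # Doc ids whose last view falls in [start, end]; this is the filter set.
        return {d for d, t in self._by_doc.items() if start <= t <= end}


def apply_filter(search_hits, allowed):
    # Keep the main-index hits, in their existing ranked order, that pass.
    return [doc_id for doc_id in search_hits if doc_id in allowed]


if __name__ == "__main__":
    idx = LastViewedIndex()
    idx.record_view("a", 100)
    idx.record_view("b", 200)
    idx.record_view("a", 300)  # a later view replaces the earlier one
    print(apply_filter(["c", "b", "a"], idx.docs_viewed_between(150, 350)))
```

In this sketch the dates can narrow a result set, but they never reach
scoring or sorting, which is the "part of the index" gap.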
     Well... any more ideas would be appreciated. But thanks for your help
so far.

- Daryl.


On Fri, Jul 3, 2009 at 9:34 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

>
> I don't have a very specific suggestion, but I wonder if you could have a
> data structure that lives outside of the main index and keeps only these
> dates.  Presumably this smaller data structure would be simpler/faster to
> update, and you'd just have to remain in sync with the main index
> (document-document mapping).  I think ParallelReader in Lucene is a similar
> approach, as is Solr's ExternalFileField.
>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Development Team <dev.and...@gmail.com>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, July 3, 2009 4:46:37 PM
> > Subject: Suggestions needed: Lots of updates for tiny changes
> >
> > Hi everybody,
> >      Let's say I had an index with 10M large-ish documents, and as people
> > logged into a website and viewed them, the "last viewed date" was updated
> > to the current time. We index a document's last-viewed-date because a) we
> > allow users to search on it alongside all other searchable criteria, and
> > b) we can order the results of any search by it.
> >      The problem is that in a given 5-minute period, we may have many
> > thousands of updated documents (due to this simple last-viewed-date). We
> > have a task that looks for changed documents, loads the full documents,
> > and then feeds them into Solr to update the index. Unfortunately, reading
> > these changed documents and continually feeding them to Solr generates
> > *far* more load on our system (both Solr and the database) than any of
> > the searches. In a given day, *we may have more updates to documents than
> > we have total documents indexed*. (Databases don't handle this well
> > either; the contention on rows for updates slows the database down
> > significantly.)
> >      How should we approach this problem? It seems like such a waste of
> > resources to be doing so much work in applications/database/solr only for
> > last-viewed-dates.
> >
> >      Solutions we've looked at include:
> >      1) Update only part of a document. --Apparently this isn't supported
> > in Solr yet (we're currently using nightly Solr 1.4 builds).
> >      2) Use "near-real-time updates". --Not supported yet. Also, the
> > "freshness" of the data isn't as much of a concern as the sheer volume of
> > changes that we have to make here. For example, we could update Solr
> > less frequently, but then we'd just have many more documents to update.
> > The data only has to be, say, fresh to within 30 minutes.
> >      3) Use a separate index for the last-viewed-date. --This won't work
> > because we need to search on the last-viewed-date alongside other
> > criteria, and we use it as a scoring criterion for all our searches.
> >
> >      Any suggestions?
> >
> > Sincerely,
> >
> >      Daryl.
>
>
