On 7/26/2013 11:02 PM, Joe Zhang wrote:
> I have an ever-growing solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
>
> - Exhaustive: I have to traverse every single document
> - Incremental: in other words, it has to allow me to divide and conquer ---
> if I have processed the first 20k docs, next time I can start with 20001.

If your index isn't very big, a *:* query with rows and start parameters is perfectly acceptable. Performance is terrible for this method when the index gets huge, though, because every request with a large start value makes Solr collect and sort all of the documents that come before it.

If "id" is your uniqueKey field, here's how you can do it. If that's not your uniqueKey field, substitute your uniqueKey field for id. This method doesn't work properly if you don't use a field with values that are guaranteed to be unique.

For the first query, send a query with these parameters, where NNNNNN is the number of docs you want to retrieve at once:

q=*:*&rows=NNNNNN&sort=id asc

For each subsequent query, use the following parameters, where XXX is the highest id value seen in the previous query:

q=id:{XXX TO *}&rows=NNNNNN&sort=id asc

The curly braces make the range exclusive, so the document with id XXX is not returned a second time. "Highest" here means last in sort order, which is the id of the final document in the previous response.

As soon as you see a numFound value less than NNNNNN, you will know that there's no more data. numFound counts every document matching the query, not just the ones returned, so when it drops below NNNNNN the current response already contains everything that is left.

Generally speaking, you'd want to avoid updating the index while doing these queries. If you never replace existing documents and you can guarantee that the value in the uniqueKey field for new documents will always be higher than any previous value, then you could continue updating the index. A database autoincrement field would qualify for that condition.

Thanks,
Shawn
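
For illustration, here is a minimal Python sketch of the loop described above. It is not tested against a real index, and it makes several assumptions: the uniqueKey field is "id", the id values contain no spaces or query-syntax characters (they are not escaped), and the JSON response writer (wt=json) is available. The names build_params, solr_fetch and walk, and the base URL in the usage note, are made up for this example. The loop stops when a page comes back with fewer than rows documents, which happens on the same response where numFound drops below rows.

```python
import json
import urllib.parse
import urllib.request


def build_params(last_id=None, rows=1000, key="id"):
    """Build the query parameters for one page of the walk."""
    if last_id is None:
        query = "*:*"  # first page: match everything
    else:
        # Curly braces give an exclusive lower bound, so last_id is skipped.
        # Assumption: last_id needs no escaping.
        query = "%s:{%s TO *}" % (key, last_id)
    return {"q": query, "rows": rows, "sort": key + " asc", "wt": "json"}


def solr_fetch(base_url):
    """Return a function that runs one query and returns its list of docs."""
    def fetch(params):
        url = base_url + "/select?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["response"]["docs"]
    return fetch


def walk(fetch, rows=1000, key="id", last_id=None):
    """Yield every document in key order, optionally resuming after last_id."""
    while True:
        docs = fetch(build_params(last_id, rows, key))
        for doc in docs:
            yield doc
        if len(docs) < rows:
            return  # short (or empty) page: nothing left
        last_id = docs[-1][key]  # highest id seen, in sort order
```

Usage would look something like walk(solr_fetch("http://localhost:8983/solr/collection1"), rows=5000). To make the run incremental, save the id of the last document you processed and pass it back as last_id the next time.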