On 7/26/2013 11:02 PM, Joe Zhang wrote:
> I have an ever-growing solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
> 
> - Exhaustive: I have to traverse every single document
> - Incremental: in other words, it has to allow me to divide and conquer ---
> if I have processed the first 20k docs, next time I can start with 20001.

If your index isn't very big, a *:* query with rows and start parameters
is perfectly acceptable.  Performance is terrible for this method when
the index gets huge, though.
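For illustration, here's a rough sketch of that approach in Python (the
URL, the core name "mycore", and the ROWS value are just placeholders
for your setup):

import requests

SOLR_URL = "http://localhost:8983/solr/mycore/select"  # placeholder URL/core
ROWS = 1000

start = 0
while True:
    data = requests.get(SOLR_URL, params={
        "q": "*:*",
        "sort": "id asc",   # stable order so pages don't shift between requests
        "start": start,
        "rows": ROWS,
        "wt": "json",
    }).json()
    docs = data["response"]["docs"]
    if not docs:
        break
    for doc in docs:
        pass                # extract your statistics from each doc here
    start += ROWS           # deep start values are what make this slow on big indexes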

If "id" is your uniqueKey field, here's how you can do it.  If that's
not your uniqueKey field, substitute your uniqueKey field for id.  This
method doesn't work properly if you don't use a field with values that
are guaranteed to be unique.

For the first query, send a query with these parameters, where NNNNNN is
the number of docs you want to retrieve at once:
q=*:*&rows=NNNNNN&sort=id asc

For each subsequent query, use the following parameters, where XXX is
the highest id value seen in the previous query (the curly braces make
the lower bound exclusive, so XXX itself is not returned again):
q=id:{XXX TO *}&rows=NNNNNN&sort=id asc

As soon as a query returns fewer than NNNNNN documents (numFound less
than NNNNNN), you will know that there's no more data.
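Putting that together, here's a rough Python sketch of the loop, again
assuming "id" is the uniqueKey and that its values are simple enough not
to need query escaping (URL and core name are placeholders):

import requests

SOLR_URL = "http://localhost:8983/solr/mycore/select"  # placeholder URL/core
ROWS = 1000

last_id = None
while True:
    # First pass matches everything; later passes start just past the last id seen.
    query = "*:*" if last_id is None else "id:{%s TO *}" % last_id
    data = requests.get(SOLR_URL, params={
        "q": query,
        "sort": "id asc",
        "rows": ROWS,
        "wt": "json",
    }).json()
    docs = data["response"]["docs"]
    for doc in docs:
        pass                      # extract your statistics from each doc here
    if len(docs) < ROWS:
        break                     # fewer than ROWS returned: no more data
    last_id = docs[-1]["id"]      # highest id in this batch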

Generally speaking, you'd want to avoid updating the index while doing
these queries.  If you never replace existing documents and you can
guarantee that the value in the uniqueKey field for new documents will
always be higher than any previous value, then you could continue
updating the index.  A database autoincrement field would qualify for
that condition.

Thanks,
Shawn
