On a related note, inspired by what you said, Shawn: an auto-increment id seems perfect here, yet I found no such support in Solr. A UUID only guarantees uniqueness, not ordering.
On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang <smartag...@gmail.com> wrote:

> Thanks for your kind reply, Shawn.
>
> On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey <s...@elyograg.org> wrote:
>
>> On 7/26/2013 11:02 PM, Joe Zhang wrote:
>> > I have an ever-growing solr repository, and I need to process every
>> > single document to extract statistics. What would be a reasonable
>> > process that satisfies the following properties:
>> >
>> > - Exhaustive: I have to traverse every single document
>> > - Incremental: in other words, it has to allow me to divide and
>> >   conquer --- if I have processed the first 20k docs, next time I
>> >   can start with 20001.
>>
>> If your index isn't very big, a *:* query with rows and start parameters
>> is perfectly acceptable. Performance is terrible for this method when
>> the index gets huge, though.
>
> ==> Essentially we are doing pagination here, right? If performance is
> not the concern, given that the index is dynamic, does the order of
> entries remain stable over time?
>
>> If "id" is your uniqueKey field, here's how you can do it. If that's
>> not your uniqueKey field, substitute your uniqueKey field for id. This
>> method doesn't work properly if you don't use a field with values that
>> are guaranteed to be unique.
>>
>> For the first query, send a query with these parameters, where NNNNNN is
>> the number of docs you want to retrieve at once:
>> q=*:*&rows=NNNNNN&sort=id asc
>>
>> For each subsequent query, use the following parameters, where XXX is
>> the highest id value seen in the previous query:
>> q={XXX TO *}&rows=NNNNNN&sort=id asc
>
> ==> This approach seems to require that the id field is numerical, right?
> I have a text-based id that is unique.
>
> ==> I'm not sure I understand the "q={XXX TO *}" part --> wouldn't the
> query be matched against the default search field, which could be
> "content", for example? How would that do the job?
>
>> As soon as you see a numFound value less than NNNNNN, you will know that
>> there's no more data.
>>
>> Generally speaking, you'd want to avoid updating the index while doing
>> these queries. If you never replace existing documents and you can
>> guarantee that the value in the uniqueKey field for new documents will
>> always be higher than any previous value, then you could continue
>> updating the index. A database autoincrement field would qualify for
>> that condition.
>>
>> Thanks,
>> Shawn
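
For reference, the paging scheme quoted above could be sketched roughly as follows in Python. This is a minimal sketch under stated assumptions, not Shawn's exact method: the names page_params, iterate_all and fetch are hypothetical, the range clause is written with an explicit field name (id:{... TO *}) on the assumption that the range is meant to apply to the uniqueKey field rather than the default search field, and the lower bound is quoted on the assumption that ids may contain characters the query parser treats specially. fetch stands in for whatever client code sends the parameters to Solr and returns the list of documents.

```python
from typing import Callable, Dict, Iterator, List, Optional

Params = Dict[str, str]
Doc = Dict[str, str]


def page_params(rows: int, last_id: Optional[str] = None, key: str = "id") -> Params:
    """Build request parameters for one page, sorted by the uniqueKey field."""
    if last_id is None:
        # First page: match everything.
        query = "*:*"
    else:
        # Later pages: exclusive lower bound on the last id already seen.
        # Quoting the bound is an assumption made for ids with special characters.
        escaped = last_id.replace("\\", "\\\\").replace('"', '\\"')
        query = f'{key}:{{"{escaped}" TO *}}'
    return {"q": query, "rows": str(rows), "sort": f"{key} asc"}


def iterate_all(
    fetch: Callable[[Params], List[Doc]], rows: int, key: str = "id"
) -> Iterator[Doc]:
    """Yield every document, one page at a time, in ascending key order.

    `fetch` is a hypothetical callable that runs one query and returns its docs.
    """
    last_id: Optional[str] = None
    while True:
        docs = fetch(page_params(rows, last_id, key))
        yield from docs
        if len(docs) < rows:
            # A short (or empty) page means there is nothing left to read.
            return
        last_id = docs[-1][key]
```

Nothing in the sketch requires numeric ids: it only relies on the sort order and the range comparison agreeing with each other, which I would expect to hold for a plain string uniqueKey compared lexicographically. To resume later ("start with 20001"), the last id seen can be stored and passed back in as last_id.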