Dear list,

I've written a special processor exactly for this kind of operation:
https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch

This is how we use it:
http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch

It is capable of processing an index of 200 GB in a few minutes; copying/streaming large amounts of data is normal for it.

If there is general interest, we can create a JIRA issue - but given my current workload it will take longer, and somebody else will *have to* invest their time and energy in testing it, reporting, etc. Of course, feel free to create the JIRA yourself or reuse the code - hopefully you will improve it and let me know ;-)

Roman

On 27 Jul 2013 01:03, "Joe Zhang" <smartag...@gmail.com> wrote:
> Dear list:
>
> I have an ever-growing Solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
>
> - Exhaustive: I have to traverse every single document.
> - Incremental: in other words, it has to allow me to divide and conquer ---
> if I have processed the first 20k docs, next time I can start with doc 20001.
>
> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> fact, given that the processing will take very long and the repository
> keeps growing, it is not even clear that exhaustiveness can be achieved.
>
> I'm running Solr 3.6.2 in a single-machine setting, with no Hadoop capability
> yet. But I guess the same issues would still hold even in a SolrCloud
> environment, right, say within each shard?
>
> Any help would be greatly appreciated.
>
> Joe
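
P.S. If pulling in MontySolr is overkill, here is a rough SolrJ sketch of the usual fallback on plain Solr 3.x (which has no cursor support): sort on the uniqueKey and, on each pass, filter on everything after the last id you processed. The field name "id", the server URL, and the batch size are placeholders for your own schema and setup; the statistics extraction is left as a stub.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class IndexWalker {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        String lastId = null;          // persist this between runs to resume where you stopped
        final int batchSize = 1000;

        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            q.addSortField("id", SolrQuery.ORDER.asc);   // "id" = your uniqueKey field
            q.setRows(batchSize);
            if (lastId != null) {
                // only fetch documents strictly after the last one already processed
                q.addFilterQuery("id:{" + ClientUtils.escapeQueryChars(lastId) + " TO *}");
            }

            SolrDocumentList docs = server.query(q).getResults();
            if (docs.isEmpty()) {
                break;                 // reached the end of the index
            }
            for (SolrDocument d : docs) {
                // ... extract your statistics from d here ...
            }
            lastId = docs.get(docs.size() - 1).getFieldValue("id").toString();
            // write lastId to disk/db here so a later run can pick up at this point
        }
    }
}

Because the walk is keyed on the last processed id, you can stop at any point, store lastId, and resume later; documents indexed in the meantime are still picked up as long as their ids sort after that value (an indexed timestamp field works the same way and may handle a growing repository more naturally).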