Dear list,
I've written a special processor exactly for this kind of operation

https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch

This is how we use it:
http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch

It is capable of processing a 200 GB index in a few minutes;
copying/streaming large amounts of data is its normal mode of operation.
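
If you're stuck on stock Solr and don't want to pull in the batch
handler, the usual workaround on 3.x - which has no cursorMark, that
only arrived in 4.7 - is to sort on your unique key and filter past the
last id you saw. A rough SolrJ sketch; the field name "id", the server
URL and the batch size are assumptions you'd adapt to your setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;

public class ExhaustiveWalk {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
    String lastId = null;        // persist this between runs to resume
    final int batchSize = 1000;

    while (true) {
      SolrQuery q = new SolrQuery("*:*");
      q.setSortField("id", SolrQuery.ORDER.asc);  // sort on the unique key
      q.setRows(batchSize);
      if (lastId != null) {
        // exclusive lower bound: only ids strictly after the last one seen
        q.addFilterQuery("id:{" + ClientUtils.escapeQueryChars(lastId) + " TO *}");
      }

      QueryResponse rsp = solr.query(q);
      if (rsp.getResults().isEmpty()) break;  // walked everything indexed so far

      for (SolrDocument doc : rsp.getResults()) {
        lastId = (String) doc.getFieldValue("id");
        // ... extract your statistics from doc here ...
      }
    }
  }
}

Persist lastId somewhere between runs and you get the incremental,
divide-and-conquer property for free. One caveat: a resumed run only
sees ids sorting after the saved cursor, so for a repository that keeps
growing, an indexed timestamp field is usually a safer sort/filter key.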

If there is general interest, we can create a JIRA issue - but given my
current workload, it will take a while, and somebody else will *have to*
invest their time and energy in testing it, reporting issues, etc. Of
course, feel free to create the JIRA yourself or reuse the code -
hopefully, you will improve it and let me know ;-)

Roman
On 27 Jul 2013 01:03, "Joe Zhang" <smartag...@gmail.com> wrote:

> Dear list:
>
> I have an ever-growing solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
>
> - Exhaustive: I have to traverse every single document
> - Incremental: in other words, it has to allow me to divide and conquer ---
> if I have processed the first 20k docs, next time I can start with 20001.
>
> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> fact, given that the processing will take very long, and the repository
> keeps growing, it is not even clear that the exhaustiveness is achieved.
>
> I'm running Solr 3.6.2 in a single-machine setting; no Hadoop capability
> yet. But I guess the same issues would still hold even in a SolrCloud
> environment, right - say, within each shard?
>
> Any help would be greatly appreciated.
>
> Joe
>
