Why wouldn't a simple timestamp work for the ordering? Although
I guess "simple timestamp" isn't really simple if the time settings
change.

So how about a simple counter field in your documents? Assuming
you're indexing from SolrJ, at the start of a run query q=*:*&sort=counter desc
and take the counter from the first document returned. Increment it for
each doc for the life of the indexing run. Now you've got, for all intents
and purposes, an identity field, albeit a manually maintained one.

Then use your counter field as Shawn suggests for pulling all the
data out.
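
And a sketch of the extraction side along the same lines. Again, the URL,
field name, and batch size are assumptions, and you'd want to persist
lastSeen somewhere between runs so you can resume where you left off:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class CounterWalkExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

    long lastSeen = 0L;   // load this from wherever you persisted it
    int batchSize = 1000;

    while (true) {
      SolrQuery q = new SolrQuery("*:*");
      // Only docs with a counter above the last one we processed.
      q.setFilterQueries("counter:[" + (lastSeen + 1) + " TO *]");
      q.set("sort", "counter asc");
      q.setRows(batchSize);

      SolrDocumentList batch = server.query(q).getResults();
      if (batch.isEmpty()) {
        break;   // nothing left above lastSeen right now
      }
      for (SolrDocument doc : batch) {
        // ... extract your statistics here ...
        lastSeen = Long.parseLong(doc.getFieldValue("counter").toString());
      }
    }
  }
}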

FWIW,
Erick

On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
<mcucchi...@apache.org> wrote:
> In both cases, for better performance, I'd first load just all the IDs;
> then, during processing, I'd load each document.
> As for the incremental requirement, it should not be difficult to
> write a hash function which maps a non-numerical ID to a value.
>  On Jul 27, 2013 7:03 AM, "Joe Zhang" <smartag...@gmail.com> wrote:
>
>> Dear list:
>>
>> I have an ever-growing Solr repository, and I need to process every single
>> document to extract statistics. What would be a reasonable process that
>> satisfies the following properties:
>>
>> - Exhaustive: I have to traverse every single document
>> - Incremental: in other words, it has to allow me to divide and conquer ---
>> if I have processed the first 20k docs, next time I can start with 20001.
>>
>> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
>> fact, given that the processing will take very long and the repository
>> keeps growing, it is not even clear that exhaustiveness can be achieved.
>>
>> I'm running Solr 3.6.2 in a single-machine setting; no Hadoop capability
>> yet. But I guess the same issues would still hold in a SolrCloud
>> environment, say within each shard, right?
>>
>> Any help would be greatly appreciated.
>>
>> Joe
>>
