I'll try reindexing the timestamp. The id-creation approach suggested by Erick sounds attractive, but the Nutch/Solr integration seems rather tight. I don't know where to break in to insert the id into Solr.
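
For reference, once tstamp is indexed, the range-query loop discussed in this thread could look roughly like the sketch below. It is only a sketch under assumptions: the select URL, the batch size, and the helper names next_batch_url and advance are hypothetical; only the tstamp field and the range-query idea come from the thread. It builds the query for the next batch and moves a bookmark forward, using an inclusive lower bound and skipping ids already handled at the boundary timestamp.

```python
from urllib.parse import urlencode

# Assumed location of the Solr select handler; adjust to the real install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def next_batch_url(last_tstamp=None, rows=1000, base=SOLR_SELECT):
    """Build the select URL for the next batch, oldest documents first."""
    lower = last_tstamp if last_tstamp else "*"
    params = {
        "q": "tstamp:[%s TO *]" % lower,  # inclusive lower bound
        "sort": "tstamp asc",
        "rows": rows,
        "fl": "id,tstamp",
        "wt": "json",
    }
    return base + "?" + urlencode(params)


def advance(docs, last_tstamp=None, seen_ids=frozenset()):
    """Return (fresh_docs, new_bookmark, ids_seen_at_new_bookmark).

    docs must be sorted by tstamp ascending, as the query above requests.
    Documents already processed at the boundary timestamp are dropped.
    """
    fresh = [
        d for d in docs
        if not (d["tstamp"] == last_tstamp and d["id"] in seen_ids)
    ]
    if not fresh:
        return fresh, last_tstamp, frozenset(seen_ids)
    bookmark = fresh[-1]["tstamp"]
    at_bookmark = {d["id"] for d in fresh if d["tstamp"] == bookmark}
    if bookmark == last_tstamp:
        # Bookmark did not move: keep remembering the earlier ids too.
        at_bookmark |= set(seen_ids)
    return fresh, bookmark, frozenset(at_bookmark)
```

The bookmark and the set of ids seen at that timestamp would have to be persisted between runs. One limitation of this sketch: if more documents share a single timestamp than fit in one batch, the loop cannot make progress, so rows needs to be larger than that.
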
On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> No, SolrJ doesn't provide this automatically. You'd be providing the
> counter by inserting it into the document as you created new docs.
>
> You could do this with any kind of document creation you are
> using.
>
> Best
> Erick
>
> On Mon, Jul 29, 2013 at 2:51 AM, Aditya <findbestopensou...@gmail.com> wrote:
>
> > Hi,
> >
> > The easiest solution would be to have timestamp indexed. Is there any issue
> > in doing re-indexing?
> > If you want to process records in batch then you need an ordered list and a
> > bookmark. You require a field to sort and maintain a counter / last id as
> > bookmark. This is mandatory to solve your problem.
> >
> > If you don't want to re-index, then you need to maintain information
> > related to visited nodes. Have a database / solr core which maintains a list
> > of IDs which are already processed. Fetch records from Solr; for each record,
> > check the new DB to see whether the record is already processed.
> >
> > Regards
> > Aditya
> > www.findbestopensource.com
> >
> > On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang <smartag...@gmail.com> wrote:
> >
> >> Basically, I was thinking about running a range query like Shawn suggested
> >> on the tstamp field, but unfortunately it was not indexed. Range queries
> >> only work on indexed fields, right?
> >>
> >> On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang <smartag...@gmail.com> wrote:
> >>
> >> > I've been thinking about the tstamp solution in the past few days, but too
> >> > bad, the field is available but not indexed...
> >> >
> >> > I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
> >> > counter value. If yes, that would be equivalent to an autoincrement id. I'm
> >> > indexing from Nutch though; don't know how to feed in such a counter...
> >> >
> >> > On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> Why wouldn't a simple timestamp work for the ordering? Although
> >> >> I guess "simple timestamp" isn't really simple if the time settings
> >> >> change.
> >> >>
> >> >> So how about a simple counter field in your documents? Assuming
> >> >> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
> >> >> Take the counter from the first document returned. Increment for
> >> >> each doc for the life of the indexing run. Now you've got, for all intents
> >> >> and purposes, an identity field albeit manually maintained.
> >> >>
> >> >> Then use your counter field as Shawn suggests for pulling all the
> >> >> data out.
> >> >>
> >> >> FWIW,
> >> >> Erick
> >> >>
> >> >> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
> >> >> <mcucchi...@apache.org> wrote:
> >> >> > In both cases, for better performance, first I'd load just all the IDs;
> >> >> > afterwards, during processing, I'd load each document.
> >> >> > As for the incremental requirement, it should not be difficult to
> >> >> > write a hash function which maps a non-numerical ID to a value.
> >> >> > On Jul 27, 2013 7:03 AM, "Joe Zhang" <smartag...@gmail.com> wrote:
> >> >> >
> >> >> >> Dear list:
> >> >> >>
> >> >> >> I have an ever-growing solr repository, and I need to process every single
> >> >> >> document to extract statistics. What would be a reasonable process that
> >> >> >> satisfies the following properties:
> >> >> >>
> >> >> >> - Exhaustive: I have to traverse every single document
> >> >> >> - Incremental: in other words, it has to allow me to divide and conquer ---
> >> >> >> if I have processed the first 20k docs, next time I can start with 20001.
> >> >> >>
> >> >> >> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> >> >> >> fact, given that the processing will take very long, and the repository
> >> >> >> keeps growing, it is not even clear that the exhaustiveness is achieved.
> >> >> >>
> >> >> >> I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> >> >> >> yet. But I guess the same issues still hold even if I have the solr cloud
> >> >> >> environment, right, say in each shard?
> >> >> >>
> >> >> >> Any help would be greatly appreciated.
> >> >> >>
> >> >> >> Joe
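
Erick's counter suggestion quoted above can be sketched in the same spirit. The field name counter and the q=*:*&sort=counter desc lookup come from his message; the function name, the rows=1 setting, and the sample values are made up for illustration, and how such a value would be fed in from Nutch remains the open question.

```python
# Parameters for looking up the highest counter already in the index,
# following the query in Erick's message (rows=1 is an added assumption).
HIGHEST_COUNTER_PARAMS = {"q": "*:*", "sort": "counter desc", "rows": 1}


def assign_counters(docs, highest_existing=0):
    """Return copies of docs, each numbered after highest_existing."""
    numbered_docs = []
    for offset, doc in enumerate(docs, start=1):
        numbered = dict(doc)  # copy so the caller's documents are untouched
        numbered["counter"] = highest_existing + offset
        numbered_docs.append(numbered)
    return numbered_docs


# Hypothetical usage: 20000 docs already indexed, two new ones arriving.
new_docs = assign_counters([{"id": "x"}, {"id": "y"}], highest_existing=20000)
```

With a counter like this in place, "start with 20001 next time" becomes a range query on counter instead of on tstamp.
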