I suspect you're worrying about something you don't need to. At 1 insert every
30 seconds, and assuming 30,000,000 records will fit on a machine (I've seen
this), you're talking 900,000,000 seconds worth of data on a single box!
Or roughly
10,000 days' worth of data. Test, of course, YMMV.
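A quick back-of-the-envelope check of those numbers (a sketch; the
30,000,000-records-per-box capacity is just the assumption above):

```python
# How long would it take a 1-insert-per-30-seconds feed to fill one box?
SECONDS_PER_INSERT = 30          # one log insert every 30 seconds
RECORDS_PER_BOX = 30_000_000     # assumed single-machine capacity
SECONDS_PER_DAY = 86_400

seconds_of_data = RECORDS_PER_BOX * SECONDS_PER_INSERT   # 900,000,000 s
days_of_data = seconds_of_data / SECONDS_PER_DAY         # roughly 10,400 days

print(f"{seconds_of_data:,} seconds, about {days_of_data:,.0f} days")
```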

Or I'm misunderstanding what "1 log insert" means; I guess it could be a full
log file....

But do the simple thing first: just let Solr do what it does by default, and
periodically do a delete-by-query on documents you want to roll off the end.
Especially since you say that queries happen only every few days, the tricks
for utilizing "hot shards" are probably not very useful for you at that low a
query rate.
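The periodic "roll off the end" delete can be sketched like this. The field
name "timestamp_dt" and the 30-day window are assumptions for illustration,
not from the thread; Solr's date math ("NOW-30DAYS") computes the cutoff
server-side:

```python
# Build the delete-by-query payload Solr's /update handler accepts.
RETENTION_DAYS = 30              # assumed retention window
DATE_FIELD = "timestamp_dt"      # hypothetical date field name

query = f"{DATE_FIELD}:[* TO NOW-{RETENTION_DAYS}DAYS]"
payload = f"<delete><query>{query}</query></delete>"

# POST this to the collection's update handler, e.g.:
#   curl http://localhost:8983/solr/logs/update?commit=true \
#        -H "Content-Type: text/xml" -d "<delete>...</delete>"
print(payload)
```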

Test, of course
Best
Erick

On Tue, May 28, 2013 at 8:42 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> Volume of data:
> 1 log insert every 30 seconds; queries are done sporadically and
> asynchronously, at a much lower frequency -- every few days
>
> Also the majority of the requests are indeed going to be within a splice of 
> time (typically hours or at most a few days)
>
> Type of queries:
> Keyword or term search
> Search by guid (or id as known in the solr world)
> Reserved or percolation queries to be executed when new data becomes available
> Search by dates as mentioned above
>
> Regards
>
>
> Sent from my iPhone
>
> On May 28, 2013, at 4:25 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
>>
>> : This is kind of the approach used by elastic search , if I'm not using
>> : solrcloud will I be able to use shard aliasing, also with this approach
>> : how would replication work, is it even needed?
>>
>> you haven't said much about the volume of data you expect to deal with,
>> nor have you really explained what types of queries you intend to do --
>> ie: you said you were interested in a "rolling window of indexes
>> around n days of data" but you never clarified why you think a
>> rolling window of indexes would be useful to you or how exactly you would
>> use it.
>>
>> The primary advantage of sharding by date is if you know that a large
>> percentage of your queries are only going to be within a small range of
>> time, and therefore you can optimize those requests to only hit the shards
>> necessary to satisfy that small window of time.
>>
>> if the majority of requests are going to be across your entire "n days" of
>> data, then date based sharding doesn't really help you -- you can just use
>> arbitrary (randomized) sharding using periodic deleteByQuery commands to
>> purge anything older than N days.  Query the whole collection by default,
>> and add a filter query if/when you want to restrict your search to only a
>> narrow date range of documents.
>>
>> this is the same general approach you would use on a non-distributed /
>> non-SolrCloud setup if you just had a single collection on a single master
>> replicated to some number of slaves for horizontal scaling.
>>
>>
>> -Hoss
>>

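The filter-query restriction Hoss describes can be sketched as follows. The
collection name "logs", field "timestamp_dt", and the example term query are
assumptions, not from the thread; filter queries (fq) are cached by Solr
separately from the main query, which suits repeated "last few hours/days"
searches:

```python
from urllib.parse import urlencode

# Query the whole collection, but restrict to a narrow date window via fq.
params = {
    "q": "message:error",                    # hypothetical keyword/term search
    "fq": "timestamp_dt:[NOW-2DAYS TO NOW]", # the narrow "splice of time"
    "rows": 50,
}
query_string = urlencode(params)

# e.g. GET http://localhost:8983/solr/logs/select?<query_string>
print(query_string)
```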