Re: Solr architecture

Upayavira Thu, 11 Feb 2016 06:47:15 -0800

Your biggest issue here is likely to be HTTP connections. Making an HTTP
connection to Solr is far more expensive than the task of adding a single
document to the index. If you expect to add 24 billion docs per
day, I'd suggest that you will need to merge those documents into batches
before sending them to Solr.
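
As a rough illustration of the batching idea, here is a minimal Python
sketch. The batch size of 1000, the commitWithin value, the base URL and
the collection name "events" are all assumptions made for the example, not
recommendations from this thread. It groups documents into lists and posts
each list as one JSON array to Solr's /update handler, using only the
standard library.

```python
import json
import urllib.request
from itertools import islice


def batched(docs, size):
    """Yield lists of up to `size` documents from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_update_request(base_url, collection, batch, commit_within_ms=10000):
    """Build one HTTP POST carrying a whole batch as a JSON array.

    base_url, collection and commit_within_ms are illustrative values.
    """
    url = f"{base_url}/{collection}/update?commitWithin={commit_within_ms}"
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_events(events, base_url, collection, batch_size=1000):
    """Send events to Solr one batch per request; return the number sent."""
    sent = 0
    for batch in batched(events, batch_size):
        request = build_update_request(base_url, collection, batch)
        with urllib.request.urlopen(request) as response:
            response.read()  # drain the response so the connection can be released
        sent += len(batch)
    return sent
```

A call such as index_events(event_stream, "http://localhost:8983/solr",
"events") would then issue one request per 1000 events rather than one per
event. A real client would also want retries, error handling and
connection reuse, which this sketch leaves out.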


To return to my previous question: what do you gain by using Solr that you
don't get from other solutions? I'd suggest that to make this system really
work, you are going to need a deep understanding of how Lucene works
(segments, segment merges, deletions, and many other things), because when
you start to work at that scale, the implementation details behind
Lucene really start to matter and affect your ability to succeed.

I'd suggest that what you are undertaking can certainly be done, but is
a substantial project.
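
To put rough numbers on that scale, here is a back-of-the-envelope sketch
in Python using the figures quoted below (4M sessions a day, about 6000
events per session). The batch size of 1000 is an assumed value for
illustration, and the result is a daily average, so peak rates would be
higher.

```python
import math

SESSIONS_PER_DAY = 4_000_000   # from the quoted message
EVENTS_PER_SESSION = 6_000     # from the quoted message
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_BATCH_SIZE = 1_000     # assumption, not a figure from the thread


def docs_per_day(sessions, events_per_session):
    """Total documents indexed per day with one doc per event."""
    return sessions * events_per_session


def average_docs_per_second(total_docs_per_day):
    """Average indexing rate if the load were spread evenly over the day."""
    return total_docs_per_day / SECONDS_PER_DAY


def update_requests_per_second(docs_per_sec, batch_size):
    """HTTP update requests per second needed at a given batch size."""
    return math.ceil(docs_per_sec / batch_size)


total = docs_per_day(SESSIONS_PER_DAY, EVENTS_PER_SESSION)
rate = average_docs_per_second(total)
print(total)                                                  # 24000000000
print(round(rate))                                            # 277778
print(update_requests_per_second(rate, ASSUMED_BATCH_SIZE))   # 278
```

So even with batches of 1000, the cluster would have to absorb a few
hundred update requests per second on average, before any queries.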

Upayavira

On Wed, Feb 10, 2016, at 09:48 PM, Mark Robinson wrote:
> Thanks everyone for your suggestions.
> Based on them I am planning to have one doc per event, with the sessionId
> in common.
> 
> So in this case, would indexing each doc as and when it arrives be
> okay? Or do we still need to batch before indexing to Solr?
> 
> Also, with 4M sessions a day and about 6000 docs (events) per session, we
> can expect about 24 billion docs per day!
> 
> Will Solr still hold up? If so, could someone please recommend a sizing
> to cater for this level of data?
> The query rate is around 320 qps.
> 
> Thanks!
> Mark
> 
> 
> On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
> 
> > Hi Mark,
> > Appending session actions just to be able to return more than one session
> > without retrieving a large number of results is not a good tradeoff. As
> > Upayavira suggested, you should consider storing one action per doc and
> > aggregating at read time, or pushing to Solr once the session ends and
> > aggregating in some other layer.
> > If you think handling the infrastructure might be too much, you may
> > consider using one of the logging services to hold the data. One such
> > service is Sematext's Logsene (http://sematext.com/logsene).
> >
> > Thanks,
> > Emir
> >
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> > On 10.02.2016 03:22, Mark Robinson wrote:
> >
> >> Thanks for your replies and suggestions!
> >>
> >> Why do I store all events related to a session under one doc?
> >> Each session can have about 500 total entries (events) corresponding to
> >> it.
> >> So when I try to retrieve a session's info it can come back with around
> >> 500 records. If it is instead one compound doc per session, I can retrieve
> >> more sessions at a time.
> >> E.g. under a sessionId, an array of eventA activities, eventB activities
> >> (using JSON). When an eventA activity occurs again, we will read all the
> >> data for that session, append this extra info to the eventA data and push
> >> the whole session-related data back (indexing) to Solr. Like this for many
> >> sessions in parallel.
> >>
> >>
> >> Why NRT?
> >> Many sessions are being written in parallel (4 million sessions, hence
> >> 4 million docs per day). A person can query at any time.
> >>
> >> It is just a look up?
> >> Yes. We just need to retrieve all info for a session and pass it on to
> >> another system. We may even do some extra querying on some data like
> >> timestamps, pageurl etc in that info added to a session.
> >>
> >> Thinking of having the data separate from the actual Solr Instance and
> >> mention the loc of the dataDir in solrconfig.
> >>
> >> If Solr is not a good option could you please suggest something which will
> >> satisfy this use case with min response time while querying.
> >>
> >> Thanks!
> >> Mark
> >>
> >> On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com>
> >> wrote:
> >>
> >>> So as I understand your use case, it's effectively logging actions within
> >>> a user session. Why do you have to do the update in NRT? Why not just log
> >>> all the user session events (with some unique key, and ensuring the
> >>> session Id is in the document somewhere), then when you want to do the
> >>> query, you join on the session id, and that gives you all the data records
> >>> for that session. I don't really follow why it has to be 1 document (which
> >>> you continually update). If you really need that aggregation, couldn't
> >>> that happen offline?
> >>>
> >>> I guess your 1 saving grace is that you query using the unique ID (in
> >>> your scenario) so you could use the real-time get handler, since you
> >>> aren't doing a complex query (strictly it's not a search, it's a raw key
> >>> lookup).
> >>>
> >>> But I would still question your use case. If you go the Solr route for
> >>> that kind of scale, with that much querying and indexing, you're going to
> >>> have to throw a lot of hardware at it, as Jack says probably in the order
> >>> of hundreds of machines...
> >>>
> >>> On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:
> >>>
> >>>> Bear in mind that Lucene is optimised towards high read lower write.
> >>>> That is, it puts in a lot of effort at write time to make reading
> >>>> efficient. It sounds like you are going to be doing far more writing
> >>>> than reading, and I wonder whether you are necessarily choosing the
> >>>> right tool for the job.
> >>>>
> >>>> How would you later use this data, and what advantage is there to
> >>>> storing it in Solr?
> >>>>
> >>>> Upayavira
> >>>>
> >>>> On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
> >>>>
> >>>>> Hi,
> >>>>> Thanks for all your suggestions. I took some time to get more accurate
> >>>>> details. Please find what I have gathered:
> >>>>>
> >>>>> My data being indexed is something like this.
> >>>>> I am basically capturing all data related to a user session.
> >>>>> Inside a session I have categorized my actions like actionA, actionB
> >>>>> etc., per page.
> >>>>> So each time an action pertaining to say actionA or actionB etc. (in
> >>>>> each page) happens, it is updated in Solr under that session (sessionId).
> >>>>>
> >>>>> So in short there is only one doc pertaining to a single session
> >>>>> (identified by sessionId) in my Solr index, and that is retrieved and
> >>>>> updated whenever a new action under that session occurs.
> >>>>> We expect up to 4 million sessions per day.
> >>>>>
> >>>>> On average *one session's doc has a size* of *3MB to 20MB*.
> >>>>> So if it is *4 million sessions per day*, each session writing around
> >>>>> *500 times to Solr*, it is *2 billion writes (indexing) per day to Solr*.
> >>>>> As it is one doc per session, it is *4 million docs per day*.
> >>>>> This is around *80K docs indexed per second* during *peak* hours and
> >>>>> around *15K docs indexed per second* into Solr during *non-peak* hours.
> >>>>> Number of queries per second is around *320 queries per second*.
> >>>>>
> >>>>>
> >>>>> 1. Average size of a doc
> >>>>>       3MB to 20MB
> >>>>> 2. Query types:-
> >>>>>       While a session is in progress, whatever data is there for that
> >>>>>       session so far is queried, and the new action's details are
> >>>>>       captured, appended to the existing data already captured for that
> >>>>>       session, and indexed back into Solr. So the longer the session, the
> >>>>>       more data is retrieved for each subsequent query to get the current
> >>>>>       data captured for that session.
> >>>>>       Also querying can be done on timestamp etc., which is captured
> >>>>>       along with each action.
> >>>>> 3. Are docs grouped somehow?
> >>>>>       All data related to a session are retrieved from Solr, updated and
> >>>>>       indexed back to Solr based on sessionId. No other grouping.
> >>>>> 4. Are they time sensitive (NRT or offline process does this)
> >>>>>       As mentioned above this is in NRT. Each time a new user action in
> >>>>>       that session happens, we need to query the existing session info
> >>>>>       already captured for that session, append the new data to the info
> >>>>>       retrieved, and index it back to Solr.
> >>>>> 5. Will they update or is it rebuilt every time, etc.
> >>>>>       Each time a new user action occurs, the full data captured so far
> >>>>>       for that session is retrieved from Solr, the latest data pertaining
> >>>>>       to the new action is appended, and the result is indexed back to
> >>>>>       Solr.
> >>>>> 6. And the other thing you haven't told us is whether you plan on
> >>>>>    _adding_ 2B docs a day or whether that number is the total corpus
> >>>>>    size and you are re-indexing the 2B docs/day. IOW, if you are adding
> >>>>>    2B docs/day, 30 days later do you have 2B docs or 60B docs in your
> >>>>>    corpus?
> >>>>>       We are expecting around 4 million sessions per day (per session 500
> >>>>>       writes to Solr), which turns out to be 2B indexing operations per
> >>>>>       day. So after 30 days it would be 4 million * 30 docs in the index.
> >>>>> 7. Is there any aging of docs
> >>>>>       No, we always query against the whole corpus present.
> >>>>> 8. Is any doc deleted?
> >>>>>       No, all data remains in the index.
> >>>>>
> >>>>> Any suggestion is very welcome!
> >>>>>
> >>>>> Thanks!
> >>>>> Mark.
> >>>>>
> >>>>>
> >>>>> On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky
> >>>>> <jack.krupan...@gmail.com> wrote:
> >>>>>
> >>>>>> Oops... at 100 qps for a single node you would need 120 nodes to get
> >>>>>> to 12K qps and 800 nodes to get 80K qps, but that is just an extremely
> >>>>>> rough ballpark estimate, not some precise and firm number. And that's
> >>>>>> if all the queries can be evenly distributed throughout the cluster and
> >>>>>> don't require fanout to other shards, which effectively turns each
> >>>>>> incoming query into n queries where n is the number of shards.
> >>>>>>
> >>>>>> -- Jack Krupansky
> >>>>>>
> >>>>>> On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky
> >>>>>> <jack.krupan...@gmail.com> wrote:
> >>>>>>
> >>>>>>> So is there any aging or TTL (in database terminology) of older docs?
> >>>>>>> And do all of your queries need to query all of the older documents all
> >>>>>>> of the time or is there a clear hierarchy of querying for aged
> >>>>>>> documents, like past 24-hours vs. past week vs. past year vs. older
> >>>>>>> than a year? Sure, you can always use a function query to boost by the
> >>>>>>> inverse of document age, but Solr would be more efficient with filter
> >>>>>>> queries or separate indexes for different time scales.
> >>>>>>>
> >>>>>>> Are documents ever updated or are they write-once?
> >>>>>>>
> >>>>>>> Are documents explicitly deleted?
> >>>>>>>
> >>>>>>> Technically you probably could meet those specs, but... how many
> >>>>>>> organizations have the resources and the energy to do so?
> >>>>>>>
> >>>>>>> As a back of the envelope calculation, if Solr gave you 100 queries per
> >>>>>>> second per node, that would mean you would need 1,200 nodes. It would
> >>>>>>> also depend on whether those queries are very narrow so that a single
> >>>>>>> node can execute them or if they require fanout to other shards and
> >>>>>>> then aggregation of results from those other shards.
> >>>>>>>
> >>>>>>> -- Jack Krupansky
> >>>>>>>
> >>>>>>> On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson
> >>>>>>> <erickerick...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Short form: You really have to prototype. Here's the long form:
> >>>>>>>>
> >>>>>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>>>>>>>
> >>>>>>>> I've seen between 20M and 200M docs fit on a single piece of hardware,
> >>>>>>>> so you'll absolutely have to shard.
> >>>>>>>>
> >>>>>>>> And the other thing you haven't told us is whether you plan on
> >>>>>>>> _adding_ 2B docs a day or whether that number is the total corpus size
> >>>>>>>> and you are re-indexing the 2B docs/day. IOW, if you are adding 2B
> >>>>>>>> docs/day, 30 days later do you have 2B docs or 60B docs in your
> >>>>>>>> corpus?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Erick
> >>>>>>>>
> >>>>>>>> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar
> >>>>>>>> <susheel2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Also, are you expecting the 2 billion docs to be indexed in NRT, or
> >>>>>>>>> will indexing be offline (during off hours etc.)? For more accurate
> >>>>>>>>> sizing you may also want to index say 10 million documents, which may
> >>>>>>>>> give you an idea of how large your index is, and then use that for
> >>>>>>>>> extrapolation to come up with memory requirements.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Susheel
> >>>>>>>>>
> >>>>>>>>> On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <
> >>>>>>>>> emir.arnauto...@sematext.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Mark,
> >>>>>>>>>> Can you give us a bit more detail: size of docs, query types, are
> >>>>>>>>>> docs grouped somehow, are they time sensitive, will they update or
> >>>>>>>>>> is it rebuilt every time, etc.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Emir
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 08.02.2016 16:56, Mark Robinson wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> We have a requirement where we would need to index around 2 billion
> >>>>>>>>>>> docs in a day.
> >>>>>>>>>>> The queries against this indexed data set can be around 80K queries
> >>>>>>>>>>> per second during peak time and around 12K queries per second during
> >>>>>>>>>>> non-peak hours.
> >>>>>>>>>>>
> >>>>>>>>>>> Can Solr handle these huge volumes?
> >>>>>>>>>>>
> >>>>>>>>>>> If so, assuming we have no budget constraints, what would be a
> >>>>>>>>>>> recommended Solr setup (number of shards, number of Solr instances
> >>>>>>>>>>> etc.)?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>> Mark
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
