Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Perfect. Thank you very much.

Andy

--- On Fri, 4/8/11, Pascal Coupet wrote:

> From: Pascal Coupet
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 10:20 AM
>
> I did put a PDF version here:
> https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B02DHBZQYYT_MmRkZTY0YjQtODJmZS00Mzg0LWJiNTEtOWJjNzViNmNjZjdh&hl=en&authkey=CL2Fq_QG
>
> Zoom it to get a better view.
>
> Pascal
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
I did put a PDF version here:

https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B02DHBZQYYT_MmRkZTY0YjQtODJmZS00Mzg0LWJiNTEtOWJjNzViNmNjZjdh&hl=en&authkey=CL2Fq_QG

Zoom it to get a better view.

Pascal

2011/4/8 Andy

> Could anyone please post a version of the document in PDF or OpenOffice
> format? I'm on Linux so there's no way for me to use MS Word.
>
> Thanks.
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Could anyone please post a version of the document in PDF or OpenOffice format? I'm on Linux so there's no way for me to use MS Word.

Thanks.

--- On Fri, 4/8/11, Albert Vila wrote:

> From: Albert Vila
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 9:25 AM
>
> Yes, it won't work if you are using OpenOffice. However, it works fine
> with Microsoft Word.
>
> Hope it helps.
>
> Albert
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Yes, it won't work if you are using OpenOffice. However, it works fine with Microsoft Word.

Hope it helps.

Albert

On 8 April 2011 14:55, Andy wrote:
> I can't view the document either -- it showed up empty.
>
> Has anyone succeeded in viewing it?
>
> Andy
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
I can't view the document either -- it showed up empty.

Has anyone succeeded in viewing it?

Andy

--- On Fri, 4/8/11, Albert Vila wrote:

> From: Albert Vila
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 3:43 AM
>
> Ephraim, I still can't view the document.
>
> Don't know if I'm doing something wrong, but I downloaded it and it
> appears to be empty.
>
> Albert
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
You might also want to look at the Heritrix crawler:

http://crawler.archive.org/

I have written three crawlers in the past, all for RSS feeds, and it is not easy. Happy to provide tips and help if you want to go down that route.

François

On Apr 8, 2011, at 1:53 AM, Andrea Campi wrote:

> Writing a web crawler from scratch is... ambitious.
> Have you looked at Nutch (http://nutch.apache.org/)? It uses Solr for
> indexing, so it may help you get a head start.
> If you've never used Hadoop before it may take some getting used to, but I
> have helped a customer implement it and helped a couple of their devs
> (medium-seniority) get up to speed, and it didn't take them too long to get
> used to it.
>
> Andrea
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Ephraim, I still can't view the document.

Don't know if I'm doing something wrong, but I downloaded it and it appears to be empty.

Albert

On 7 April 2011 09:32, Ephraim Ofir wrote:
> You can't view it online, but you should be able to download it from:
> https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP
>
> Enjoy,
> Ephraim Ofir
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
On Fri, Apr 8, 2011 at 6:23 AM, Jens Mueller wrote:

> Hello all,
>
> thanks for your generous help.
>
> I think I now know everything: (What I want to do is to build a web crawler
> and index the documents found). I will start with the setup as suggested by
>

Writing a web crawler from scratch is... ambitious.

Have you looked at Nutch (http://nutch.apache.org/)? It uses Solr for indexing, so it may help you get a head start.

If you've never used Hadoop before it may take some getting used to, but I have helped a customer implement it and helped a couple of their devs (medium-seniority) get up to speed, and it didn't take them too long to get used to it.

Andrea
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Hello all,

thanks for your generous help.

I think I now know everything. (What I want to do is to build a web crawler and index the documents found.) I will start with the setup as suggested by Ephraim (several sharded masters, each with at least one slave for reads, and some aggregators for querying). This is only a prototype to learn more...

And the Google PDF from Walter is very interesting; that is something that I can then try if I hit the limits with the setup above. But before that, I have to learn much more about all this indexing / index building and Solr/Lucene stuff.

Thanks again for your help!!

best regards
jens

2011/4/7 Walter Underwood

> Understanding what Google does will NOT help you build your engine. Just
> like understanding an F1 race car does not help you build a Toyota Camry. One
> is built for performance only, and requires LOTS of support, the other for
> supportability and stability. Very different engineering goals and designs.
>
> Here is one view of Google's search setup:
> http://www.linesave.co.uk/google_search_engine.html
>
> This talk gives a lot more detail. Summary in the blog post, slides in the
> PDF. Google's search is entirely in-memory. They load off disk and run.
>
> http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsdm-2009.html
> http://research.google.com/people/jeff/WSDM09-keynote.pdf
>
> How big will your system be? Does it require real-time updates?
>
> wunder
> --
> Walter Underwood
> Lead Engineer, MarkLogic
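
A minimal sketch of how updates and queries might be routed in the setup described above (several sharded masters, each with at least one slave for reads, plus aggregators). The host names, the hash-based routing, and the helper names `shard_for` and `distributed_query_url` are assumptions made for illustration and do not come from Ephraim's document. The only Solr feature relied on is the `shards` request parameter of distributed search.

```python
import hashlib
from urllib.parse import urlencode


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick the shard (master) for an update by hashing the document id.

    Hash-modulo routing is an assumption; any stable mapping would do.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distributed_query_url(aggregator: str, slaves: list[str], query: str) -> str:
    """Build a distributed-search URL for an aggregator node.

    `slaves` lists one read-only slave per shard in Solr's
    host:port/path form; the aggregator fans the query out to them.
    """
    params = urlencode({"q": query, "shards": ",".join(slaves)}, safe=":/,")
    return f"http://{aggregator}/select?{params}"


# Hypothetical hosts: two shards, one slave each, one aggregator.
slaves = ["slave1:8983/solr", "slave2:8983/solr"]
url = distributed_query_url("agg1:8983/solr", slaves, "solr")
target_shard = shard_for("doc-42", len(slaves))
```

A load balancer in front of the slaves of each shard would replace the fixed host names used here; the aggregator would then list one balanced address per shard.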
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
On Apr 6, 2011, at 10:29 PM, Jens Mueller wrote:

> Walter, thanks for the advice: Well you are right, mentioning google. My
> question was also to understand how such large systems like google/facebook
> are actually working. So my numbers are just theoretical and made up. My
> system will be smaller, but I would be very happy to understand how such
> large systems are built and I think the approach Ephraim showed should be
> working quite well at large scale.

Understanding what Google does will NOT help you build your engine. Just like understanding an F1 race car does not help you build a Toyota Camry. One is built for performance only, and requires LOTS of support, the other for supportability and stability. Very different engineering goals and designs.

Here is one view of Google's search setup:

http://www.linesave.co.uk/google_search_engine.html

This talk gives a lot more detail. Summary in the blog post, slides in the PDF. Google's search is entirely in-memory. They load off disk and run.

http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsdm-2009.html
http://research.google.com/people/jeff/WSDM09-keynote.pdf

How big will your system be? Does it require real-time updates?

wunder
--
Walter Underwood
Lead Engineer, MarkLogic
RE: Very very large scale Solr Deployment = how to do (Expert Question)?
You can't view it online, but you should be able to download it from:

https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP

Enjoy,
Ephraim Ofir

-Original Message-
From: Jens Mueller [mailto:supidupi...@googlemail.com]
Sent: Thursday, April 07, 2011 8:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Hello Ephraim, hello Lance, hello Walter,

thanks for your replies:

Ephraim, thanks very much for the further detailed explanation. I will try to set up a demo system in the next few days and use your advice. Load balancers are an important aspect of your design. Can you recommend one LB specifically? (I would be using haproxy.1wt.eu.) I think the idea of uploading your document is very good. However, Google Docs did not seem to be working (at least for me with the docx format?), but maybe you can simply output the document as PDF and then I think Google Docs will work, so all the others can also have a look at your concept. The best approach would be if you could upload your advice directly somewhere to the Solr wiki, as it is really helpful. I found some other documents meanwhile, but yours is much clearer and more complete, with the LBs and the aggregators (http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf).

Lance, thanks, I will have a look at what LinkedIn is doing.

Walter, thanks for the advice: Well, you are right, mentioning Google. My question was also to understand how such large systems like Google/Facebook are actually working. So my numbers are just theoretical and made up. My system will be smaller, but I would be very happy to understand how such large systems are built, and I think the approach Ephraim showed should be working quite well at large scale. If you know good documents (besides the Bigtable research paper that I already know) that technically describe how Google is working in detail, that would be of great interest. You seem to be working for a company that handles large datasets. Does Google use this approach, sharing the index into N writers, and the produced index is then replicated to N "read only searchers"?

thank you all.
best regards
jens

2011/4/7 Walter Underwood

> The bigger answer is that you cannot get to this size by just configuring
> Solr. You may have to invent a lot of stuff. Like all of Google.
>
> Where did you get these numbers? The proposed query rate is twice as big as
> Google (Feb 2010 estimate, 34K qps).
>
> I work at MarkLogic, and we scale to 100's of terabytes, with fast update
> and query rates. If you want a real system that handles that, you might want
> to look at our product.
>
> wunder
>
> On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:
>
> > I would not use replication. LinkedIn consumer search is a flat system
> > where one process indexes new entries and does queries simultaneously.
> > It's a custom Lucene app called Zoie. Their stuff is on GitHub.
> >
> > I would get documents to indexers via a multicast IP-based queueing
> > system. This scales very well and there's a lot of hardware support.
> >
> > The problem with distributed search is that it is a) inherently slower
> > and b) has inherently more and longer jitter. The "airplane wing"
> > distribution of query times becomes longer and flatter.
> >
> > This is going to have to be a "federated" system, where the front-end
> > app aggregates results rather than Solr.
> >
> > On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller wrote:
> >> Hello Experts,
> >>
> >> I am a Solr newbie but read quite a lot of docs. I still do not understand
> >> what would be the best way to set up very large scale deployments:
> >>
> >> Goal (theoretical):
> >>
> >> A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
> >>
> >> B) Queries: 10 Queries/ per Second
> >>
> >> C) Updates: 10 Updates / per Second
> >>
> >> Solr offers:
> >>
> >> 1.) Replication => Scales well for B) BUT A) and C) are not satisfied
> >>
> >> 2.) Sharding => Scales well for A) BUT B) and C) are not satisfied (=> As
> >> I understand the sharding approach, all goes through a central server that
> >> dispatches the updates and assembles the queries retrieved from the different
> >> shards. But this central server also has some capacity limits...)
> >>
> >> What is the right approach to handle such large deployments? I would be
> >> thankful for just a rough sketch of the concepts so I can experiment/search
> >> further...
> >>
> >> Maybe I am missing something very trivial as I think some of the "Solr
> >> Users/Use Cases" on the homepage are that kind of large deployments. How are
> >> they implemented?
> >>
> >> Thank you very much!!!
> >>
> >> Jens
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Just a quick comment re LinkedIn's stuff. You can look at Zoie (also covered in Lucene in Action 2), but you may be more interested in Sensei. And yes, big systems like that need sharding and replication, multiple masters and lots of slaves.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Hello Ephraim, hello Lance, hello Walter,

thanks for your replies:

Ephraim, thanks very much for the further detailed explanation. I will try to set up a demo system in the next few days and use your advice. Load balancers are an important aspect of your design. Can you recommend one LB specifically? (I would be using haproxy.1wt.eu.) I think the idea of uploading your document is very good. However, Google Docs did not seem to be working (at least for me with the docx format), but maybe you can simply output the document as PDF; I think Google Docs would work then, so everyone else can also have a look at your concept. The best approach would be to upload your advice directly somewhere on the Solr wiki, as it is really helpful. I have found some other documents in the meantime (http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf), but yours is much clearer and more complete, with the LBs and the aggregators.

Lance, thanks, I will have a look at what LinkedIn is doing.

Walter, thanks for the advice. You are right to mention Google. My question was also meant to understand how large systems like Google/Facebook actually work, so my numbers are just theoretical and made up. My system will be smaller, but I would be very happy to understand how such large systems are built, and I think the approach Ephraim showed should work quite well at large scale. If you know of a good document (besides the Bigtable research paper, which I already know) that technically describes in detail how Google works, that would be of great interest. You seem to be working for a company that handles large datasets. Does Google use this approach, sharding the index across N writers, with the produced index then replicated to N "read only searchers"?

thank you all.
best regards
jens
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
The bigger answer is that you cannot get to this size by just configuring Solr. You may have to invent a lot of stuff. Like all of Google.

Where did you get these numbers? The proposed query rate is twice as big as Google (Feb 2010 estimate, 34K qps).

I work at MarkLogic, and we scale to 100's of terabytes, with fast update and query rates. If you want a real system that handles that, you might want to look at our product.

wunder
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
I would not use replication. LinkedIn consumer search is a flat system where one process indexes new entries and does queries simultaneously. It's a custom Lucene app called Zoie. Their stuff is on GitHub.

I would get documents to indexers via a multicast IP-based queueing system. This scales very well and there's a lot of hardware support.

The problem with distributed search is that it is a) inherently slower and b) has inherently more and longer jitter. The "airplane wing" distribution of query times becomes longer and flatter.

This is going to have to be a "federated" system, where the front-end app aggregates results rather than Solr.

On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller wrote:
> Hello Experts,
>
> I am a Solr newbie but read quite a lot of docs. I still do not understand
> what would be the best way to set up very large scale deployments:
>
> Goal (theoretical):
>
> A) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
> B) Queries: 10 Queries / per Second
> C) Updates: 10 Updates / per Second
>
> Solr offers:
>
> 1.) Replication => Scales well for B) BUT A) and C) are not satisfied
>
> 2.) Sharding => Scales well for A) BUT B) and C) are not satisfied (=> As
> I understand the sharding approach, everything goes through a central server
> that dispatches the updates and assembles the queries retrieved from the
> different shards. But this central server also has some capacity limits...)
>
> What is the right approach to handle such large deployments? I would be
> thankful for just a rough sketch of the concepts so I can experiment/search
> further...
>
> Maybe I am missing something very trivial, as I think some of the "Solr
> Users/Use Cases" on the homepage are that kind of large deployment. How are
> they implemented?
>
> Thank you very much!!!
>
> Jens

--
Lance Norskog
goks...@gmail.com
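To get a feel for goal A) in the quoted question, the stated figures can be turned into a document count. This is only back-of-the-envelope arithmetic, assuming decimal units (1 PB = 10^15 bytes, 5 KB = 5 x 10^3 bytes) and ignoring any index overhead; the per-shard capacity D_shard is a hypothetical symbol introduced here, not a number from the thread:

```latex
\[
N_{\text{docs}} \approx \frac{10^{15}\ \text{bytes}}{5 \times 10^{3}\ \text{bytes/doc}}
               = 2 \times 10^{11}\ \text{documents},
\qquad
N_{\text{shards}} \approx \left\lceil \frac{N_{\text{docs}}}{D_{\text{shard}}} \right\rceil
\]
```

With a purely illustrative D_shard of 10^8 documents per shard, this would come to about 2000 shards, before any replicas are added for query throughput.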
RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hi all,

I'd love to share the diagram, just not sure how to do that on the list (it's a Word document I tried to send as an attachment).

Jens, to answer your questions:

1. Correct, in our setup the source of the data is a DB from which we pull the data using DIH (search the list for my previous post "DIH - deleting documents, high performance (delta) imports, and passing parameters" if you want info about that). We were lucky enough to have the data sharded at the DB level before we started using Solr, so using the same shards was an easy extension. Note that we're not (yet...) using SolrCloud, it was just something I thought you should consider.

2. I got the idea for the "aggregator" from the Solr book (PACKT). I don't remember if that term was used in the book or if I made it up (if Google doesn't know it, I probably made it up...), but I think it conveys what this part of the puzzle does. As you said, this is simply a Solr instance which doesn't hold its own index, but shares the same schema as the slaves and masters. I actually defined the default query handler on this instance to include the shards parameter (see below), so the client doesn't have to know anything about the internal workings of the sharded setup; it just hits the aggregator load balancer with a regular query and everything is handled behind the scenes. This simplifies the client and allows me to change the architecture in the future (e.g. change the number of shards or their DNS names) without requiring a client change.

Sharded query handler: explicit ${slaveUrls:null}

All of our Solr instances share the same configs (solrconfig.xml, schema.xml, etc.) and different instances take different roles according to properties defined in solr.xml, which is generated by a script specifically for each Solr instance (the script has a "map" of which instances should be on which host, and has to be run once on each host).
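The sharded query handler mentioned above survives in the archive only as the two values "explicit" and "${slaveUrls:null}". A minimal sketch of what such a handler could look like in solrconfig.xml follows, assuming a standard solr.SearchHandler with those two values as the echoParams and shards defaults. The handler name "search" and the example host names in the comment are assumptions; this is not necessarily the configuration Ephraim actually used.

```xml
<!-- solrconfig.xml (sketch): default search handler on an aggregator instance.
     The handler name "search" is an assumption. slaveUrls is the property
     named in the message; it would hold a comma-separated shard list such as
     shard1-lb:8983/solr,shard2-lb:8983/solr (hypothetical hosts). -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <!-- Fan every query out to the listed shards. If the slaveUrls property
         is not defined, the literal default value "null" is substituted. -->
    <str name="shards">${slaveUrls:null}</str>
  </lst>
</requestHandler>
```

Because the shards parameter sits in the handler defaults, a client could send a plain query to the aggregator without knowing the shard layout. The shard list should point only at instances that hold an index, not back at the aggregator itself.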
In this case, this is how the generated solr.xml looks:

-- just a name that appears in Solr management, to make it easier to know which instance you're on
-- this tells the instance it is an aggregator, so it should use the sharded request handler by default; masters and slaves have master/slave accordingly to define replication, a regular default search handler for slaves, and DIH on masters
-- this is used by instances which are shards in order to determine which DB they should import from (masters) and which master they should replicate from (slaves)
-- used by the sharded request handler
-- used by load balancer to know if this instance is alive
-- just one core for this instance; indexers have 2 cores, one prod and one for full reindex

Let me know if I can assist any further.

Ephraim Ofir
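The solr.xml that Ephraim describes in this message lost its markup in the archive, so only the comments remain. As a rough illustration of the idea (per-instance properties in solr.xml driving a shared solrconfig.xml), here is a sketch in the legacy multicore solr.xml format. Apart from slaveUrls, every property name, value and host below is hypothetical and chosen only for this example.

```xml
<!-- solr.xml (sketch, legacy multicore format). Only slaveUrls is named in
     the message; all other property names and all values are hypothetical. -->
<solr persistent="false">
  <!-- display name, to make it easier to know which instance you're on -->
  <property name="instanceName" value="aggregator-1"/>
  <!-- role of this instance: aggregator, master or slave -->
  <property name="instanceRole" value="aggregator"/>
  <!-- shard list consumed by the sharded request handler -->
  <property name="slaveUrls" value="shard1-lb:8983/solr,shard2-lb:8983/solr"/>
  <cores adminPath="/admin/cores">
    <!-- just one core for this instance -->
    <core name="core0" instanceDir="."/>
  </cores>
</solr>
```

The shared solrconfig.xml can then refer to these values with the ${name:default} substitution syntax, so the same config files behave differently on each instance.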
Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
I third that request. Would greatly appreciate taking a look at that diagram!

Regards,
Jonathan
Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hi Ephraim/Jen,

Can you share that diagram with all? It may really help all of us.

Thanks,
Isan Fulia.
RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hi Jen,

Can you please forward the diagram attachment too that Ephraim sent. :-)

Thanks,
Tirthankar
Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hello Ephraim,

thank you so much for the great document and scaling concept! First, I think you really should publish this on the Solr wiki. This approach is not documented anywhere there and is not obvious for newbies, and your document explains it very well.

Please allow me two further questions regarding your document:

1.) Is it correct that by "DB" you mean the original data source of the data that is fed into the Solr "cloud" for searching?

2.) Solr aggregator: this term did not yield any Google results, but it is a very important aspect of your design (and it was the missing piece for me when thinking about Solr architectures). Is it correct that the "aggregators" are simply Tomcat instances with the Solr webapp deployed? These aggregators do not have their own index; they only run the Solr webapp, and I access them via the ?shards= parameter, listing the shards I want to query? (So in the end they aggregate the data of the shards but do not hold data of their own.) This is really an important aspect that is not documented well enough in the Solr documentation.

Thank you very much!
Jens

2011/4/5 Ephraim Ofir
> of course the attachment didn't get to the list, so here it is if you
> want it...
>
> Ephraim Ofir
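To make the aggregator question concrete, here is a minimal sketch of how a query sent to an index-less Solr instance could name the shards it should fan out to. Solr's distributed search parameter is called `shards` and takes a comma-separated list of `host:port/path` entries; the host names, port, and the helper `distributed_query_url` are assumptions made up for illustration, not part of Ephraim's setup.

```python
from urllib.parse import urlencode


def distributed_query_url(aggregator, shards, query, rows=10):
    """Build a /select URL asking one Solr instance to query the given shards.

    aggregator and shards are hypothetical host:port values; in a setup with
    load balancers, each shard entry would typically point at the balancer
    in front of that shard's replicas.
    """
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated shard list; the receiving instance merges the results.
        "shards": ",".join(shards),
    }
    # Keep ':', '/' and ',' unescaped so the URL stays readable.
    return f"http://{aggregator}/solr/select?{urlencode(params, safe=':/,')}"


url = distributed_query_url(
    "aggregator1:8080",
    ["shard1-lb:8080/solr", "shard2-lb:8080/solr"],
    "title:solr",
)
print(url)
```

The instance that receives this request does not need documents of its own: it sends the query to every listed shard and merges the responses.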
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
And if you have control over machine placement, split them across racks so that a power outage on one rack does not take out your search cluster.

François

On Apr 5, 2011, at 3:19 AM, Ephraim Ofir wrote:
> I'm not sure about the scale you're aiming for, but you probably want to
> do both sharding and replication. There's no central server which would
> be the bottleneck. The guidelines should probably be something like:
> 1. Split your index to enough shards so it can keep up with the update
> rate.
> 2. Have enough replicates of each shard master to keep up with the rate
> of queries.
> 3. Have enough aggregators in front of the shard replicates so the
> aggregation doesn't become a bottleneck.
> 4. Make sure you have good load balancing across your system.
>
> Attached is a diagram of the setup we have. You might want to look into
> SolrCloud as well.
>
> Ephraim Ofir
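The rack advice can be sketched as a small placement routine. This is only an illustration under assumed names (`place_replicas`, `shards_lost`, racks labelled "A", "B", "C"); the thread does not describe how anyone actually assigns machines.

```python
def place_replicas(num_shards, replicas_per_shard, racks):
    """Map (shard, replica) -> rack so no shard has two replicas on one rack."""
    if replicas_per_shard > len(racks):
        raise ValueError("need at least as many racks as replicas per shard")
    placement = {}
    for shard in range(num_shards):
        for replica in range(replicas_per_shard):
            # Offsetting by the shard index spreads load evenly over the racks.
            placement[(shard, replica)] = racks[(shard + replica) % len(racks)]
    return placement


def shards_lost(placement, failed_rack):
    """Return the shards whose replicas all sit on the failed rack."""
    racks_by_shard = {}
    for (shard, _replica), rack in placement.items():
        racks_by_shard.setdefault(shard, set()).add(rack)
    return sorted(s for s, used in racks_by_shard.items() if used == {failed_rack})


placement = place_replicas(num_shards=4, replicas_per_shard=2, racks=["A", "B", "C"])
for rack in ["A", "B", "C"]:
    print(rack, shards_lost(placement, rack))
```

With at least two replicas per shard on distinct racks, losing any single rack leaves every shard searchable; with one replica per shard, a rack failure takes the shards on that rack offline.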
RE: Very very large scale Solr Deployment = how to do (Expert Question)?
I'm not sure about the scale you're aiming for, but you probably want to do both sharding and replication. There's no central server which would be the bottleneck. The guidelines should probably be something like:

1. Split your index into enough shards so it can keep up with the update rate.
2. Have enough replicas of each shard master to keep up with the rate of queries.
3. Have enough aggregators in front of the shard replicas so the aggregation doesn't become a bottleneck.
4. Make sure you have good load balancing across your system.

Attached is a diagram of the setup we have. You might want to look into SolrCloud as well.

Ephraim Ofir

-----Original Message-----
From: Jens Mueller [mailto:supidupi...@googlemail.com]
Sent: Tuesday, April 05, 2011 4:25 AM
To: solr-user@lucene.apache.org
Subject: Very very large scale Solr Deployment = how to do (Expert Question)?

Hello Experts,

I am a Solr newbie but have read quite a lot of docs. I still do not understand what would be the best way to set up very large scale deployments:

Goal (theoretical):
A) Index size: 1 petabyte (1 document is about 5 KB in size)
B) Queries: 10 queries per second
C) Updates: 10 updates per second

Solr offers:
1.) Replication => scales well for B), BUT A) and C) are not satisfied
2.) Sharding => scales well for A), BUT B) and C) are not satisfied (=> As I understand the sharding approach, everything goes through a central server that dispatches the updates and assembles the query results retrieved from the different shards. But this central server also has capacity limits...)

What is the right approach to handle such large deployments? I would be thankful for just a rough sketch of the concepts so I can experiment/search further...

Maybe I am missing something very trivial, as I think some of the "Solr Users/Use Cases" on the homepage are that kind of large deployment. How are they implemented?

Thank you very much!!!
Jens
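Ephraim's first three guidelines can be turned into a rough sizing calculation. The sketch below is hypothetical: the function `size_cluster` and every per-node capacity figure are assumptions for illustration and would have to be measured on real hardware. It also assumes that each distributed query touches every shard, so each shard's replica set has to absorb the full query rate.

```python
import math


def size_cluster(update_rate, query_rate,
                 updates_per_master, queries_per_replica, queries_per_aggregator):
    """Estimate node counts from target rates and assumed per-node capacities."""
    # Guideline 1: enough shards (one master each) to keep up with updates.
    shards = math.ceil(update_rate / updates_per_master)
    # Guideline 2: enough replicas per shard to keep up with queries.
    replicas_per_shard = math.ceil(query_rate / queries_per_replica)
    # Guideline 3: enough aggregators so result merging is not the bottleneck.
    aggregators = math.ceil(query_rate / queries_per_aggregator)
    return {
        "shards": shards,
        "replicas_per_shard": replicas_per_shard,
        "aggregators": aggregators,
        "total_nodes": shards * (1 + replicas_per_shard) + aggregators,
    }


# Made-up example figures, not measurements from the thread.
result = size_cluster(update_rate=1000, query_rate=500,
                      updates_per_master=200, queries_per_replica=100,
                      queries_per_aggregator=250)
print(result)
```

Guideline 4 is not modelled here: the load balancers sit between the aggregators and each shard's replicas and are sized separately.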
Very very large scale Solr Deployment = how to do (Expert Question)?
Hello Experts,

I am a Solr newbie but have read quite a lot of docs. I still do not understand what would be the best way to set up very large scale deployments:

Goal (theoretical):
A) Index size: 1 petabyte (1 document is about 5 KB in size)
B) Queries: 10 queries per second
C) Updates: 10 updates per second

Solr offers:
1.) Replication => scales well for B), BUT A) and C) are not satisfied
2.) Sharding => scales well for A), BUT B) and C) are not satisfied (=> As I understand the sharding approach, everything goes through a central server that dispatches the updates and assembles the query results retrieved from the different shards. But this central server also has capacity limits...)

What is the right approach to handle such large deployments? I would be thankful for just a rough sketch of the concepts so I can experiment/search further...

Maybe I am missing something very trivial, as I think some of the "Solr Users/Use Cases" on the homepage are that kind of large deployment. How are they implemented?

Thank you very much!!!
Jens
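As a back-of-the-envelope check on goal A, the figures in the question imply roughly 200 billion documents, which is why a single index cannot hold them. The sketch below assumes decimal units (1 PB = 10^15 bytes, 5 KB = 5,000 bytes) and treats the petabyte as raw document volume. The per-shard target of 100 million documents is a made-up planning figure; the only hard bound used is that Lucene addresses documents with a Java int, so one index tops out at roughly 2.1 billion documents.

```python
PETABYTE = 10**15             # bytes, decimal units (assumption)
DOC_SIZE_BYTES = 5 * 10**3    # "about 5 KB" per document
LUCENE_DOC_LIMIT = 2**31 - 1  # approximate per-index ceiling (Java int doc IDs)


def total_documents(index_bytes, doc_bytes):
    """Number of documents implied by total size and average document size."""
    return index_bytes // doc_bytes


def shards_needed(total_docs, docs_per_shard):
    """Ceiling division: smallest shard count that holds total_docs."""
    return -(-total_docs // docs_per_shard)


docs = total_documents(PETABYTE, DOC_SIZE_BYTES)
min_shards = shards_needed(docs, LUCENE_DOC_LIMIT)
planned_shards = shards_needed(docs, 100_000_000)  # hypothetical per-shard target

print(docs, min_shards, planned_shards)
```

The lower bound only reflects the document-ID limit; memory, disk, and the update rate will usually force a much smaller per-shard size in practice.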