Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Andy
Perfect. Thank you very much.

Andy


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Pascal Coupet
I put a PDF version here:
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B02DHBZQYYT_MmRkZTY0YjQtODJmZS00Mzg0LWJiNTEtOWJjNzViNmNjZjdh&hl=en&authkey=CL2Fq_QG

Zoom it to get a better view.

Pascal


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Andy
Could anyone please post a version of the document in PDF or OpenOffice format?
I'm on Linux so there's no way for me to use MS Word.

Thanks.



Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Albert Vila
Yes, it won't work if you are using OpenOffice. However, it works fine
with Microsoft Word.

Hope it helps.

Albert


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Andy
I can't view the document either -- it showed up empty.

Has anyone succeeded in viewing it?

Andy


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread François Schiettecatte
You might also want to look at the Heritrix crawler:

http://crawler.archive.org/

I have written three crawlers in the past, all for RSS feeds, and it is not
easy. Happy to provide tips and help if you want to go down that route.

François

On Apr 8, 2011, at 1:53 AM, Andrea Campi wrote:

> On Fri, Apr 8, 2011 at 6:23 AM, Jens Mueller 
> wrote:
> 
>> Hello all,
>> 
>> thanks for your generous help.
>> 
>> I think I now know everything:  (What I want to do is to build a web
>> crawler
>> and index the documents found). I will start with the setup as suggested by
>> 
>> 
> Writing a web crawler from scratch is... ambitious.
> Have you looked at Nutch (http://nutch.apache.org/)?  It uses Solr for
> indexing, and it may help you get a head start.
> If you've never used Hadoop before it may take some getting used to, but I
> have helped a customer implement it and helped a couple of their devs
> (medium-seniority) get up to speed, and it didn't take them too long to get
> used to it.
> 
> Andrea



Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Albert Vila
Ephraim, I still can't view the document.

Don't know if I'm doing something wrong, but I downloaded it and it
appears to be empty.

Albert


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-07 Thread Andrea Campi
On Fri, Apr 8, 2011 at 6:23 AM, Jens Mueller wrote:

> Hello all,
>
> thanks for your generous help.
>
> I think I now know everything:  (What I want to do is to build a web
> crawler
> and index the documents found). I will start with the setup as suggested by
>
>
Writing a web crawler from scratch is... ambitious.
Have you looked at Nutch (http://nutch.apache.org/)?  It uses Solr for
indexing, and it may help you get a head start.
If you've never used Hadoop before it may take some getting used to, but I
have helped a customer implement it and helped a couple of their devs
(medium-seniority) get up to speed, and it didn't take them too long to get
used to it.

Andrea


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-07 Thread Jens Mueller
Hello all,

thanks for your generous help.

I think I now know everything I need. (What I want to do is build a web
crawler and index the documents found.) I will start with the setup suggested
by Ephraim (several sharded masters, each with at least one slave for reads,
and some aggregators for querying). This is only a prototype to learn more...
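
To make the aggregator part of that setup concrete, here is a minimal sketch of how a query sent to an aggregator could name the shards to fan out to, using Solr's `shards` request parameter. All host names are hypothetical placeholders (they do not come from this thread), and it assumes each shard's read-only slaves sit behind one load-balancer address.

```python
from urllib.parse import urlencode

# Hypothetical load-balancer addresses, one per shard. Each would front the
# read-only slaves of that shard. None of these names come from the thread.
SHARD_VIPS = [
    "shard1-lb.example.com:8983/solr",
    "shard2-lb.example.com:8983/solr",
    "shard3-lb.example.com:8983/solr",
]


def build_distributed_query(aggregator, query, shard_vips, rows=10):
    """Return a select URL asking the aggregator to fan out over all shards."""
    params = {
        "q": query,
        "rows": rows,
        # Solr's distributed search takes a comma-separated list of shards.
        "shards": ",".join(shard_vips),
    }
    return "http://" + aggregator + "/select?" + urlencode(params)


# Hypothetical aggregator address; it only merges results and holds no data.
url = build_distributed_query(
    "aggregator1.example.com:8983/solr", "title:solr", SHARD_VIPS
)
print(url)
```

Putting a load balancer in front of the aggregators as well would keep any single aggregator from becoming the central bottleneck raised in the original question.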

The Google PDF from Walter is also very interesting. That is something I can
try if I hit the limits of the setup above. But before that, I have to learn
much more about indexing, index building and Solr/Lucene in general.

Thanks again for your help!!
best regards
jens



2011/4/7 Walter Underwood 

> On Apr 6, 2011, at 10:29 PM, Jens Mueller wrote:
>
> > Walter, thanks for the advice. You are right to mention Google. My
> > question was also meant to understand how such large systems like
> > Google/Facebook actually work. So my numbers are just theoretical and made
> > up. My system will be smaller, but I would be very happy to understand how
> > such large systems are built, and I think the approach Ephraim showed
> > should work quite well at large scale.
>
> Understanding what Google does will NOT help you build your engine. Just
> like understanding an F1 race car does not help you build a Toyota Camry. One
> is built for performance only, and requires LOTS of support, the other for
> supportability and stability. Very different engineering goals and designs.
>
> Here is one view of Google's search setup:
> http://www.linesave.co.uk/google_search_engine.html
>
> This talk gives a lot more detail. Summary in the blog post, slides in the
> PDF. Google's search is entirely in-memory. They load off disk and run.
>
> http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsdm-2009.html
> http://research.google.com/people/jeff/WSDM09-keynote.pdf
>
> How big will your system be? Does it require real-time updates?
>
> wunder
> --
> Walter Underwood
> Lead Engineer, MarkLogic
>
>


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-07 Thread Walter Underwood
On Apr 6, 2011, at 10:29 PM, Jens Mueller wrote:

> Walter, thanks for the advice. You are right to mention Google. My
> question was also meant to understand how such large systems like
> Google/Facebook actually work. So my numbers are just theoretical and made
> up. My system will be smaller, but I would be very happy to understand how
> such large systems are built, and I think the approach Ephraim showed
> should work quite well at large scale.

Understanding what Google does will NOT help you build your engine. Just like 
understanding an F1 race car does not help you build a Toyota Camry. One is
built for performance only, and requires LOTS of support, the other for 
supportability and stability. Very different engineering goals and designs.

Here is one view of Google's search setup: 
http://www.linesave.co.uk/google_search_engine.html

This talk gives a lot more detail. Summary in the blog post, slides in the PDF. 
Google's search is entirely in-memory. They load off disk and run.

http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsdm-2009.html
http://research.google.com/people/jeff/WSDM09-keynote.pdf

How big will your system be? Does it require real-time updates?

wunder
--
Walter Underwood
Lead Engineer, MarkLogic



RE: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-07 Thread Ephraim Ofir
You can't view it online, but you should be able to download it from:
https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP

Enjoy,
Ephraim Ofir


-----Original Message-----
From: Jens Mueller [mailto:supidupi...@googlemail.com]
Sent: Thursday, April 07, 2011 8:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Hello Ephraim, hello Lance, hello Walter,

thanks for your replies:

Ephraim, thanks very much for the further detailed explanation. I will
try to set up a demo system in the next few days and use your advice.
Load balancers are an important aspect of your design. Can you recommend
one LB specifically? (I would be using haproxy.1wt.eu.) I think the idea
of uploading your document is very good. However, Google Docs did not
seem to be working (at least for me, with the docx format), but maybe you
can simply output the document as PDF; then I think Google Docs will
work, so all the others can also have a look at your concept. The best
approach would be to upload your advice directly somewhere on the Solr
wiki, as it is really helpful. I found some other documents meanwhile,
but yours is much clearer and more complete, with the LBs and the
aggregators (
http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)

Lance, thanks, I will have a look at what LinkedIn is doing.

Walter, thanks for the advice. You are right to mention Google. My
question was also meant to understand how such large systems like
Google/Facebook actually work. So my numbers are just theoretical and
made up. My system will be smaller, but I would be very happy to
understand how such large systems are built, and I think the approach
Ephraim showed should work quite well at large scale. If you know a good
document (besides the Bigtable research paper, which I already know) that
technically describes how Google works in detail, that would be of great
interest. You seem to be working for a company that handles large
datasets. Does Google use this approach, sharding the index across N
writers, with the produced index then replicated to N "read-only
searchers"?

thank you all.
best regards
jens



2011/4/7 Walter Underwood 

> The bigger answer is that you cannot get to this size by just configuring
> Solr. You may have to invent a lot of stuff. Like all of Google.
>
> Where did you get these numbers? The proposed query rate is twice as big as
> Google (Feb 2010 estimate, 34K qps).
>
> I work at MarkLogic, and we scale to 100's of terabytes, with fast update
> and query rates. If you want a real system that handles that, you might want
> to look at our product.
>
> wunder
>
> On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:
>
> > I would not use replication. LinkedIn consumer search is a flat system
> > where one process indexes new entries and does queries simultaneously.
> > It's a custom Lucene app called Zoie. Their stuff is on GitHub.
> >
> > I would get documents to indexers via a multicast IP-based queueing
> > system. This scales very well and there's a lot of hardware support.
> >
> > The problem with distributed search is that it is a) inherently slower
> > and b) has inherently more and longer jitter. The "airplane wing"
> > distribution of query times becomes longer and flatter.
> >
> > This is going to have to be a "federated" system, where the front-end
> > app aggregates results rather than Solr.
> >
> > On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller

> wrote:
> >> Hello Experts,
> >>
> >>
> >>
> >> I am a Solr newbie but read quite a lot of docs. I still do not understand
> >> what would be the best way to set up very large scale deployments:
> >>
> >> Goal (theoretical):
> >>
> >>  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
> >>
> >>  B) Queries: 10 Queries/ per Second
> >>
> >>  C) Updates: 10 Updates / per Second
> >>
> >> Solr offers:
> >>
> >> 1.) Replication => Scales well for B) BUT A) and C) are not satisfied
> >>
> >> 2.) Sharding => Scales well for A) BUT B) and C) are not satisfied (=> As
> >> I understand the sharding approach, everything goes through a central server
> >> that dispatches the updates and assembles the queries retrieved from the
> >> different shards. But this central server also has some capacity limits...)
> >>
> >> What is the right approach to handle such large deployments? I would be
> >> thankful for just a rough sketch of the concepts so I can experiment/search
> >> further...
> >>
> >> Maybe I am missing something very trivial as I think some of the "Solr
> >> Users/Use Cases" on the homepage are that kind of large deployments. How are
> >> they implemented?
> >>
> >>
> >>
> >> Thanky very much!!!
> >>
> >> Jens
> >>
> >
>
>
>
>
>


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Otis Gospodnetic
Just a quick comment re LinkedIn's stuff.  You can look at Zoie (also covered
in Lucene in Action 2), but you may be more interested in Sensei.

And yes, big systems like that need sharding and replication, multiple masters
and lots of slaves.
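
To make the combination of sharding and replication concrete, here is a minimal Python sketch. The host names, the shard count and the hash-based routing rule are all assumptions made for illustration; nothing in this thread says how any particular deployment routes its documents.

```python
import hashlib
import random

# Hypothetical topology: each shard has one master (takes updates)
# and several slaves (answer queries). All host names are made up.
TOPOLOGY = {
    0: {"master": "master0:8983", "slaves": ["slave0a:8983", "slave0b:8983"]},
    1: {"master": "master1:8983", "slaves": ["slave1a:8983", "slave1b:8983"]},
    2: {"master": "master2:8983", "slaves": ["slave2a:8983", "slave2b:8983"]},
}


def shard_for(doc_id: str, num_shards: int = len(TOPOLOGY)) -> int:
    """Map a document id to a shard with a stable hash (assumed routing rule)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def update_target(doc_id: str) -> str:
    """An update goes to exactly one master: the owner of the document's shard."""
    return TOPOLOGY[shard_for(doc_id)]["master"]


def query_targets(rng: random.Random) -> list:
    """A query must reach every shard, but any one slave per shard will do."""
    return [rng.choice(TOPOLOGY[shard]["slaves"]) for shard in sorted(TOPOLOGY)]
```

Adding shards spreads the update load over more masters, while adding slaves per shard spreads the query load; the two knobs are independent, which is why both are needed.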

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Jens Mueller 
> To: solr-user@lucene.apache.org
> Sent: Thu, April 7, 2011 1:29:40 AM
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert 
>Question)?
> 
> Hello Ephraim, hello Lance, hello Walter,
> 
> thanks for your  replies:
> 
> Ephraim, thanks very much for the further detailed explanation.  I will try
> to setup a demo system in the next few days and use your  advice.
> LoadBalancers are an important aspect of your design. Can you  recommend one
> LB specificallly? (I would be using haproxy.1wt.eu) . I think  the Idea with
> uploading your document is very good. However Google-Docs  seemed not be be
> working (at least for me with the docx format?), but maybe  you can simply
> output the document as PDF and then I think Google Docs is  working, so all
> the others can also have a look at your concept. The best  approach would be
> if you could upload your advice directly somewhere to the  solr wiki as it is
> really helpful.I found some other documents meanwhile, but  yours is much
> clearer and more complete, with the LBs and the Aggregators  (
> http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)
> 
> Lance,  thanks I will have a look at what linkedin is doing.
> 
> Walter, thanks for  the advice: Well you are right, mentioning google. My
> question was also to  understand how such large systems like google/facebook
> are actually working.  So my numbers are just theoretical and made up. My
> system will be  smaller,  but I would be very happy to understand how such
> large systems  are build and I think the approach Ephraim showd should be
> working quite well  at large scale. If you know a good documents (besides the
> bigtable research  paper that I already know) that technically describes how
> google is working  in detail that would be of great interest. You seem to be
> working for a  company that handles large datasets. Does google use this
> approach, sharing  the index into N writers, and the procuded index is then
> replicated to N  "read only searchers"?
> 
> thank you all.
> best  regards
> jens
> 
> 
> 
> 2011/4/7 Walter Underwood 
> 
> >  The bigger answer is that you cannot get to this size by just  configuring
> > Solr. You may have to invent a lot of stuff. Like all of  Google.
> >
> > Where did you get these numbers? The proposed query rate  is twice as big as
> > Google (Feb 2010 estimate, 34K qps).
> >
> > I work at MarkLogic, and we scale to 100's of terabytes, with fast update
> > and query rates. If you want a real system that handles that, you might want
> > to look at our product.
> >
> >  wunder
> >
> > On Apr 6, 2011, at 8:06 PM, Lance Norskog  wrote:
> >
> > > I would not use replication. LinkedIn consumer  search is a flat system
> > > where one process indexes new entries and  does queries simultaneously.
> > > It's a custom Lucene app called Zoie.  Their stuff is on Github..
> > >
> > > I would get documents to  indexers via a multicast IP-based queueing
> > > system. This scales very  well and there's a lot of hardware support.
> > >
> > > The  problem with distributed search is that it is a) inherently slower
> > >  and b) has inherently more and longer jitter. The "airplane wing"
> > >  distribution of query times becomes longer and flatter.
> > >
> >  > This is going to have to be a "federated" system, where the  front-end
> > > app aggregates results rather than Solr.
> >  >
> > > On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller 
> >  wrote:
> > >> Hello Experts,
> > >>
> >  >>
> > >>
> > >> I am a Solr newbie but read quite a  lot of docs. I still do not
> > understand
> > >> what would be  the best way to setup very large scale deployments:
> > >>
> >  >>
> > >>
> > >> Goal (threoretical):
> >  >>
> > >>  A.) Index-Size: 1 Petabyte (1 Document is about  5 KB in Size)
> > >>
> > >>  B) Queries: 10  Queries/ per Second
> > >>
> > >>  C) Updates: 10  Updates / per Second
> > >>
> > >>
> > 

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Jens Mueller
Hello Ephraim, hello Lance, hello Walter,

thanks for your replies:

Ephraim, thanks very much for the further detailed explanation. I will try
to set up a demo system in the next few days and use your advice.
Load balancers are an important aspect of your design. Can you recommend one
LB specifically? (I would be using haproxy.1wt.eu.) I think the idea of
uploading your document is very good. However, Google Docs did not seem to be
working (at least for me, with the docx format), but maybe you can simply
output the document as PDF; I think Google Docs handles that, so all
the others can also have a look at your concept. The best approach would be
to upload your advice directly to the Solr wiki, as it is
really helpful. I found some other documents in the meantime, but yours is much
clearer and more complete, with the LBs and the aggregators (
http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)

Lance, thanks, I will have a look at what LinkedIn is doing.

Walter, thanks for the advice. You are right to mention Google. My
question was also meant to understand how large systems like Google/Facebook
actually work. So my numbers are just theoretical and made up. My
system will be smaller, but I would be very happy to understand how such
large systems are built, and I think the approach Ephraim showed should
work quite well at large scale. If you know of a good document (besides the
Bigtable research paper, which I already know) that describes technically how
Google works in detail, that would be of great interest. You seem to be
working for a company that handles large datasets. Does Google use this
approach, sharding the index across N writers, with the produced index then
replicated to N "read-only searchers"?

thank you all.
best regards
jens



2011/4/7 Walter Underwood 

> The bigger answer is that you cannot get to this size by just configuring
> Solr. You may have to invent a lot of stuff. Like all of Google.
>
> Where did you get these numbers? The proposed query rate is twice as big as
> Google (Feb 2010 estimate, 34K qps).
>
> I work at MarkLogic, and we scale to 100's of terabytes, with fast update
> and query rates. If you want a real system that handles that, you might want
> to look at our product.
>
> wunder
>
> On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:
>
> > I would not use replication. LinkedIn consumer search is a flat system
> > where one process indexes new entries and does queries simultaneously.
> > It's a custom Lucene app called Zoie. Their stuff is on Github..
> >
> > I would get documents to indexers via a multicast IP-based queueing
> > system. This scales very well and there's a lot of hardware support.
> >
> > The problem with distributed search is that it is a) inherently slower
> > and b) has inherently more and longer jitter. The "airplane wing"
> > distribution of query times becomes longer and flatter.
> >
> > This is going to have to be a "federated" system, where the front-end
> > app aggregates results rather than Solr.
> >
> > On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller 
> wrote:
> >> Hello Experts,
> >>
> >>
> >>
> >> I am a Solr newbie but read quite a lot of docs. I still do not
> understand
> >> what would be the best way to setup very large scale deployments:
> >>
> >>
> >>
> >> Goal (threoretical):
> >>
> >>  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
> >>
> >>  B) Queries: 10 Queries/ per Second
> >>
> >>  C) Updates: 10 Updates / per Second
> >>
> >>
> >>
> >>
> >> Solr offers:
> >>
> >> 1.)Replication => Scales Well for B)  BUT  A) and C) are not
> satisfied
> >>
> >>
> >> 2.)Sharding => Scales well for A) BUT B) and C) are not satisfied
> (=> As
> >> I understand the Sharding approach all goes through a central server,
> that
> >> dispatches the updates and assembles the quries retrieved from the
> different
> >> shards. But this central server has also some capacity limits...)
> >>
> >>
> >>
> >>
> >> What is the right approach to handle such large deployments? I would be
> >> thankfull for just a rough sketch of the concepts so I can
> experiment/search
> >> further…
> >>
> >>
> >> Maybe I am missing something very trivial as I think some of the “Solr
> >> Users/Use Cases” on the homepage are that kind of large deployments. How
> are
> >> they implemented?
> >>
> >>
> >>
> >> Thanky very much!!!
> >>
> >> Jens
> >>
> >
>
>
>
>
>


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Walter Underwood
The bigger answer is that you cannot get to this size by just configuring Solr. 
You may have to invent a lot of stuff. Like all of Google.

Where did you get these numbers? The proposed query rate is twice as big as 
Google (Feb 2010 estimate, 34K qps).

I work at MarkLogic, and we scale to 100's of terabytes, with fast update and 
query rates. If you want a real system that handles that, you might want to 
look at our product.

wunder

On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:

> I would not use replication. LinkedIn consumer search is a flat system
> where one process indexes new entries and does queries simultaneously.
> It's a custom Lucene app called Zoie. Their stuff is on Github..
> 
> I would get documents to indexers via a multicast IP-based queueing
> system. This scales very well and there's a lot of hardware support.
> 
> The problem with distributed search is that it is a) inherently slower
> and b) has inherently more and longer jitter. The "airplane wing"
> distribution of query times becomes longer and flatter.
> 
> This is going to have to be a "federated" system, where the front-end
> app aggregates results rather than Solr.
> 
> On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller  
> wrote:
>> Hello Experts,
>> 
>> 
>> 
>> I am a Solr newbie but read quite a lot of docs. I still do not understand
>> what would be the best way to setup very large scale deployments:
>> 
>> 
>> 
>> Goal (threoretical):
>> 
>>  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
>> 
>>  B) Queries: 10 Queries/ per Second
>> 
>>  C) Updates: 10 Updates / per Second
>> 
>> 
>> 
>> 
>> Solr offers:
>> 
>> 1.)Replication => Scales Well for B)  BUT  A) and C) are not satisfied
>> 
>> 
>> 2.)Sharding => Scales well for A) BUT B) and C) are not satisfied (=> As
>> I understand the Sharding approach all goes through a central server, that
>> dispatches the updates and assembles the quries retrieved from the different
>> shards. But this central server has also some capacity limits...)
>> 
>> 
>> 
>> 
>> What is the right approach to handle such large deployments? I would be
>> thankfull for just a rough sketch of the concepts so I can experiment/search
>> further…
>> 
>> 
>> Maybe I am missing something very trivial as I think some of the “Solr
>> Users/Use Cases” on the homepage are that kind of large deployments. How are
>> they implemented?
>> 
>> 
>> 
>> Thanky very much!!!
>> 
>> Jens
>> 
> 






Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Lance Norskog
I would not use replication. LinkedIn consumer search is a flat system
where one process indexes new entries and does queries simultaneously.
It's a custom Lucene app called Zoie. Their stuff is on GitHub.

I would get documents to indexers via a multicast IP-based queueing
system. This scales very well and there's a lot of hardware support.

The problem with distributed search is that it is a) inherently slower
and b) has inherently more and longer jitter. The "airplane wing"
distribution of query times becomes longer and flatter.

This is going to have to be a "federated" system, where the front-end
app aggregates results rather than Solr.
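
As a rough illustration of what "the front-end app aggregates results" could mean, the following Python sketch merges per-shard hit lists into one global top-k list. The `(score, doc_id)` tuple layout and the function name are assumptions for this example, not part of Solr or Zoie. It also assumes that scores from different shards are comparable, which is only approximately true when each shard computes its own term statistics.

```python
import heapq
from itertools import islice


def merge_top_k(per_shard_hits, k):
    """Merge per-shard hit lists into the k best hits overall.

    per_shard_hits: list of lists of (score, doc_id) tuples, where each
    inner list is already sorted by descending score (as a search node
    would return it).
    """
    # heapq.merge walks all inputs lazily, so only k items are consumed
    # no matter how many hits each shard returned.
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))
```

The front end still has to wait for the slowest shard before it can return a merged page, which is one way to read the remark above about longer and flatter query-time distributions.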

On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller  wrote:
> Hello Experts,
>
>
>
> I am a Solr newbie but read quite a lot of docs. I still do not understand
> what would be the best way to setup very large scale deployments:
>
>
>
> Goal (threoretical):
>
>  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
>
>  B) Queries: 10 Queries/ per Second
>
>  C) Updates: 10 Updates / per Second
>
>
>
>
> Solr offers:
>
> 1.)    Replication => Scales Well for B)  BUT  A) and C) are not satisfied
>
>
> 2.)    Sharding => Scales well for A) BUT B) and C) are not satisfied (=> As
> I understand the Sharding approach all goes through a central server, that
> dispatches the updates and assembles the quries retrieved from the different
> shards. But this central server has also some capacity limits...)
>
>
>
>
> What is the right approach to handle such large deployments? I would be
> thankfull for just a rough sketch of the concepts so I can experiment/search
> further…
>
>
> Maybe I am missing something very trivial as I think some of the “Solr
> Users/Use Cases” on the homepage are that kind of large deployments. How are
> they implemented?
>
>
>
> Thanky very much!!!
>
> Jens
>



-- 
Lance Norskog
goks...@gmail.com


RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Ephraim Ofir
Hi all,
I'd love to share the diagram, just not sure how to do that on the list
(it's a Word document I tried to send as an attachment).

Jens, to answer your questions:
1. Correct, in our setup the source of the data is a DB from which we
pull the data using DIH (search the list for my previous post "DIH -
deleting documents, high performance (delta) imports, and passing
parameters" if you want info about that).  We were lucky enough to have
the data sharded at the DB level before we started using Solr, so using
the same shards was an easy extension.  Note that we're not (yet...)
using SolrCloud, it was just something I thought you should consider.
2. I got the idea for the "aggregator" from the Solr book (PACKT).  I
don't remember if that term was used in the book or if I made it up (if
Google doesn't know it, I probably made it up...), but I think it conveys
what this part of the puzzle does.  As you said, this is simply a Solr
instance which doesn't hold its own index, but shares the same schema as
the slaves and masters.  I actually defined the default query handler on
this instance to include the shards parameter (see below), so the client
doesn't have to know anything about the internal workings of the sharded
setup, it just hits the aggregator load balancer with a regular query
and everything is handled behind the scenes.  This simplifies the client
and allows me to change the architecture in the future (i.e. change the
number of shards or their DNS name) without requiring a client change.

Sharded query handler:

  

 
   explicit
   ${slaveUrls:null}
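
The XML markup of the handler appears to have been stripped by the list archive, so only the two values survive. To show what the default `shards` parameter saves the client from, here is a small Python sketch that builds both kinds of request URL. The host names and the `/select` path are assumptions for illustration; `shards` is Solr's distributed search parameter and takes a comma-separated list of `host:port/path` entries.

```python
from urllib.parse import urlencode


def direct_sharded_url(any_node, shard_hosts, query):
    """Without an aggregator default, the client itself must list every shard."""
    params = {"q": query, "shards": ",".join(shard_hosts)}
    # Keep ':', '/' and ',' unescaped so the URL stays readable.
    return f"http://{any_node}/select?{urlencode(params, safe=':/,')}"


def aggregator_url(aggregator_lb, query):
    """With 'shards' set as a handler default, the client sends a plain query."""
    return f"http://{aggregator_lb}/select?{urlencode({'q': query}, safe=':/,')}"
```

For example, `aggregator_url("solr-agg.example.com/solr", "title:solr")` contains no shard list at all, so the number of shards or their DNS names can change without touching the client.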
 
  

All of our Solr instances share the same configs (solrconfig.xml,
schema.xml, etc.) and different instances take different roles according
to properties defined in solr.xml which is generated by a script
specifically for each Solr instance (the script has a "map" of which
instances should be on which host, and has to be run once on each host).
In this case, this is how the generated solr.xml looks:


  -- just a name that appears in Solr management,
  -- to make it easier to know which instance you're on

  -- this tells the instance it is an aggregator,
  -- so it should use the sharded request handler by default;
  -- masters and slaves have master/slave accordingly to define
  -- replication, a regular default search handler for slaves,
  -- and DIH on masters

  -- this is used by instances which are shards in order to determine
  -- which DB they should import from (masters)
  -- and which master they should replicate from (slaves)

  -- used by the sharded request handler

  -- used by load balancer to know if this instance is alive

  -- just one core for this instance;
  -- indexers have 2 cores, one prod and one for full reindex
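
Since the XML elements of the generated solr.xml did not survive the archive, the following Python sketch shows one way such a generator script could look. Everything in it is an assumption made for illustration: the host map, the property names `instanceName`, `role` and `shard`, the core names, and the legacy multicore layout of solr.xml (`<solr>`, `<property>`, `<cores>`, `<core>`), which should be checked against the Solr version in use. Only the `slaveUrls` property name is taken from the handler above.

```python
import xml.etree.ElementTree as ET

# Hypothetical "map" of which instances run on which host, and in which role.
INSTANCE_MAP = {
    "host1": [
        {"instanceName": "agg1", "role": "aggregator",
         "slaveUrls": "slave-lb-0:8983/solr,slave-lb-1:8983/solr"},
        {"instanceName": "master0", "role": "master", "shard": "0"},
    ],
    "host2": [
        {"instanceName": "slave0a", "role": "slave", "shard": "0"},
    ],
}


def build_solr_xml(instance):
    """Render one instance's solr.xml as a string (assumed legacy layout)."""
    root = ET.Element("solr", {"persistent": "false"})
    # Each entry becomes a property that shared config files can
    # reference, e.g. ${slaveUrls:null} in solrconfig.xml.
    for key in ("instanceName", "role", "shard", "slaveUrls"):
        if key in instance:
            ET.SubElement(root, "property", {"name": key, "value": instance[key]})
    cores = ET.SubElement(root, "cores", {"adminPath": "/admin/cores"})
    # Indexers (masters) get a second core for full reindexing.
    core_names = ["prod", "reindex"] if instance["role"] == "master" else ["prod"]
    for core in core_names:
        ET.SubElement(cores, "core", {"name": core, "instanceDir": core})
    return ET.tostring(root, encoding="unicode")


def configs_for_host(host):
    """Run once per host: one solr.xml per instance placed on that host."""
    return {inst["instanceName"]: build_solr_xml(inst) for inst in INSTANCE_MAP[host]}
```

Keeping the role in a per-instance property is what lets every instance share the same solrconfig.xml and schema.xml, as described above.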
   



Let me know if I can assist any further.
Ephraim Ofir


-Original Message-
From: Jonathan DeMello [mailto:demello@googlemail.com] 
Sent: Wednesday, April 06, 2011 8:58 AM
To: solr-user@lucene.apache.org
Cc: Isan Fulia; Tirthankar Chatterjee
Subject: Re: FW: Very very large scale Solr Deployment = how to do
(Expert Question)?

I third that request.

Would greatly appreciate taking a look at that diagram!

Regards,

Jonathan

On Wed, Apr 6, 2011 at 9:12 AM, Isan Fulia 
wrote:

> Hi Ephraim/Jen,
>
> Can u share that diagram with all.It may really help all of us.
> Thanks,
> Isan Fulia.
>
> On 6 April 2011 10:15, Tirthankar Chatterjee
 >wrote:
>
> > Hi Jen,
> > Can you please forward the diagram attachment too that Ephraim sent.
:-)
> > Thanks,
> > Tirthankar
> >
> > -Original Message-
> > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> > Sent: Tuesday, April 05, 2011 10:30 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: FW: Very very large scale Solr Deployment = how to do
> (Expert
> > Question)?
> >
> > Hello Ephraim,
> >
> > thank you so much for the great Document/Scaling-Concept!!
> >
> > First I think you really should publish this on the solr wiki. This
> > approach is nowhere documented there and not really obvious for
newbies
> and
> > your document is great and explains this very well!
> >
> > Please allow me to further questions regarding your document:
> > 1.) Is it correct, that you mean by "DB" the Origin-Data-Source of
the
> data
> > that is fed into the Solr "Cloud"

Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Jonathan DeMello
I third that request.

Would greatly appreciate taking a look at that diagram!

Regards,

Jonathan

On Wed, Apr 6, 2011 at 9:12 AM, Isan Fulia  wrote:

> Hi Ephraim/Jen,
>
> Can u share that diagram with all.It may really help all of us.
> Thanks,
> Isan Fulia.
>
> On 6 April 2011 10:15, Tirthankar Chatterjee  >wrote:
>
> > Hi Jen,
> > Can you please forward the diagram attachment too that Ephraim sent. :-)
> > Thanks,
> > Tirthankar
> >
> > -Original Message-
> > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> > Sent: Tuesday, April 05, 2011 10:30 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: FW: Very very large scale Solr Deployment = how to do
> (Expert
> > Question)?
> >
> > Hello Ephraim,
> >
> > thank you so much for the great Document/Scaling-Concept!!
> >
> > First I think you really should publish this on the solr wiki. This
> > approach is nowhere documented there and not really obvious for newbies
> and
> > your document is great and explains this very well!
> >
> > Please allow me to further questions regarding your document:
> > 1.) Is it correct, that you mean by "DB" the Origin-Data-Source of the
> data
> > that is fed into the Solr "Cloud" for searching?
> >
> > 2.) Solr Aggregator: This term did not yeald any google results, but is a
> > very important aspect of your design (and this was the missing piece for
> me
> > when thinking about solr architectures): Is it cocrrec that the
> > "aggregators" are simply tomcat instances, with the solr webapp deployed?
> > These Aggregators do not have their own index but only run the solr
> webapp
> > and I access them via the ?shard= parameter giving the shards I want to
> > query? (So in the end they aggreate the data of the shards but do not
> have
> > their own data). This is really an important aspect that is not
> documented
> > well enough in the solr documentation.
> >
> > Thank you very much!
> > Jens
> >
> >
> > 2011/4/5 Ephraim Ofir 
> >
> > > of course the attachment didn't get to the list, so here it is if you
> > > want it...
> > >
> > > Ephraim Ofir
> > >
> > >
> > > -Original Message-
> > > From: Ephraim Ofir
> > > Sent: Tuesday, April 05, 2011 10:20 AM
> > > To: 'solr-user@lucene.apache.org'
> > > Subject: RE: Very very large scale Solr Deployment = how to do (Expert
> > > Question)?
> > >
> > > I'm not sure about the scale you're aiming for, but you probably want
> > > to do both sharding and replication.  There's no central server which
> > > would be the bottleneck. The guidelines should probably be something
> > like:
> > > 1. Split your index to enough shards so it can keep up with the update
> > > rate.
> > > 2. Have enough replicates of each shard master to keep up with the
> > > rate of queries.
> > > 3. Have enough aggregators in front of the shard replicates so the
> > > aggregation doesn't become a bottleneck.
> > > 4. Make sure you have good load balancing across your system.
> > >
> > > Attached is a diagram of the setup we have.  You might want to look
> > > into SolrCloud as well.
> > >
> > > Ephraim Ofir
> > >
> > >
> > > -Original Message-
> > > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> > > Sent: Tuesday, April 05, 2011 4:25 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Very very large scale Solr Deployment = how to do (Expert
> > > Question)?
> > >
> > > Hello Experts,
> > >
> > >
> > >
> > > I am a Solr newbie but read quite a lot of docs. I still do not
> > > understand what would be the best way to setup very large scale
> > > deployments:
> > >
> > >
> > >
> > > Goal (threoretical):
> > >
> > >  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
> > >
> > >  B) Queries: 10 Queries/ per Second
> > >
> > >  C) Updates: 10 Updates / per Second
> > >
> > >
> > >
> > >
> > > Solr offers:
> > >
> > > 1.)Replication => Scales Well for B)  BUT  A) and C) are not
> > > satisfied
> > >
> > >
> > > 2.)Sharding => Scales well for A) BUT B) and C) are not satisfied
> > > (=> As

Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Isan Fulia
Hi Ephraim/Jen,

Can u share that diagram with all.It may really help all of us.
Thanks,
Isan Fulia.

On 6 April 2011 10:15, Tirthankar Chatterjee wrote:

> Hi Jen,
> Can you please forward the diagram attachment too that Ephraim sent. :-)
> Thanks,
> Tirthankar
>
> -Original Message-
> From: Jens Mueller [mailto:supidupi...@googlemail.com]
> Sent: Tuesday, April 05, 2011 10:30 PM
> To: solr-user@lucene.apache.org
> Subject: Re: FW: Very very large scale Solr Deployment = how to do (Expert
> Question)?
>
> Hello Ephraim,
>
> thank you so much for the great Document/Scaling-Concept!!
>
> First I think you really should publish this on the solr wiki. This
> approach is nowhere documented there and not really obvious for newbies and
> your document is great and explains this very well!
>
> Please allow me to further questions regarding your document:
> 1.) Is it correct, that you mean by "DB" the Origin-Data-Source of the data
> that is fed into the Solr "Cloud" for searching?
>
> 2.) Solr Aggregator: This term did not yeald any google results, but is a
> very important aspect of your design (and this was the missing piece for me
> when thinking about solr architectures): Is it cocrrec that the
> "aggregators" are simply tomcat instances, with the solr webapp deployed?
> These Aggregators do not have their own index but only run the solr webapp
> and I access them via the ?shard= parameter giving the shards I want to
> query? (So in the end they aggreate the data of the shards but do not have
> their own data). This is really an important aspect that is not documented
> well enough in the solr documentation.
>
> Thank you very much!
> Jens
>
>
> 2011/4/5 Ephraim Ofir 
>
> > of course the attachment didn't get to the list, so here it is if you
> > want it...
> >
> > Ephraim Ofir
> >
> >
> > -----Original Message-
> > From: Ephraim Ofir
> > Sent: Tuesday, April 05, 2011 10:20 AM
> > To: 'solr-user@lucene.apache.org'
> > Subject: RE: Very very large scale Solr Deployment = how to do (Expert
> > Question)?
> >
> > I'm not sure about the scale you're aiming for, but you probably want
> > to do both sharding and replication.  There's no central server which
> > would be the bottleneck. The guidelines should probably be something
> like:
> > 1. Split your index to enough shards so it can keep up with the update
> > rate.
> > 2. Have enough replicates of each shard master to keep up with the
> > rate of queries.
> > 3. Have enough aggregators in front of the shard replicates so the
> > aggregation doesn't become a bottleneck.
> > 4. Make sure you have good load balancing across your system.
> >
> > Attached is a diagram of the setup we have.  You might want to look
> > into SolrCloud as well.
> >
> > Ephraim Ofir
> >
> >
> > -Original Message-
> > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> > Sent: Tuesday, April 05, 2011 4:25 AM
> > To: solr-user@lucene.apache.org
> > Subject: Very very large scale Solr Deployment = how to do (Expert
> > Question)?
> >
> > Hello Experts,
> >
> >
> >
> > I am a Solr newbie but read quite a lot of docs. I still do not
> > understand what would be the best way to setup very large scale
> > deployments:
> >
> >
> >
> > Goal (threoretical):
> >
> >  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
> >
> >  B) Queries: 10 Queries/ per Second
> >
> >  C) Updates: 10 Updates / per Second
> >
> >
> >
> >
> > Solr offers:
> >
> > 1.)Replication => Scales Well for B)  BUT  A) and C) are not
> > satisfied
> >
> >
> > 2.)Sharding => Scales well for A) BUT B) and C) are not satisfied
> > (=> As
> > I understand the Sharding approach all goes through a central server,
> > that dispatches the updates and assembles the quries retrieved from
> > the different shards. But this central server has also some capacity
> > limits...)
> >
> >
> >
> >
> > What is the right approach to handle such large deployments? I would
> > be thankfull for just a rough sketch of the concepts so I can
> > experiment/search further...
> >
> >
> > Maybe I am missing something very trivial as I think some of the "Solr
> > Users/Use Cases" on the homepage are that kind of large deployments.
> > How are they implemented?
> >
> >
> >
> > Thanky very much!!!
> >
> > Jens
> >
> **Legal Disclaimer***
> "This communication may contain confidential and privileged
> material for the sole use of the intended recipient. Any
> unauthorized review, use or distribution by others is strictly
> prohibited. If you have received the message in error, please
> advise the sender by reply email and delete the message. Thank
> you."
> *
>



-- 
Thanks & Regards,
Isan Fulia.


RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Tirthankar Chatterjee
Hi Jen,
Can you please forward the diagram attachment too that Ephraim sent. :-)
Thanks,
Tirthankar 

-Original Message-
From: Jens Mueller [mailto:supidupi...@googlemail.com] 
Sent: Tuesday, April 05, 2011 10:30 PM
To: solr-user@lucene.apache.org
Subject: Re: FW: Very very large scale Solr Deployment = how to do (Expert 
Question)?

Hello Ephraim,

thank you so much for the great Document/Scaling-Concept!!

First I think you really should publish this on the solr wiki. This approach is 
nowhere documented there and not really obvious for newbies and your document 
is great and explains this very well!

Please allow me two further questions regarding your document:
1.) Is it correct that by "DB" you mean the original data source of the data
that is fed into the Solr "Cloud" for searching?

2.) Solr Aggregator: This term did not yield any Google results, but it is a very
important aspect of your design (and this was the missing piece for me when
thinking about Solr architectures): Is it correct that the "aggregators" are
simply Tomcat instances with the Solr webapp deployed?
These aggregators do not have their own index but only run the Solr webapp, and
I access them via the ?shards= parameter, giving the shards I want to query? (So
in the end they aggregate the data of the shards but do not have their own
data.) This is really an important aspect that is not documented well enough in
the Solr documentation.

Thank you very much!
Jens


2011/4/5 Ephraim Ofir 

> of course the attachment didn't get to the list, so here it is if you 
> want it...
>
> Ephraim Ofir


Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Jens Mueller
Hello Ephraim,

thank you so much for the great document/scaling concept!

First, I think you really should publish this on the Solr wiki. This approach
is not documented anywhere there and is not really obvious for newbies, and your
document is great and explains it very well!

Please allow me two further questions regarding your document:
1.) Is it correct that by "DB" you mean the original data source of the data
that is fed into the Solr "cloud" for searching?

2.) Solr Aggregator: This term did not yield any Google results, but it is a
very important aspect of your design (and this was the missing piece for me
when thinking about Solr architectures): Is it correct that the
"aggregators" are simply Tomcat instances with the Solr webapp deployed?
These aggregators do not have their own index but only run the Solr webapp,
and I access them via the ?shards= parameter, giving the shards I want to
query? (So in the end they aggregate the data of the shards but do not have
their own data.) This is a really important aspect that is not documented
well enough in the Solr documentation.
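
For illustration, here is a minimal sketch of how such a request to an aggregator could be built. Solr's distributed search takes the shard list in the `shards` request parameter, as comma-separated host:port/path entries without a scheme. The host names, port and query below are hypothetical, and the helper name is my own, not part of Solr.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def distributed_query_url(aggregator, shards, query, rows=10):
    """Build a /select URL asking `aggregator` to fan the query out to `shards`."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated shard addresses, each given as host:port/path.
        "shards": ",".join(shards),
    }
    return f"http://{aggregator}/select?{urlencode(params)}"


# Hypothetical hosts: one index-less aggregator in front of two shards.
url = distributed_query_url(
    "aggregator1:8080/solr",
    ["shard1:8080/solr", "shard2:8080/solr"],
    "title:lucene",
)
print(url)
# Decode the query string again to inspect what the aggregator receives.
print(parse_qs(urlparse(url).query)["shards"])
```

The aggregator itself needs no documents of its own in this sketch; it only receives the request, queries the listed shards and merges their results.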

Thank you very much!
Jens


2011/4/5 Ephraim Ofir 

> of course the attachment didn't get to the list, so here it is if you
> want it...
>
> Ephraim Ofir


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread François Schiettecatte
And if you have control over machine placement, split them across racks so that 
a power outage on one rack does not take out your search cluster.

François


RE: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Ephraim Ofir
I'm not sure about the scale you're aiming for, but you probably want to
do both sharding and replication.  There's no central server which would
be the bottleneck. The guidelines should probably be something like:
1. Split your index into enough shards so it can keep up with the update
rate.
2. Have enough replicas of each shard master to keep up with the rate
of queries.
3. Have enough aggregators in front of the shard replicas so the
aggregation doesn't become a bottleneck.
4. Make sure you have good load balancing across your system.
4. Make sure you have good load balancing across your system.

Attached is a diagram of the setup we have.  You might want to look into
SolrCloud as well.
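
A rough sketch of how guidelines 2 to 4 fit together, assuming a hypothetical topology table (the shard and host names are made up, and in a real setup a load balancer would normally do the picking): for each query, one replica per shard is chosen, and the resulting list is what an aggregator would be given to query.

```python
import random

# Hypothetical topology: every shard is served by several query replicas.
TOPOLOGY = {
    "shard1": ["s1-replica-a:8080/solr", "s1-replica-b:8080/solr"],
    "shard2": ["s2-replica-a:8080/solr", "s2-replica-b:8080/solr"],
}


def pick_replicas(topology, rng=random):
    """Choose one replica per shard so that query load spreads over the replicas."""
    return [rng.choice(replicas) for _, replicas in sorted(topology.items())]


def shards_value(topology, rng=random):
    """Join the chosen replicas into one comma-separated shard list."""
    return ",".join(pick_replicas(topology, rng))


# Each call may return a different combination of replicas.
print(shards_value(TOPOLOGY))
```

Adding replicas to a shard only means adding entries to its list, and any number of aggregators can use the same table, so no single machine has to see every query.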

Ephraim Ofir


Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-04 Thread Jens Mueller
Hello Experts,



I am a Solr newbie but have read quite a lot of docs. I still do not understand
what would be the best way to set up very large scale deployments:



Goal (theoretical):

 A.) Index size: 1 petabyte (1 document is about 5 KB in size)

 B.) Queries: 10 queries per second

 C.) Updates: 10 updates per second




Solr offers:

1.) Replication => Scales well for B), BUT A) and C) are not satisfied


2.) Sharding => Scales well for A), BUT B) and C) are not satisfied (=> As
I understand the sharding approach, everything goes through a central server that
dispatches the updates and assembles the query results retrieved from the different
shards. But this central server also has some capacity limits...)




What is the right approach to handle such large deployments? I would be
thankful for just a rough sketch of the concepts so I can experiment/search
further…


Maybe I am missing something very trivial, as I think some of the “Solr
Users/Use Cases” on the homepage are large deployments of that kind. How are
they implemented?



Thank you very much!

Jens
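
As a back-of-the-envelope check of goal A), here is a small sketch of the arithmetic. It assumes decimal units (1 petabyte = 10^15 bytes, 5 KB = 5,000 bytes) and treats the index size as roughly the sum of the document sizes. The per-shard capacity is a purely hypothetical planning number, not a Solr limit.

```python
INDEX_BYTES = 10**15    # 1 petabyte, decimal units (assumption)
DOC_BYTES = 5 * 10**3   # about 5 KB per document


def document_count(index_bytes=INDEX_BYTES, doc_bytes=DOC_BYTES):
    """Number of documents that make up the index."""
    return index_bytes // doc_bytes


def shards_needed(docs, docs_per_shard):
    """Shards required if each shard holds at most `docs_per_shard` documents."""
    return (docs + docs_per_shard - 1) // docs_per_shard  # integer ceiling


docs = document_count()
print(docs)                               # 200000000000 documents
print(shards_needed(docs, 100_000_000))   # 2000 shards at a hypothetical 100M docs per shard
```

So the goal amounts to about 2 x 10^11 documents, and the number of shards follows directly from whatever a single shard turns out to handle in your own tests.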