Re: multicore for 20k users?
2009/5/17 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com:
> A few questions,
> 1) what is the frequency of inserts?

A few per day per user at MOST.

> 2) how many cores need to be up and running at any given point?

That depends on the people. I would love to be able to tie it to their webapp session; maybe 100 at once? No idea, really.

Thank you,
Chris

> On Mon, May 18, 2009 at 3:23 AM, Chris Cornell srchn...@gmail.com wrote:
>> Trying to create a search solution for about 20k users at a company. Each
>> person's documents are private and different (some overlap... it would be
>> nice to not have to store/index copies). Is multicore something that would
>> work, or should we auto-insert a facet into each query generated by the
>> person? Thanks for any advice, I am very new to Solr. Any tiny push in the
>> right direction would be appreciated. Thanks, Chris
>
> --
> - Noble Paul | Principal Engineer | AOL | http://aol.com
Re: multicore for 20k users?
Since there is so little overlap, I would look at a core for each user... However, to manage 20K cores, you will not want to use the off-the-shelf core management implementation to maintain these cores. Consider overriding SolrDispatchFilter to initialize a CoreContainer that you manage.

On May 17, 2009, at 10:11 PM, Chris Cornell wrote:
> On Sun, May 17, 2009 at 8:38 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
>> Chris,
>> Yes, disk space is cheap, and with so little overlap you won't gain much by putting everything in a single index. Plus, when each user has a separate index, it's easy to split users and distribute over multiple machines if you ever need to do that, it's easy and fast to completely reindex one user's data without affecting other users, etc.
>>
>> Several years ago I built Simpy at http://www.simpy.com/ that way (but pre-Solr, so it uses Lucene directly) and never regretted it. There are way more than 20K users there with many searches per second and with constant indexing. Each user has an index for bookmarks and an index for notes. Each group has its own index, shared by all group members. The main bookmark search is another index. People search is yet another index. And so on. Single server.
>
> Thank you very much for your insight and experience; it sounds like we shouldn't be thinking about prematurely optimizing this. Has someone actually used multicore this way, though? With thousands of them?
>
> Independently of advice in that regard, I guess our next step is to explore and create some dummy scenarios/tests to try and stress multicore (search latency is not as much of a factor as memory usage is). I'll report back on any conclusions we come to.
>
> Thanks!
> Chris
multicore for 20k users?
Trying to create a search solution for about 20k users at a company. Each person's documents are private and different (some overlap... it would be nice to not have to store/index copies). Is multicore something that would work, or should we auto-insert a facet into each query generated by the person?

Thanks for any advice, I am very new to Solr. Any tiny push in the right direction would be appreciated.

Thanks,
Chris
Re: multicore for 20k users?
How much overlap is there with the 20k users' documents? If you create a separate index for each of them, will you be indexing 90% of the documents 20K times?

How many total documents could an individual user typically see? How many total distinct documents are you talking about?

Is the indexing strategy the same for all users (the same analysis, etc.)?

Is it actually possible to limit visibility by role rather than by user?

I would start with trying to put everything in one index -- if that is not possible, then look at a multi-core option.

On May 17, 2009, at 5:53 PM, Chris Cornell wrote:
> Trying to create a search solution for about 20k users at a company. Each person's documents are private and different (some overlap... it would be nice to not have to store/index copies). Is multicore something that would work, or should we auto-insert a facet into each query generated by the person? Thanks for any advice, I am very new to Solr. Any tiny push in the right direction would be appreciated. Thanks, Chris
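[Editor's note: the single-index option discussed in this message amounts to appending a per-user filter to every query the application issues, typically via Solr's fq (filter query) parameter, which restricts results without affecting relevance scoring. A minimal sketch in plain Java of building such a request URL; the field names `user` and `date_added` are illustrative assumptions, not from any schema in this thread.]

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a Solr select URL that restricts results to a single user. */
public class UserQueryBuilder {

    public static String buildQuery(String baseUrl, String userQuery,
                                    String user, String dateRange)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder(baseUrl);
        sb.append("/select?q=").append(URLEncoder.encode(userQuery, "UTF-8"));
        // fq restricts the result set; Solr caches filter results
        // separately from the main query, so a repeated user filter is cheap.
        sb.append("&fq=").append(URLEncoder.encode("user:" + user, "UTF-8"));
        if (dateRange != null) {
            // optional date restriction, e.g. "[NOW-7DAYS TO NOW]"
            sb.append("&fq=").append(
                URLEncoder.encode("date_added:" + dateRange, "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildQuery("http://localhost:8983/solr",
                "quarterly report", "ralph", "[NOW-7DAYS TO NOW]"));
    }
}
```

The key point is that the application, not the user, adds the `fq` parameters, so a user can never search outside their own document set.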
Re: multicore for 20k users?
Thanks for helping, Ryan.

On Sun, May 17, 2009 at 7:17 PM, Ryan McKinley ryan...@gmail.com wrote:
> how much overlap is there with the 20k user documents?

There are around 20k users, but each one has anywhere from zero to thousands of documents. The final overlap is unknown because there is a current set of documents, but each user will add documents on the fly (it's like their own personal search engine, in a way).

> if you create a separate index for each of them will you be indexing 90% of the documents 20K times?

Probably more like 5-10%.

> How many total documents could an individual user typically see?

The average is around 100 now, but we want them to be able to add more.

> How many total distinct documents are you talking about? Is the indexing strategy the same for all users? (the same analysis etc)

The indexing strategy is the same for each user.

> Is it actually possible to limit visibility by role rather than by user?

No, it has to be by user, since it is a private document set. We just want to save on disk space when there are big documents that are the same across users (based on document checksum).

> I would start with trying to put everything in one index -- if that is not possible, then look at a multi-core option.

OK. Another thing is that we want to allow the user to restrict searches based on when the document was added... if we do share an indexed item and insert some attribute into each query (like user:ralph), then it couldn't have date-added-based search. Unless a field was added like date-added-by-ralph, date-added-by-sally (ugh!). Or maybe disk space is cheap and we should just strive for simplicity?

Thanks,
Chris

On May 17, 2009, at 5:53 PM, Chris Cornell wrote:
> Trying to create a search solution for about 20k users at a company. Each person's documents are private and different (some overlap... it would be nice to not have to store/index copies). Is multicore something that would work, or should we auto-insert a facet into each query generated by the person? Thanks for any advice, I am very new to Solr. Any tiny push in the right direction would be appreciated. Thanks, Chris
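[Editor's note: the per-user date-added problem in this message has a common single-index workaround that avoids fields like date-added-by-ralph: index one lightweight Solr document per (user, file) pair, carrying that user's own date_added, and use the file checksum (as already proposed above) to deduplicate the stored raw files outside the index. A sketch of what that could look like in schema.xml; all field names here are illustrative assumptions, not an actual schema from this thread.]

```xml
<!-- Hypothetical schema.xml fragment: one document per (user, file) pair. -->
<fields>
  <!-- unique key, e.g. "ralph:d41d8cd9..." (user id + file checksum) -->
  <field name="id"         type="string" indexed="true" stored="true" required="true"/>
  <!-- every query gets fq=user:<id> appended by the application -->
  <field name="user"       type="string" indexed="true" stored="true"/>
  <!-- checksum of the raw file, for deduplicating stored copies on disk -->
  <field name="checksum"   type="string" indexed="true" stored="true"/>
  <!-- per-user timestamp, so date-range filters work without per-user field names -->
  <field name="date_added" type="date"   indexed="true" stored="true"/>
  <!-- extracted text; indexed per pair, so shared files still cost index space -->
  <field name="text"       type="text"   indexed="true" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
```

The trade-off: the shared text is still analyzed and indexed once per user who holds the file, so this saves stored-copy disk space, not index space.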
Re: multicore for 20k users?
Chris,

Yes, disk space is cheap, and with so little overlap you won't gain much by putting everything in a single index. Plus, when each user has a separate index, it's easy to split users and distribute over multiple machines if you ever need to do that, it's easy and fast to completely reindex one user's data without affecting other users, etc.

Several years ago I built Simpy at http://www.simpy.com/ that way (but pre-Solr, so it uses Lucene directly) and never regretted it. There are way more than 20K users there, with many searches per second and with constant indexing. Each user has an index for bookmarks and an index for notes. Each group has its own index, shared by all group members. The main bookmark search is another index. People search is yet another index. And so on. Single server.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Chris Cornell srchn...@gmail.com
To: solr-user@lucene.apache.org
Sent: Sunday, May 17, 2009 8:37:44 PM
Subject: Re: multicore for 20k users?

> Thanks for helping, Ryan.
>
> On Sun, May 17, 2009 at 7:17 PM, Ryan McKinley wrote:
>> how much overlap is there with the 20k user documents?
>
> There are around 20k users, but each one has anywhere from zero to thousands of documents. The final overlap is unknown because there is a current set of documents, but each user will add documents on the fly (it's like their own personal search engine, in a way).
>
>> if you create a separate index for each of them will you be indexing 90% of the documents 20K times?
>
> Probably more like 5-10%.
>
>> How many total documents could an individual user typically see?
>
> The average is around 100 now, but we want them to be able to add more.
>
>> How many total distinct documents are you talking about? Is the indexing strategy the same for all users? (the same analysis etc)
>
> The indexing strategy is the same for each user.
>
>> Is it actually possible to limit visibility by role rather than by user?
>
> No, it has to be by user, since it is a private document set. We just want to save on disk space when there are big documents that are the same across users (based on document checksum).
>
>> I would start with trying to put everything in one index -- if that is not possible, then look at a multi-core option.
>
> OK. Another thing is that we want to allow the user to restrict searches based on when the document was added... if we do share an indexed item and insert some attribute into each query (like user:ralph), then it couldn't have date-added-based search. Unless a field was added like date-added-by-ralph, date-added-by-sally (ugh!). Or maybe disk space is cheap and we should just strive for simplicity?
>
> Thanks,
> Chris
Re: multicore for 20k users?
A few questions,
1) what is the frequency of inserts?
2) how many cores need to be up and running at any given point?

On Mon, May 18, 2009 at 3:23 AM, Chris Cornell srchn...@gmail.com wrote:
> Trying to create a search solution for about 20k users at a company. Each person's documents are private and different (some overlap... it would be nice to not have to store/index copies). Is multicore something that would work, or should we auto-insert a facet into each query generated by the person? Thanks for any advice, I am very new to Solr. Any tiny push in the right direction would be appreciated. Thanks, Chris

--
- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: multicore for 20k users?
On Mon, May 18, 2009 at 8:18 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
> Chris,
> As far as I know, AOL is using Solr with lots of cores. What I don't know is how they are handling shutting down of idle cores, which is something you'll need to do if your machine can't handle all cores being open and their data structures being populated at all times. I know I had to do the same for Simpy. :)

We have a custom build of Solr. We do just-in-time automatic loading of cores, and LRU-based unloading of cores when the upper water mark is crossed.

> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Chris Cornell srchn...@gmail.com
> To: solr-user@lucene.apache.org
> Sent: Sunday, May 17, 2009 10:11:10 PM
> Subject: Re: multicore for 20k users?
>
>> On Sun, May 17, 2009 at 8:38 PM, Otis Gospodnetic wrote:
>>> Chris,
>>> Yes, disk space is cheap, and with so little overlap you won't gain much by putting everything in a single index. Plus, when each user has a separate index, it's easy to split users and distribute over multiple machines if you ever need to do that, it's easy and fast to completely reindex one user's data without affecting other users, etc.
>>>
>>> Several years ago I built Simpy at http://www.simpy.com/ that way (but pre-Solr, so it uses Lucene directly) and never regretted it. There are way more than 20K users there with many searches per second and with constant indexing. Each user has an index for bookmarks and an index for notes. Each group has its own index, shared by all group members. The main bookmark search is another index. People search is yet another index. And so on. Single server.
>>
>> Thank you very much for your insight and experience; it sounds like we shouldn't be thinking about prematurely optimizing this. Has someone actually used multicore this way, though? With thousands of them?
>>
>> Independently of advice in that regard, I guess our next step is to explore and create some dummy scenarios/tests to try and stress multicore (search latency is not as much of a factor as memory usage is). I'll report back on any conclusions we come to.
>>
>> Thanks!
>> Chris

--
- Noble Paul | Principal Engineer | AOL | http://aol.com
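[Editor's note: the "just-in-time loading with LRU unloading" scheme described above maps naturally onto Java's LinkedHashMap in access order with a removeEldestEntry hook. A minimal, self-contained sketch; the class and interface names here are illustrative, not AOL's custom Solr code, and a Factory callback stands in for actual SolrCore open/close calls.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Just-in-time core loading with LRU eviction past an upper water mark. */
public class CoreLru<V> {

    /** Stand-in for whatever actually opens and closes a core. */
    public interface Factory<V> {
        V open(String name);
        void close(V core);
    }

    private final Map<String, V> cores;

    public CoreLru(final int upperWaterMark, final Factory<V> factory) {
        // accessOrder=true: iteration order runs least- to most-recently-used
        this.cores = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                if (size() > upperWaterMark) {
                    factory.close(eldest.getValue()); // release the idle core
                    return true;                      // and evict it from the map
                }
                return false;
            }
        };
        this.factory = factory;
    }

    private final Factory<V> factory;

    /** Returns the named core, opening it on first use (just-in-time). */
    public synchronized V get(String name) {
        V core = cores.get(name); // a hit also marks it most-recently-used
        if (core == null) {
            core = factory.open(name);
            cores.put(name, core); // may evict the least-recently-used core
        }
        return core;
    }

    public synchronized int size() {
        return cores.size();
    }
}
```

With 20K users but only ~100 active sessions at once, the water mark bounds open-core memory to roughly the active set while idle users cost only disk.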
Re: multicore for 20k users?
On Sun, May 17, 2009 at 8:38 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
> Chris,
> Yes, disk space is cheap, and with so little overlap you won't gain much by putting everything in a single index. Plus, when each user has a separate index, it's easy to split users and distribute over multiple machines if you ever need to do that, it's easy and fast to completely reindex one user's data without affecting other users, etc.
>
> Several years ago I built Simpy at http://www.simpy.com/ that way (but pre-Solr, so it uses Lucene directly) and never regretted it. There are way more than 20K users there with many searches per second and with constant indexing. Each user has an index for bookmarks and an index for notes. Each group has its own index, shared by all group members. The main bookmark search is another index. People search is yet another index. And so on. Single server.

Thank you very much for your insight and experience; it sounds like we shouldn't be thinking about prematurely optimizing this. Has someone actually used multicore this way, though? With thousands of them?

Independently of advice in that regard, I guess our next step is to explore and create some dummy scenarios/tests to try and stress multicore (search latency is not as much of a factor as memory usage is). I'll report back on any conclusions we come to.

Thanks!
Chris
Re: multicore for 20k users?
Chris,

As far as I know, AOL is using Solr with lots of cores. What I don't know is how they are handling shutting down of idle cores, which is something you'll need to do if your machine can't handle all cores being open and their data structures being populated at all times. I know I had to do the same for Simpy. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Chris Cornell srchn...@gmail.com
To: solr-user@lucene.apache.org
Sent: Sunday, May 17, 2009 10:11:10 PM
Subject: Re: multicore for 20k users?

> On Sun, May 17, 2009 at 8:38 PM, Otis Gospodnetic wrote:
>> Chris,
>> Yes, disk space is cheap, and with so little overlap you won't gain much by putting everything in a single index. Plus, when each user has a separate index, it's easy to split users and distribute over multiple machines if you ever need to do that, it's easy and fast to completely reindex one user's data without affecting other users, etc.
>>
>> Several years ago I built Simpy at http://www.simpy.com/ that way (but pre-Solr, so it uses Lucene directly) and never regretted it. There are way more than 20K users there with many searches per second and with constant indexing. Each user has an index for bookmarks and an index for notes. Each group has its own index, shared by all group members. The main bookmark search is another index. People search is yet another index. And so on. Single server.
>
> Thank you very much for your insight and experience; it sounds like we shouldn't be thinking about prematurely optimizing this. Has someone actually used multicore this way, though? With thousands of them?
>
> Independently of advice in that regard, I guess our next step is to explore and create some dummy scenarios/tests to try and stress multicore (search latency is not as much of a factor as memory usage is). I'll report back on any conclusions we come to.
>
> Thanks!
> Chris