Re: multicore for 20k users?

2009-05-18 Thread Chris Cornell
2009/5/17 Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com:
 A few questions,
 1) what is the frequency of inserts?

A few per day per user at MOST.

 2) how many cores need to be up and running at any given point


That depends on the people.  I would love to be able to tie it to
their webapp session, maybe 100 at once?  No idea, really.

Thank you,
Chris



 On Mon, May 18, 2009 at 3:23 AM, Chris Cornell srchn...@gmail.com wrote:
 Trying to create a search solution for about 20k users at a company.
 Each person's documents are private and different (some overlap... it
 would be nice to not have to store/index copies).

 Is multicore something that would work or should we auto-insert a
 facet into each query generated by the person?

 Thanks for any advice, I am very new to solr.  Any tiny push in the
 right direction would be appreciated.

 Thanks,
 Chris




 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com



Re: multicore for 20k users?

2009-05-18 Thread Ryan McKinley
Since there is so little overlap, I would look at a core for each
user...


However, to manage 20K cores, you will not want to use the
off-the-shelf core management implementation to maintain these cores.
Consider overriding SolrDispatchFilter to initialize a CoreContainer
that you manage.
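With a core per user, requests can be routed by putting the core name into the request path. A minimal sketch of that routing (the `user-<id>` naming scheme and the base URL are assumptions for illustration, not anything specified in this thread):

```python
from urllib.parse import urlencode

def core_name_for(user_id: str) -> str:
    # One core per user, named by a stable, predictable scheme.
    return f"user-{user_id}"

def select_url(solr_base: str, user_id: str, q: str) -> str:
    # Solr addresses a core by name in the request path: /solr/<core>/select
    return f"{solr_base}/{core_name_for(user_id)}/select?{urlencode({'q': q})}"

# select_url("http://localhost:8983/solr", "ralph", "quarterly report")
# -> "http://localhost:8983/solr/user-ralph/select?q=quarterly+report"
```

A custom dispatch layer would resolve the authenticated user to a core name like this before handing the request to the core.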



On May 17, 2009, at 10:11 PM, Chris Cornell wrote:


On Sun, May 17, 2009 at 8:38 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:


Chris,

Yes, disk space is cheap, and with so little overlap you won't gain
much by putting everything in a single index.  Plus, when each user
has a separate index, it's easy to split users and distribute
over multiple machines if you ever need to do that, it's easy and
fast to completely reindex one user's data without affecting other
users, etc.


Several years ago I built Simpy at http://www.simpy.com/ that way  
(but pre-Solr, so it uses Lucene directly) and never regretted it.   
There are way more than 20K users there with many searches per  
second and with constant indexing.  Each user has an index for  
bookmarks and an index for notes.  Each group has its own index,  
shared by all group members.  The main bookmark search is another  
index.  People search is yet another index.  And so on.  Single  
server.




Thank you very much for your insight and experience; it sounds like we
shouldn't be thinking about prematurely optimizing this.

Has someone actually used multicore this way, though?  With  
thousands of them?


Independently of advice in that regard, I guess our next step is to
explore and create some dummy scenarios/tests to try and stress
multicore (search latency is not as much of a factor as memory usage
is).  I'll report back on any conclusion we come to.

Thanks!
Chris




multicore for 20k users?

2009-05-17 Thread Chris Cornell
Trying to create a search solution for about 20k users at a company.
Each person's documents are private and different (some overlap... it
would be nice to not have to store/index copies).

Is multicore something that would work or should we auto-insert a
facet into each query generated by the person?

Thanks for any advice, I am very new to solr.  Any tiny push in the
right direction would be appreciated.

Thanks,
Chris


Re: multicore for 20k users?

2009-05-17 Thread Ryan McKinley

how much overlap is there with the 20k user documents?

if you create a separate index for each of them will you be indexing  
90% of the documents 20K times?  How many total documents could an  
individual user typically see?  How many total distinct documents are  
you talking about?  Is the indexing strategy the same for all users?   
(the same analysis etc)


Is it actually possible to limit visibility by role rather than user?

I would start with trying to put everything in one index -- if that is  
not possible, then look at a multi-core option.
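The single-index alternative Ryan describes amounts to stamping each document with an owner and having the application append a filter query per request. A sketch (the `user` field name is an assumption):

```python
from urllib.parse import urlencode

def user_search_params(user_id: str, q: str) -> str:
    # Single shared index: every document carries a "user" field, and the
    # application appends a filter query (fq) so each person only ever
    # matches their own documents, regardless of what they type in q.
    return urlencode([("q", q), ("fq", f"user:{user_id}")])

# user_search_params("ralph", "report") -> "q=report&fq=user%3Aralph"
```

Because `fq` is applied server-side, the restriction cannot be bypassed by query syntax, and the filter is cached independently of the main query.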




On May 17, 2009, at 5:53 PM, Chris Cornell wrote:


Trying to create a search solution for about 20k users at a company.
Each person's documents are private and different (some overlap... it
would be nice to not have to store/index copies).

Is multicore something that would work or should we auto-insert a
facet into each query generated by the person?

Thanks for any advice, I am very new to solr.  Any tiny push in the
right direction would be appreciated.

Thanks,
Chris




Re: multicore for 20k users?

2009-05-17 Thread Chris Cornell
Thanks for helping Ryan,

On Sun, May 17, 2009 at 7:17 PM, Ryan McKinley ryan...@gmail.com wrote:
 how much overlap is there with the 20k user documents?

There are around 20k users but each one has anywhere from zero to
thousands of documents.  The final overlap is unknown because there is
a current set of documents but each user will add documents on the fly
(it's like their own personal search engine in a way).


 if you create a separate index for each of them will you be indexing 90% of
 the documents 20K times?

Probably more like 5-10%

 How many total documents could an individual user
 typically see?

Average is around 100 now but we want them to be able to add more.

 How many total distinct documents are you talking about?  Is
 the indexing strategy the same for all users?  (the same analysis etc)

The indexing strategy is the same for each user.


 Is it actually possible to limit visibility by role rather than user?

No, it has to be by user since it is a private document set.  We just
want to save on disk space when there are big documents that are the
same across users (based on document checksum).


 I would start with trying to put everything in one index -- if that is not
 possible, then look at a multi-core option.

OK.  Another thing is that we want to allow the user to restrict
searches based on when the document was added... if we share an
indexed item and insert some attribute into each query (like
user:ralph), then it couldn't support date-added-based search.  Unless a
field was added like date-added-by-ralph, date-added-by-sally (ugh!).
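One way out of the per-user date-field problem (a sketch of a design option, not something proposed in the thread): index one lightweight document per (user, file) pair, keyed by the file's checksum, so date-added stays an ordinary single field while the heavy content is stored only once elsewhere. All field names here are illustrative:

```python
def association_doc(user_id: str, checksum: str,
                    date_added: str, title: str) -> dict:
    # One index document per (user, file) association: per-user metadata
    # such as date_added lives here as a normal single-valued field, while
    # the large extracted text can be kept once per checksum and
    # deduplicated at the storage layer.
    return {
        "id": f"{user_id}:{checksum}",  # unique per association
        "user": user_id,                # restricts visibility via fq
        "content_ref": checksum,        # points at the shared content
        "date_added": date_added,       # per-user, so date filters work
        "title": title,
    }
```

With this shape, `fq=user:ralph` plus a range query on `date_added` gives each user their own date-restricted view without per-user field names.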

Or maybe disk space is cheap and we should just strive for simplicity?

Thanks,
Chris




 On May 17, 2009, at 5:53 PM, Chris Cornell wrote:

 Trying to create a search solution for about 20k users at a company.
 Each person's documents are private and different (some overlap... it
 would be nice to not have to store/index copies).

 Is multicore something that would work or should we auto-insert a
 facet into each query generated by the person?

 Thanks for any advice, I am very new to solr.  Any tiny push in the
 right direction would be appreciated.

 Thanks,
 Chris




Re: multicore for 20k users?

2009-05-17 Thread Otis Gospodnetic

Chris,

Yes, disk space is cheap, and with so little overlap you won't gain much by 
putting everything in a single index.  Plus, when each user has a separate 
index, it's easy to split users and distribute over multiple machines if you 
ever need to do that, it's easy and fast to completely reindex one user's data 
without affecting other users, etc.

Several years ago I built Simpy at http://www.simpy.com/ that way (but 
pre-Solr, so it uses Lucene directly) and never regretted it.  There are way 
more than 20K users there with many searches per second and with constant 
indexing.  Each user has an index for bookmarks and an index for notes.  Each 
group has its own index, shared by all group members.  The main bookmark search 
is another index.  People search is yet another index.  And so on.  Single 
server.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: multicore for 20k users?

2009-05-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
A few questions,
1) what is the frequency of inserts?
2) how many cores need to be up and running at any given point



On Mon, May 18, 2009 at 3:23 AM, Chris Cornell srchn...@gmail.com wrote:
 Trying to create a search solution for about 20k users at a company.
 Each person's documents are private and different (some overlap... it
 would be nice to not have to store/index copies).

 Is multicore something that would work or should we auto-insert a
 facet into each query generated by the person?

 Thanks for any advice, I am very new to solr.  Any tiny push in the
 right direction would be appreciated.

 Thanks,
 Chris




-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: multicore for 20k users?

2009-05-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, May 18, 2009 at 8:18 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 Chris,

 As far as I know, AOL is using Solr with lots of cores.  What I don't know is 
 how they are handling the shutting down of idle cores, which is something you'll 
 need to do if your machine can't handle all cores being open and their data 
 structures being populated at all times.  I had to do the same for 
 Simpy. :)

We have a custom build of Solr. We do just-in-time automatic loading
of cores and LRU-based unloading of cores when the upper water mark
is crossed.
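The just-in-time loading and LRU unloading Noble describes can be sketched like this (illustrative only; this is not Solr's CoreContainer API, and the core handle is assumed to expose a `close()` method):

```python
from collections import OrderedDict

class LazyCoreRegistry:
    """Open cores on first request; evict the least-recently-used cores
    once the number of open cores crosses the upper water mark."""

    def __init__(self, open_core, max_open):
        self._open_core = open_core   # callable: core name -> core handle
        self._max_open = max_open     # upper water mark
        self._open = OrderedDict()    # name -> core, least recent first

    def get(self, name):
        core = self._open.pop(name, None)
        if core is None:
            core = self._open_core(name)      # just-in-time load
        self._open[name] = core               # mark as most recently used
        while len(self._open) > self._max_open:
            _, evicted = self._open.popitem(last=False)
            evicted.close()                   # unload the idle core
        return core
```

With an upper water mark around 100, 20K cores can exist on disk while only a bounded number hold memory-resident data structures at any moment, which matches the earlier guess of roughly 100 concurrent users.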

  Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: multicore for 20k users?

2009-05-17 Thread Chris Cornell
On Sun, May 17, 2009 at 8:38 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 Chris,

 Yes, disk space is cheap, and with so little overlap you won't gain much by 
 putting everything in a single index.  Plus, when each user has a separate 
 index, it's easy to split users and distribute over multiple machines if 
 you ever need to do that, it's easy and fast to completely reindex one user's 
 data without affecting other users, etc.

 Several years ago I built Simpy at http://www.simpy.com/ that way (but 
 pre-Solr, so it uses Lucene directly) and never regretted it.  There are way 
 more than 20K users there with many searches per second and with constant 
 indexing.  Each user has an index for bookmarks and an index for notes.  Each 
 group has its own index, shared by all group members.  The main bookmark 
 search is another index.  People search is yet another index.  And so on.  
 Single server.


Thank you very much for your insight and experience; it sounds like we
shouldn't be thinking about prematurely optimizing this.

Has someone actually used multicore this way, though?  With thousands of them?

Independently of advice in that regard, I guess our next step is to
explore and create some dummy scenarios/tests to try and stress
multicore (search latency is not as much of a factor as memory usage
is).  I'll report back on any conclusion we come to.

Thanks!
Chris


Re: multicore for 20k users?

2009-05-17 Thread Otis Gospodnetic

Chris,

As far as I know, AOL is using Solr with lots of cores.  What I don't know is 
how they are handling the shutting down of idle cores, which is something you'll 
need to do if your machine can't handle all cores being open and their data 
structures being populated at all times.  I had to do the same for 
Simpy. :)

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


