Re: feedback on Solr 4.x LotsOfCores feature

2013-10-22 Thread Soyez Olivier
 all cores from the start. Start Solr without any core entries in
 solr.xml, and we will use the cores Auto option to create-and-load, or only
 load, the core on the fly, based on the existence of the core on disk
 (absolute path calculated from the core name).

  Thanks for your interest,

  Olivier
  
Re: feedback on Solr 4.x LotsOfCores feature

2013-10-19 Thread Erick Erickson

Re: feedback on Solr 4.x LotsOfCores feature

2013-10-18 Thread Soyez Olivier
15K cores is around 4 minutes : no network drive, just a spinning disk.
But, one important thing: to simulate a cold start or a useless Linux buffer
cache, I used the following command to empty the Linux buffer cache :
sync && echo 3 > /proc/sys/vm/drop_caches
Then I started Solr and found the result above.



Re: Re: feedback on Solr 4.x LotsOfCores feature

2013-10-11 Thread Erick Erickson
bq: sharing the underlying solrconfig object the configset introduced
in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode

SOLR-4478 will NOT share the underlying config objects, it simply
shares the underlying directory. Each core will, at least as presently
envisioned, simply read the files that exist there and create their
own solrconfig object. Schema objects may be shared, but not config
objects. It may turn out to be relatively easy to do in the configset
situation, but last time I looked at sharing the underlying config
object it was too fraught with problems.

bq: 15K cores is around 4 minutes

I find this very odd. On my laptop, spinning disk, I think I was
seeing 1k cores discovered/sec. You're seeing roughly 16x slower, so I
have no idea what's going on here. If this is just reading the files,
you should be seeing horrible disk contention. Are you on some kind of
networked drive?

bq: To do that in background and to block on that request until core
discovery is complete, should not work for us (due to the worst case).
What other choices are there? Either you have to do it up front or
with some kind of blocking. Hmmm, I suppose you could keep some kind
of custom store (DB? File? ZooKeeper?) that would keep the last known
layout. You'd still have some kind of worst-case situation where the
core you were trying to load wouldn't be in your persistent store and
you'd _still_ have to wait for the discovery process to complete.

bq: and we will use the cores Auto option to create load or only load
the core on
Interesting. I can see how this could all work without any core
discovery but it does require a very specific setup.


Re: Re: feedback on Solr 4.x LotsOfCores feature

2013-10-10 Thread Soyez Olivier
The corresponding patch for Solr 4.2.1 LotsOfCores can be found in SOLR-5316,
including the new Cores options :
- numBuckets, to create a subdirectory based on a hash of the core name %
numBuckets in the core dataDir
- Auto, with 3 different values :
  1) false : default behaviour
  2) createLoad : create the core if it does not exist, and load it on the fly
on the first incoming request (update, select)
  3) onlyLoad : load the core on the fly on the first incoming request (update,
select), if it exists on disk
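For illustration, the numBuckets scheme can be sketched like this (a sketch
only: the function name, the md5 choice, and the base directory are my
assumptions, not the patch's actual code):

```python
import hashlib
from pathlib import PurePosixPath

def bucket_dir(core_name: str, num_buckets: int,
               base_dir: str = "/solr/cores") -> PurePosixPath:
    """Map a core name to a bucketed data directory: hash(corename) % numBuckets
    picks the subdirectory, so thousands of cores never share one flat dir."""
    # A stable hash (md5 of the name) keeps the mapping identical across
    # restarts; Python's built-in hash() is randomized per process.
    h = int(hashlib.md5(core_name.encode("utf-8")).hexdigest(), 16)
    return PurePosixPath(base_dir) / str(h % num_buckets) / core_name
```

The point of the modulo is only to spread directory entries; any stable hash
of the core name would do.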

Concerning :
- sharing the underlying solrconfig object : the configset introduced in JIRA
SOLR-4478 seems to be the solution for non-SolrCloud mode.
We need to test it for our use case. If another solution exists, please tell
me. We are very interested in such functionality, and in contributing if we can.

- the possibility of LotsOfCores in SolrCloud : we don't know in detail how
SolrCloud works.
But one possible limit is the maximum number of entries that can be added to a
zookeeper node.
Maybe a solution would be just a kind of hashing in the zookeeper tree.

- the time to discover cores in Solr 4.4 : with a spinning disk under Linux,
all cores with transient=true and loadOnStartup=false, and the Linux buffer
cache emptied before starting Solr :
15K cores is around 4 minutes. It's linear in the number of cores, so for 50K
it's more than 13 minutes. In fact, it corresponds to the time to read all the
core.properties files.
Doing that in the background, and blocking requests until core discovery is
complete, would not work for us (due to the worst case).
So, we will just disable core discovery, because we don't need to know all
cores from the start. Start Solr without any core entries in solr.xml, and we
will use the cores Auto option to create-and-load or only load the core on the
fly, based on the existence of the core on disk (absolute path calculated from
the core name).
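The linear extrapolation above is easy to check arithmetically:

```python
# Observed: 15,000 cores discovered in ~4 minutes (dominated by reading one
# core.properties file per core from a cold spinning disk).
rate = 15_000 / (4 * 60)             # ~62.5 cores discovered per second
minutes_for_50k = 50_000 / rate / 60  # ~13.3 minutes, "more than 13 minutes"
```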

Thanks for your interest,

Olivier

feedback on Solr 4.x LotsOfCores feature

2013-10-07 Thread Soyez Olivier
Hello,

In my company, we use Solr in production to offer full-text search on
mailboxes.
We host dozens of millions of mailboxes, but only webmail users have this
feature (a few million).
We have the following use case :
- non-static indexes, with more updates (indexing and deleting) than
select requests (ratio 7:1)
- homogeneous configuration for all indexes
- not many simultaneous users

We started to index mailboxes with Solr 1.4 in 2010, on a subset of
400,000 users.
- we had a cluster of 50 servers, 4 Solr instances per server, 2,000 users
per Solr instance
- we grew to 6,000 users per Solr instance, 8 Solr instances per server,
60 GB per index (~2 million users)
- we upgraded to Solr 3.5 in 2012
As the indexes grew, IOPS and response times increased more and more.

The index size was mainly due to stored fields (large .fdt files).
Retrieving these fields from the index was costly, because of many seeks
in large files, with no way to limit this.
There is also an overhead on queries : too many results are filtered out to
keep only the results concerning one user.
For these reasons and others, like users not being pooled, hardware savings,
better scoring, and some requests that do not support filtering, we
decided to use the LotsOfCores feature.

Our goal was to change the I/O usage pattern : from lots of random I/O
accesses on huge segments to mostly sequential I/O accesses on small segments.
For our use case, it's not a big deal that the first query to a not yet
loaded core will be slow.
And we don't need to fit all the cores into memory at once.

We started from the SOLR-1293 issue and the LotsOfCores wiki page, to
finally use a patched Solr 4.2.1 LotsOfCores in production (1 user = 1
core).
We no longer need to run so many Solr instances per node. We are now able to
have around 50,000 cores per Solr instance, and we plan to grow to 100,000
cores per instance.
Initially, we used the solr.xml persistence. All cores have the
loadOnStartup=false and transient=true attributes, so a cold start
is very quick. The response times were better than ever, in comparison
with the poor response times we had before using LotsOfCores.
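In the Solr 4.x legacy solr.xml format, such a persisted core entry would look
roughly like this (core name, paths, and cache size are illustrative, not our
actual configuration):

```xml
<!-- solr.xml (legacy 4.x format): a transient core, not loaded at startup -->
<solr persistent="true">
  <cores adminPath="/admin/cores" transientCacheSize="512">
    <core name="user42" instanceDir="cores/42/user42"
          loadOnStartup="false" transient="true"/>
  </cores>
</solr>
```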

We added 2 Cores options :
- numBuckets, to create a subdirectory based on a hash of the core name
% numBuckets in the core dataDir, because all cores cannot live in the
same directory
- Auto, with 3 different values :
1) false : default behaviour
2) createLoad : create the core if it does not exist, and load it on the fly
on the first incoming request (update, select)
3) onlyLoad : load the core on the fly on the first incoming request
(update, select), if it exists on disk
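The Auto dispatch on the first request for a core can be sketched as follows
(a sketch only: the function and directory handling are hypothetical, not the
patch's actual code):

```python
import os
from typing import Optional

def resolve_core(core_name: str, auto: str, base_dir: str) -> Optional[str]:
    """Sketch of the Auto option on the first request for a core:
    - "false":      default behaviour, no lazy create/load
    - "createLoad": create the core directory if missing, then load it
    - "onlyLoad":   load it only if it already exists on disk"""
    path = os.path.join(base_dir, core_name)  # absolute path from the core name
    if auto == "createLoad":
        os.makedirs(path, exist_ok=True)
        return path                           # stand-in for "core loaded"
    if auto == "onlyLoad":
        return path if os.path.isdir(path) else None
    return None                               # "false": request fails as usual
```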

Then, to improve performance and avoid synchronization in the solr.xml
persistence, we disabled it.
The drawback is that we can no longer see the full list of available cores
with the admin core status command, only those warmed up.
Finally, we can achieve very good performance with Solr LotsOfCores :
- Index 5 emails (avg) + commit + search : 4.9x faster response time
(mean), 5.4x faster (95th percentile)
- Delete 5 documents (avg) : 8.4x faster response time (mean), 7.4x
faster (95th percentile)
- Search : 3.7x faster response time (mean), 4x faster (95th percentile)

In fact, the better performance is mainly due to the small size of each
index, but also thanks to the isolation between cores (updates and
queries on many mailboxes don't have side effects on each other).
One important thing with the LotsOfCores feature is to take care of :
- the number of file descriptors : it uses a lot (need to increase the
global max and the per-process fd limit)
- the value of transientCacheSize, depending on the RAM size and the
allocated PermGen size
- the ClassLoader leak that increases minor GC times when the CMS GC is
enabled (use -XX:+CMSClassUnloadingEnabled)
- the overhead of parsing solrconfig.xml and loading dependencies to open
each core
- LotsOfCores doesn't work with SolrCloud, so we store index
locations outside of Solr. We have Solr proxies to route requests to the
right instance.
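As a configuration sketch, the fd limits and GC flag above translate to
something like the following (the numbers are illustrative, not our
production values):

```shell
# Raise the global and per-process file-descriptor limits
sysctl -w fs.file-max=2000000      # global maximum
ulimit -n 65536                    # per-process limit, in the Solr start script

# Solr 4.x era JVM flags: CMS with class unloading, so transient cores'
# ClassLoaders are collected instead of accumulating in PermGen
JAVA_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512m"
```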

Outside of production, we tried the core discovery feature in Solr 4.4 with
lots of cores.
When you start, it spends a lot of time discovering cores due to the big
number of cores; meanwhile all requests fail (SolrDispatchFilter.init()
not done yet). It would be great to have, for example, an option to run core
discovery in the background, or just to be able to disable it, like we do in
our use case.

If someone is interested in these new options for the LotsOfCores feature,
just tell me.



Re: feedback on Solr 4.x LotsOfCores feature

2013-10-07 Thread Erick Erickson
Thanks for the great writeup! It's always interesting to see how
a feature plays out in the real world. A couple of questions
though:

bq: We added 2 Cores options :
Do you mean you patched Solr? If so, are you willing to share the code
back? If both are yes, please open a JIRA, attach the patch and assign
it to me.

bq:  the number of file descriptors, it used a lot (need to increase global
max and per process fd)

Right, this makes sense since you have a bunch of cores all with their
own descriptors open. I'm assuming that you hit a rather high max
number and it stays pretty steady

bq: the overhead to parse solrconfig.xml and load dependencies to open
each core

Right, I tried to look at sharing the underlying solrconfig object but
it seemed pretty hairy. There are some extensive comments in the
JIRA of the problems I foresaw. There may be some action on this
in the future.

bq: lotsOfCores doesn’t work with SolrCloud

Right, we haven't concentrated on that, it's an interesting problem.
In particular it's not clear what happens when nodes go up/down,
replicate, resynch, all that.

bq: When you start, it spend a lot of times to discover cores due to a big

How long? I tried 15K cores on my laptop and I think I was getting 15
second delays or roughly 1K cores discovered/second. Is your delay
on the order of 50 seconds with 50K cores?

I'm not sure how you could do that in the background, but I haven't
thought about it much. I tried multi-threading core discovery and that
didn't help (SSD disk), I assumed that the problem was mostly I/O
contention (but didn't prove it). What if a request came in for a core
before you'd found it? I'm not sure what the right behavior would be
except perhaps to block on that request until core discovery was
complete. Hm. How would that work for your case? That
seems do-able.
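
That blocking approach could be sketched like this (illustrative Java, not Solr's actual code; the class and method names are made up): request threads wait on a latch that the background discovery thread releases once it finishes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: requests arriving before core discovery has
// finished block (with a timeout) until discovery completes.
public class DiscoveryGate {
    private final CountDownLatch discovered = new CountDownLatch(1);

    // Called once by the background discovery thread when done.
    public void discoveryComplete() {
        discovered.countDown();
    }

    // Called at the start of each request; returns false on timeout.
    public boolean awaitDiscovery(long timeout, TimeUnit unit)
            throws InterruptedException {
        return discovered.await(timeout, unit);
    }
}
```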

BTW, so far you get the prize for the most cores on a node I think.

Thanks again for the great feedback!

Erick

On Mon, Oct 7, 2013 at 3:53 AM, Soyez Olivier
olivier.so...@worldline.com wrote:
 Hello,

 In my company, we use Solr in production to offer full text search on
 mailboxes.
 We host tens of millions of mailboxes, but only webmail users have this
 feature (a few million).
 We have the following use case :
 - non-static indexes, with more updates (indexing and deleting) than
 select requests (7:1 ratio)
 - homogeneous configuration for all indexes
 - not many concurrent users

 We started to index mailboxes with Solr 1.4 in 2010, on a subset of
 400,000 users.
 - we had a cluster of 50 servers, 4 Solr per server, 2000 users per Solr
 instance
 - we grew to 6,000 users per Solr instance, 8 Solr per server, 60 GB per
 index (~2 million users)
 - we upgraded to Solr 3.5 in 2012
 As indexes grew, IOPS and response times increased more and more.

 The index size was mainly due to stored fields (large .fdt files).
 Retrieving these fields from the index was costly, because of many seeks
 in large files, with no way to limit usage.
 There is also an overhead on queries : too many results must be filtered
 to keep only those concerning the user.
 For these reasons and others, like non-pooled users, hardware savings,
 better scoring, and some requests that do not support filtering, we
 decided to use the LotsOfCores feature.

 Our goal was to change the current I/O usage : from lots of random I/O
 access on huge segments to mostly sequential I/O access on small segments.
 For our use case, it's not a big deal that the first query to a not yet
 loaded core will be slow.
 And we don't need to fit all the cores into memory at once.

 We started from the SOLR-1293 issue and the LotsOfCores wiki page to
 finally use a patched Solr 4.2.1 LotsOfCores in production (1 user = 1
 core).
 We no longer need to run so many Solr instances per node. We are now
 able to run around 50,000 cores per Solr instance and we plan to grow to
 100,000 cores per instance.
 At first, we used the solr.xml persistence. All cores have
 loadOnStartup=false and transient=true attributes, so a cold start
 is very quick. The response times were better than ever, in comparison
 with the poor response times we had before using LotsOfCores.

 We added 2 Cores options :
 - numBuckets : creates a subdirectory, based on a hash of the core name
 % numBuckets, in the core dataDir, because all cores cannot live in the
 same directory
 - Auto, with 3 different values :
 1) false : default behaviour
 2) createLoad : create the core if it does not exist, and load it on the
 fly on the first incoming request (update, select)
 3) onlyLoad : load the core on the fly on the first incoming request
 (update, select), if it exists on disk
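
The numBuckets layout described above could be sketched as follows (illustrative Java; the helper names are ours, not the patch's):

```java
import java.nio.file.Path;

// Hypothetical sketch of the numBuckets layout: a core's data directory
// is <root>/<hash(coreName) % numBuckets>/<coreName>, so cores spread
// across numBuckets subdirectories instead of one flat directory.
public class CoreBuckets {
    static int bucketFor(String coreName, int numBuckets) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        return Math.floorMod(coreName.hashCode(), numBuckets);
    }

    static Path dataDir(Path root, String coreName, int numBuckets) {
        return root.resolve(Integer.toString(bucketFor(coreName, numBuckets)))
                   .resolve(coreName);
    }
}
```

The bucket is a pure function of the core name, so any node can recompute a core's path without a lookup table.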

 Then, to improve performance and avoid synchronization in the solr.xml
 persistence, we disabled it.
 The drawback is that we can no longer see the full list of available
 cores with the admin core status command, only those warmed up.
 Finally, we can achieve very good performance with Solr LotsOfCores :
 - Index 5 emails (avg) + commit + search : x4.9 faster response time
 (mean), x5.4 faster (95th percentile)
 - Delete 5 documents (avg) : x8.4 faster response time (mean), x7.4
 faster (95th percentile)
 - Search : x3.7 faster response time (mean), x4 faster (95th percentile)

Re: feedback on Solr 4.x LotsOfCores feature

2013-10-07 Thread Yago Riveiro
I assume that the LotsOfCores feature doesn't use ZooKeeper.

I tried to simulate the cores as collections, but the size of
clusterstate.json grew bigger than 1 MB, and -Djute.maxbuffer was needed
to raise the 1 MB limitation.

A naive question : why isn't clusterstate.json split per collection?

--  
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



Re: feedback on Solr 4.x LotsOfCores feature

2013-10-07 Thread Shalin Shekhar Mangar
I think we'd all love to see those improvements land in Solr.

I was involved in the work at AOL WebMail where the LotsOfCores idea
originated. We had many of the problems that you've had to solve yourself.
I remember that we switched to the compound file format to reduce file
descriptors. We also had to switch back from TieredMergePolicy to
LogMergePolicy because TieredMergePolicy increased the overall random
disk I/O and we had latency issues because of it.


On Mon, Oct 7, 2013 at 1:23 PM, Soyez Olivier
olivier.so...@worldline.comwrote:

 Hello,

 In my company, we use Solr in production to offer full text search on
 mailboxes.
 We host tens of millions of mailboxes, but only webmail users have this
 feature (a few million).
 We have the following use case :
 - non-static indexes, with more updates (indexing and deleting) than
 select requests (7:1 ratio)
 - homogeneous configuration for all indexes
 - not many concurrent users

 We started to index mailboxes with Solr 1.4 in 2010, on a subset of
 400,000 users.
 - we had a cluster of 50 servers, 4 Solr per server, 2000 users per Solr
 instance
 - we grew to 6,000 users per Solr instance, 8 Solr per server, 60 GB per
 index (~2 million users)
 - we upgraded to Solr 3.5 in 2012
 As indexes grew, IOPS and response times increased more and more.

 The index size was mainly due to stored fields (large .fdt files).
 Retrieving these fields from the index was costly, because of many seeks
 in large files, with no way to limit usage.
 There is also an overhead on queries : too many results must be filtered
 to keep only those concerning the user.
 For these reasons and others, like non-pooled users, hardware savings,
 better scoring, and some requests that do not support filtering, we
 decided to use the LotsOfCores feature.

 Our goal was to change the current I/O usage : from lots of random I/O
 access on huge segments to mostly sequential I/O access on small segments.
 For our use case, it's not a big deal that the first query to a not yet
 loaded core will be slow.
 And we don't need to fit all the cores into memory at once.

 We started from the SOLR-1293 issue and the LotsOfCores wiki page to
 finally use a patched Solr 4.2.1 LotsOfCores in production (1 user = 1
 core).
 We no longer need to run so many Solr instances per node. We are now
 able to run around 50,000 cores per Solr instance and we plan to grow to
 100,000 cores per instance.
 At first, we used the solr.xml persistence. All cores have
 loadOnStartup=false and transient=true attributes, so a cold start
 is very quick. The response times were better than ever, in comparison
 with the poor response times we had before using LotsOfCores.

 We added 2 Cores options :
 - numBuckets : creates a subdirectory, based on a hash of the core name
 % numBuckets, in the core dataDir, because all cores cannot live in the
 same directory
 - Auto, with 3 different values :
 1) false : default behaviour
 2) createLoad : create the core if it does not exist, and load it on the
 fly on the first incoming request (update, select)
 3) onlyLoad : load the core on the fly on the first incoming request
 (update, select), if it exists on disk

 Then, to improve performance and avoid synchronization in the solr.xml
 persistence, we disabled it.
 The drawback is that we can no longer see the full list of available
 cores with the admin core status command, only those warmed up.
 Finally, we can achieve very good performance with Solr LotsOfCores :
 - Index 5 emails (avg) + commit + search : x4.9 faster response time
 (mean), x5.4 faster (95th percentile)
 - Delete 5 documents (avg) : x8.4 faster response time (mean), x7.4
 faster (95th percentile)
 - Search : x3.7 faster response time (mean), x4 faster (95th percentile)

 In fact, the better performance is mainly due to the small size of each
 index, but also to the isolation between cores (updates and queries on
 many mailboxes don't have side effects on each other).
 One important thing with the LotsOfCores feature is to take care of :
 - the number of file descriptors : it uses a lot (need to increase the
 global max and the per-process fd limits)
 - the value of transientCacheSize, depending on the RAM size and the
 allocated PermGen size
 - the ClassLoader leak that increases minor GC times when the CMS GC is
 enabled (use -XX:+CMSClassUnloadingEnabled)
 - the overhead of parsing solrconfig.xml and loading dependencies to
 open each core
 - LotsOfCores doesn't work with SolrCloud, so we store index locations
 outside of Solr. We have Solr proxies to route requests to the right
 instance.
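
The transientCacheSize mentioned above is essentially a bound on an LRU cache of open cores. A minimal sketch of that idea (illustrative Java, not Solr's implementation) using LinkedHashMap's access order:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: keep at most transientCacheSize cores open,
// evicting the least recently used one when the limit is exceeded.
public class TransientCoreCache<C> {
    private final Map<String, C> cache;

    public TransientCoreCache(int transientCacheSize) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, C>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, C> eldest) {
                // A real implementation would also close the evicted core.
                return size() > transientCacheSize;
            }
        };
    }

    public synchronized C get(String coreName) { return cache.get(coreName); }
    public synchronized void put(String coreName, C core) { cache.put(coreName, core); }
    public synchronized int size() { return cache.size(); }
}
```

Sizing this cache against available RAM (and, on these JVMs, PermGen) is the trade-off the list above is warning about.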

 Not in production, we tried the core discovery feature in Solr 4.4 with
 lots of cores.
 At startup, it spends a lot of time discovering cores due to the big
 number of cores; meanwhile, all requests fail (SolrDispatchFilter.init()
 not done yet). It would be great to have, for example, an option to run
 core discovery in the background, or just to be able to disable it, like
 we do in our use case.

 If someone is interested in these