Deciding how to correctly use Solr multicore

2014-02-09 Thread Pisarev, Vitaliy
Hello!

We are evaluating Solr usage in our organization and have come to the point
where we are past the functional tests and are now choosing the best
deployment topology.
Here are some details about the structure of the problem: the application deals
with storing and retrieving artifacts of various types. The artifacts are
stored in Projects. Each project can have hundreds of thousands of artifacts
(total across all types), and our largest customers have hundreds of projects
(~300-800), though the vast majority have tens of projects (~30-100).

Core granularity
In terms of core granularity, it seems to me that a core per project is
sensible, as pushing everything into a single core will probably be too much.
The entities themselves will have a special type field for distinction.
Moreover, not all of the projects may be active at a given time, so this allows
their indexes to remain latent on disk.
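
For reference, I assume each per-project core would be created on the fly with
the CoreAdmin API, roughly like this (the core and directory names are just
placeholders, and the instance directory with its conf/ is assumed to already
exist on each node):

  curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=project_123&instanceDir=cores/project_123&config=solrconfig.xml&schema=schema.xml"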


Availability and synchronization
Our application is deployed on premises at our customers' sites, so we cannot
go too far in the amount of extra resources we demand from them, e.g. dedicated
indexing servers. We pretty much need to make do with what is already there.

For now, we are planning to use the DIH to maintain the index. Each node in the
application cluster will have its own local index. When a project is created
(or the feature is enabled on an existing project), a core is created for it on
each one of the nodes, a full import is executed, and then a delta import is
scheduled to run on each node. This gives us simplicity, but I am wondering
about the performance and memory consumption costs. Also, I am wondering
whether we should use replication for this purpose instead. The requirement is
for the index to be updated once every 30 seconds - are delta imports designed
for this?
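
To make the plan concrete, a rough sketch of the kind of DIH configuration this
implies (the driver, table, and column names are purely illustrative):

  <!-- data-config.xml: full import plus delta import keyed on a modification timestamp -->
  <dataConfig>
    <dataSource type="JdbcDataSource" driver="org.postgresql.Driver"
                url="jdbc:postgresql://dbhost/app" user="solr" password="secret"/>
    <document>
      <entity name="artifact" pk="id"
              query="SELECT id, project_id, type, title, body FROM artifacts"
              deltaQuery="SELECT id FROM artifacts
                          WHERE modified &gt; '${dataimporter.last_index_time}'"
              deltaImportQuery="SELECT id, project_id, type, title, body
                                FROM artifacts WHERE id = '${dataimporter.delta.id}'"/>
    </document>
  </dataConfig>

  # run once when the core is created
  curl "http://localhost:8983/solr/project_123/dataimport?command=full-import"
  # scheduled externally (e.g. cron) to run every ~30 seconds on each node
  curl "http://localhost:8983/solr/project_123/dataimport?command=delta-import"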

I understand that this is a very complex problem in general. I have tried to
highlight the most significant aspects and would appreciate some initial
guidance. Note that we are planning to execute performance and stress testing
no matter what, but the assumption is that the topology of the solution can be
predetermined from the existing data.






Re: Deciding how to correctly use Solr multicore

2014-02-09 Thread Jack Krupansky
The first question I always ask is how you want to query the data - what is
the full range of query use cases?


For example, might a customer ever want to query across all of their
projects?


You didn't say how many customers you must be able to support. This leads to 
questions about how many customers or projects run on a single Solr server. 
It sounds like you may require quite a number of Solr servers, each 
multi-core. And in some cases a single customer might not fit on a single 
Solr server. SolrCloud might begin to make sense even though it sounds like 
a single collection would rarely need to be sharded.


You didn't speak at all about HA (High Availability) requirements or 
replication.


Or about query latency requirements or query load - which can impact 
replication requirements.


-- Jack Krupansky








Re: Deciding how to correctly use Solr multicore

2014-02-09 Thread Erick Erickson
You might also get some mileage out of the transient core concept, see:
http://wiki.apache.org/solr/LotsOfCores

The underlying idea is to allow only N cores to be loaded simultaneously, aged
out on an LRU basis. The penalty is that the first request for a core that's
not already loaded pays the time it takes to load it, which can be noticeable.
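
Roughly, with the old-style solr.xml that looks something like this (the cache
size and core names here are arbitrary):

  <solr persistent="true">
    <!-- keep at most 20 transient cores loaded; the least recently used are unloaded -->
    <cores adminPath="/admin/cores" transientCacheSize="20">
      <core name="project_123" instanceDir="project_123"
            transient="true" loadOnStartup="false"/>
    </cores>
  </solr>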

Also, Solr easily handles tens of millions of documents. An alternate design is
to simply index everything in a single core with a type field (which I think
you already have). Then you restrict results with simple fq clauses, like
fq=type:whatever.
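
For instance (field names here are only examples), a query for "requirement"
artifacts in a single project would look like:

  http://localhost:8983/solr/artifacts/select?q=title:login&fq=type:requirement&fq=project_id:123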

These are cached in the filterCache which you control through solrconfig.xml.
There are nuances around document relevance etc, but we'll leave that for later.
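
The stock solrconfig.xml entry looks something like this (the sizes shown are
just the shipped defaults; size it for the number of distinct fq values you
expect to keep hot):

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>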

NOTE: there is some overhead to having multiple cores rather than all-in-one.
That said, I know of a bunch of organizations that use the many-core approach,
so it's not an "X is always better" kind of thing.

Best,
Erick
