Re: Partition Question

2012-05-12 Thread Erick Erickson
No, this isn't what sharding is all about. Sharding is taking a single
logical index and splitting it up amongst a number of physical
units, often on individual machines. Loading and unloading partitions
dynamically doesn't make any sense when talking about shards.

So let's back up. You could create your own _cores_ that you load/unload
and take over the distribution of the incoming queries manually. By that I mean
that for your once-in-10,000-queries case, you go ahead and send your queries
to the older cores and then unload them when you're done. You could even
fire off a query to one core, unload it, fire off the query to the next core,
unload it, etc.

Of course your query would be very slow, but in such a rare case this may
be acceptable.
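To make that concrete, here is a rough sketch of the load/query/unload cycle described above, written against the SolrJ client of that era (HttpSolrServer). The host, the per-day core names, and the use of CoreAdmin CREATE to re-register a previously unloaded core are illustrative assumptions, not details from this thread.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RareFullSearch {
    public static void main(String[] args) throws Exception {
        // CoreAdmin endpoint (hypothetical host/port).
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

        // Hypothetical per-day cores; only the newest core stays loaded normally.
        String[] oldCores = {"day_2012_05_06", "day_2012_05_07", "day_2012_05_08"};
        SolrQuery query = new SolrQuery("body:somethingrare"); // field name is made up

        long totalHits = 0;
        for (String core : oldCores) {
            // Re-register the core; instanceDir is assumed to match the core name.
            CoreAdminRequest.createCore(core, core, admin);

            // Query just this one core.
            SolrServer coreServer = new HttpSolrServer("http://localhost:8983/solr/" + core);
            QueryResponse rsp = coreServer.query(query);
            totalHits += rsp.getResults().getNumFound();

            // Unload it again so memory goes back to the "hot" core.
            CoreAdminRequest.unloadCore(core, admin);
        }
        System.out.println("Hits across old cores: " + totalHits);
    }
}
```

Merging and ranking the per-core result lists is left out here; doing that properly across indexes is exactly what Distributed Search automates.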

Or you could get some more memory/machines and just have some unused
resources.

Best
Erick

On Wed, May 9, 2012 at 5:08 AM, Yuval Dotan yuvaldo...@gmail.com wrote:
 Thanks Lance

 There is already a clear partition - as you assumed, by date.

 My requirement is for the best setup for:
 1. A *single machine*
 2. Quickly changing index - so I need to have the option to load and unload
 partitions dynamically

 Do you think that the sharding model that Solr offers is the most suitable
 for this setup?
 What about the Solr multi core model?

Re: Partition Question

2012-05-09 Thread Michael Kuhlmann

On 08.05.2012 23:23, Lance Norskog wrote:

Lucene does not support more than 2^32 unique documents, so you need to
partition.


Just a small note:

I doubt that Solr supports more than 2^31 unique documents, like most 
other Java applications that use int values.


Greetings,
Kuli




Re: Partition Question

2012-05-09 Thread Yuval Dotan
Thanks Lance

There is already a clear partition - as you assumed, by date.

My requirement is for the best setup for:
1. A *single machine*
2. Quickly changing index - so I need to have the option to load and unload
partitions dynamically

Do you think that the sharding model that Solr offers is the most suitable
for this setup?
What about the Solr multi core model?

Re: Partition Question

2012-05-08 Thread Yuval Dotan
Hi
Can someone please guide me to the right way to partition the solr index?

On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

 Hi All
 Jan, thanks for the reply - answers for your questions are located below
 Please update me if you have ideas that can solve my problems.

 First, some corrections to my previous mail:

  Hi All
  We have an index of ~2,000,000,000 Documents and the query and facet
 times
  are too slow for us - our index in fact will be much larger

  Most of our queries will be limited by time, hence we want to partition
 the
  data by date/time - even when unlimited – which is mostly what will
 happen, we have results in the recent records and querying the whole
 dataset is redundant

  We want to partition the data because the index size is too big and
 doesn't
  fit into memory (80 GB) - our data actually continuously grows over
 time, it will never fit into memory, but has to be available for queries in
 case results are found in older records or a full facet is required

 
   1. Is multi core the best way to implement my requirement?
   2. I noticed there are some LOAD / UNLOAD actions on a core - should I use
   these actions when managing my cores? If so, how can I LOAD a core that I
   have unloaded?
   For example:
   I have 7 partitions / cores - one for each day of the week - we might
   have 2000 per day

  In most cases I will search for documents only on the last day core.
   Once every 10,000 queries I need documents from all cores.
  Question: Do I need to unload all of the old cores and then load them on
   demand (when I see I need data from these cores)?
   3. If the answer to the last question is no, how do I ensure that only
   cores that are loaded into memory are the ones I want?
 
  Thanks
  Yuval
 *
 *
 *Answers to Jan:*

 Hi,

 First you need to investigate WHY faceting and querying is too slow.
 What exactly do you mean by slow? Can you please tell us more about your
 setup?

 * How large documents and how many fields?
 small records ~200bytes, 20 fields avg most of them are not stored -
 attached schema and config file

  * What kind of queries? How many hits? How many facets? Have you studied
 debugQuery=true output?
 problem is not with queries being slow per se, it is with getting 50
 matches out of billions of matching docs

 * Do you use filter queries (fq) extensively?
  user-generated queries; fq would not reduce the dataset for some of our
  use cases

 * What data do you facet on? Many unique values per field? Text or ranges?
 What facet.method?
  problem is not just faceting, it’s with queries – let’s start there

 * What kind of hardware? RAM/CPU
  HP DL180 G6, 2x E5645 (12 cores)
 48 GB RAM
  * How have you configured your JVM? How much memory? GC?
 java -Xms512M -Xmx40960M -jar start.jar

 As you see, you will have to provide a lot more information on your use
 case and setup in order for us to judge correct action to take. You might
 need to adjust your config, or to optimize your queries or caches, slim
 your schema, buy some more RAM, or an SSD :)

 Normally, going multi core on one box will not necessarily help in itself,
 as there is overhead in sharding multi cores as well. However, it COULD be
 a solution since you say that most of the time you only need to consider
 1/7 of your data. I would perhaps consider one hot core for last 24h, and
 one archive core for older data. You could then tune these differently
 regarding caches etc.

 Can you get back with some more details?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com




Re: Partition Question

2012-05-08 Thread Lance Norskog
Lucene does not support more than 2^32 unique documents, so you need to
partition. In Solr this is done with Distributed Search:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch
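For reference, pre-SolrCloud Distributed Search is driven by the shards request parameter. A minimal SolrJ sketch (hosts and core names are made up for illustration):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuery {
    public static void main(String[] args) throws Exception {
        // Any one shard can act as the coordinator for the distributed request.
        HttpSolrServer solr = new HttpSolrServer("http://host1:8983/solr/shard1");

        SolrQuery q = new SolrQuery("type:article");
        // Comma-separated shard list; Solr fans the query out and merges the results.
        q.set("shards", "host1:8983/solr/shard1,host2:8983/solr/shard2");

        QueryResponse rsp = solr.query(q);
        System.out.println("Total hits across shards: " + rsp.getResults().getNumFound());
    }
}
```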

First, you have to decide a policy for which documents go to which
'shard'. It is common to make a hash code from the unique id, then
distribute the documents modulo this value. This gives a roughly equal
distribution of documents. If there is already a clear partition, like
the date of the document (as with newspaper articles), you could use that
instead.
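A tiny sketch of that routing policy (the class and method names are purely illustrative):

```java
// Decide which shard a document belongs to, either by hashing its unique id
// or, if there is a natural partition, by its date.
public class ShardPolicy {
    private final int numShards;

    public ShardPolicy(int numShards) {
        this.numShards = numShards;
    }

    // Hash-based routing: roughly even distribution across shards.
    // The mask keeps the value non-negative even for Integer.MIN_VALUE hashes.
    public int shardForId(String uniqueId) {
        return (uniqueId.hashCode() & 0x7fffffff) % numShards;
    }

    // Date-based routing: e.g. one shard per day of the week.
    public int shardForDayOfWeek(java.util.Calendar cal) {
        return (cal.get(java.util.Calendar.DAY_OF_WEEK) - 1) % numShards;
    }

    public static void main(String[] args) {
        ShardPolicy policy = new ShardPolicy(7);
        System.out.println(policy.shardForId("doc-42"));   // stable value in [0, 7)
        System.out.println(policy.shardForDayOfWeek(java.util.Calendar.getInstance()));
    }
}
```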

You have new documents and existing documents. For new documents you
need code for this policy to get all new documents to the right index.
This could be one master program that passes them out, or each indexer
could know which documents it gets.

If you want to split up your current index, that's different. I have
done this: for each shard, make a copy of the full index,
delete-by-query all of the documents that are NOT in that shard, and
optimize. We had to do this in sequence so it took a few days :) You
don't need a full optimize. Use 'maxSegments=50' or '100' to suppress
that final giant merge.
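A hedged SolrJ sketch of that split step for a single shard copy; the shard_id field, the query syntax, and the maxSegments value are assumptions layered on Lance's description rather than his actual scripts:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SplitOneShard {
    public static void main(String[] args) throws Exception {
        // Points at a COPY of the full index that is to become shard 3.
        SolrServer shardCopy = new HttpSolrServer("http://localhost:8983/solr/shard3");

        // Delete everything that does NOT belong to this shard.
        // Assumes a precomputed shard_id field; adapt to whatever routing you chose.
        shardCopy.deleteByQuery("*:* -shard_id:3");
        shardCopy.commit();

        // Merge segments down without forcing the final single-segment merge:
        // optimize(waitFlush, waitSearcher, maxSegments)
        shardCopy.optimize(true, true, 50);
    }
}
```

On an index of this size each delete-by-query pass is expensive, which is why Lance notes the whole sequence took days when run shard by shard.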


-- 
Lance Norskog
goks...@gmail.com


Partition Question

2012-05-06 Thread Yuval Dotan
Hi All
We have an index of ~2,000,000,000 Documents and the query and facet times
are too slow for us.
Before using the shards solution for improving performance, we thought
about using the multicore feature (our goal is to maximize performance for
a single machine).
Most of our queries will be limited by time, hence we want to partition the
data by date/time.
We want to partition the data because the index size is too big and doesn't
fit into memory (80 GB).

1. Is multi core the best way to implement my requirement?
2. I noticed there are some LOAD / UNLOAD actions on a core - should I use
these actions when managing my cores? If so, how can I LOAD a core that I
have unloaded?
For example:
I have 7 partitions / cores - one for each day of the week
In most cases I will search for documents only on the last day core.
Once every 10,000 queries I need documents from all cores.
Question: Do I need to unload all of the old cores and then load them on
demand (when I see I need data from these cores)?
3. If the answer to the last question is no, how do I ensure that only
cores that are loaded into memory are the ones I want? (See the sketch below.)

Thanks
Yuval
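Regarding question 3 above: one way to see which cores are actually registered (loaded) at any moment is the CoreAdmin STATUS action. A minimal SolrJ sketch, with host and client details assumed rather than taken from the thread:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;
import org.apache.solr.common.util.NamedList;

public class ListLoadedCores {
    public static void main(String[] args) throws Exception {
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

        CoreAdminRequest status = new CoreAdminRequest();
        status.setAction(CoreAdminAction.STATUS);
        CoreAdminResponse rsp = status.process(admin);

        // Every entry here is a core currently registered with this Solr instance.
        NamedList<NamedList<Object>> cores = rsp.getCoreStatus();
        for (int i = 0; i < cores.size(); i++) {
            System.out.println("loaded core: " + cores.getName(i));
        }
    }
}
```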


Re: Partition Question

2012-05-06 Thread Jan Høydahl
Hi,

First you need to investigate WHY faceting and querying is too slow.
What exactly do you mean by slow? Can you please tell us more about your setup?
* How large documents and how many fields?
* What kind of queries? How many hits? How many facets? Have you studied
debugQuery=true output?
* Do you use filter queries (fq) extensively?
* What data do you facet on? Many unique values per field? Text or ranges? What 
facet.method?
* What kind of hardware? RAM/CPU
* How have you configured your JVM? How much memory? GC?

As you see, you will have to provide a lot more information on your use case 
and setup in order for us to judge correct action to take. You might need to 
adjust your config, or to optimize your queries or caches, slim your schema, 
buy some more RAM, or an SSD :)

Normally, going multi core on one box will not necessarily help in itself, as 
there is overhead in sharding multi cores as well. However, it COULD be a 
solution since you say that most of the time you only need to consider 1/7 of 
your data. I would perhaps consider one hot core for last 24h, and one 
archive core for older data. You could then tune these differently regarding 
caches etc.
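As a sketch of how a client might route between such a hot core and an archive core (core names, URLs, and the 24-hour cutoff are assumptions based on the suggestion above, not part of it):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HotArchiveRouter {
    // Hypothetical cores: "hot" holds the last 24h of documents, "archive" everything older.
    private final SolrServer hot = new HttpSolrServer("http://localhost:8983/solr/hot");
    private final SolrServer archive = new HttpSolrServer("http://localhost:8983/solr/archive");

    // Route to the small, well-cached hot core when the requested window is recent;
    // a query spanning both windows would have to hit both cores (or use the shards param).
    public QueryResponse search(String userQuery, long fromMillis) throws Exception {
        long dayAgo = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
        SolrServer target = (fromMillis >= dayAgo) ? hot : archive;
        return target.query(new SolrQuery(userQuery));
    }
}
```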

Can you get back with some more details?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
