Re: Partition Question

2012-05-12 Thread Erick Erickson
No, this isn't what sharding is all about. Sharding is taking a single
logical index and splitting it up amongst a number of physical
units, often on individual machines. Loading and unloading partitions
dynamically doesn't make any sense when talking about shards.

So let's back up. You could create your own _cores_ that you load/unload
and take over the distribution of the incoming queries manually. By that I mean
that for your once-in-10,000-queries case, you go ahead and send your queries
to the older cores and then unload them when you're done. You could even
fire off a query to one core, unload it, fire off the query to the next core,
unload it, etc.

Of course your query would be very slow, but in such a rare case this may
be acceptable.
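To make that concrete, here is a rough sketch of the load/query/unload cycle described above, written against the SolrJ client of that era (HttpSolrServer). The host, the per-day core names, and the use of CoreAdmin CREATE to re-register a previously unloaded core are illustrative assumptions, not details from this thread.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RareFullSearch {
    public static void main(String[] args) throws Exception {
        // CoreAdmin endpoint (hypothetical host/port).
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

        // Hypothetical per-day cores; only the newest core stays loaded normally.
        String[] oldCores = {"day_2012_05_06", "day_2012_05_07", "day_2012_05_08"};
        SolrQuery query = new SolrQuery("body:somethingrare"); // field name is made up

        long totalHits = 0;
        for (String core : oldCores) {
            // Re-register the core; instanceDir is assumed to match the core name.
            CoreAdminRequest.createCore(core, core, admin);

            // Query just this one core.
            SolrServer coreServer = new HttpSolrServer("http://localhost:8983/solr/" + core);
            QueryResponse rsp = coreServer.query(query);
            totalHits += rsp.getResults().getNumFound();

            // Unload it again so memory goes back to the "hot" core.
            CoreAdminRequest.unloadCore(core, admin);
        }
        System.out.println("Hits across old cores: " + totalHits);
    }
}
```

Merging and ranking the per-core result lists is left out here; doing that properly across indexes is exactly what Distributed Search automates.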

Or you could get some more memory/machines and just have some unused
resources.

Best
Erick

On Wed, May 9, 2012 at 5:08 AM, Yuval Dotan yuvaldo...@gmail.com wrote:
 Thanks Lance

 There is already a clear partition - as you assumed, by date.

 My requirement is for the best setup for:
 1. A *single machine*
 2. Quickly changing index - so I need to have the option to load and unload
 partitions dynamically

 Do you think that the sharding model that Solr offers is the most suitable
 for this setup?
 What about the Solr multi core model?

Re: Partition Question

2012-05-09 Thread Michael Kuhlmann

On 08.05.2012 23:23, Lance Norskog wrote:

Lucene does not support more than 2^32 unique documents, so you need to
partition.


Just a small note:

I doubt that Solr supports more than 2^31 unique documents, like most 
other Java applications that use int values.


Greetings,
Kuli




Re: Partition Question

2012-05-09 Thread Yuval Dotan
Thanks Lance

There is already a clear partition - as you assumed, by date.

My requirement is for the best setup for:
1. A *single machine*
2. Quickly changing index - so I need to have the option to load and unload
partitions dynamically

Do you think that the sharding model that Solr offers is the most suitable
for this setup?
What about the Solr multi core model?

Re: Partition Question

2012-05-08 Thread Yuval Dotan
Hi
Can someone please guide me to the right way to partition the solr index?

On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

 Hi All
 Jan, thanks for the reply - answers for your questions are located below
 Please update me if you have ideas that can solve my problems.

 First, some corrections to my previous mail:

  Hi All
  We have an index of ~2,000,000,000 Documents and the query and facet
 times
  are too slow for us - our index in fact will be much larger

  Most of our queries will be limited by time, hence we want to partition
 the
  data by date/time - even when unlimited – which is mostly what will
 happen, we have results in the recent records and querying the whole
 dataset is redundant

  We want to partition the data because the index size is too big and
 doesn't
  fit into memory (80 GB) - our data actually continuously grows over
 time, it will never fit into memory, but has to be available for queries in
 case results are found in older records or a full facet is required

 
   1. Is multi core the best way to implement my requirement?
   2. I noticed there are some LOAD / UNLOAD actions on a core - should I use
   these actions when managing my cores? If so, how can I LOAD a core that I
   have unloaded?
   For example:
   I have 7 partitions / cores - one for each day of the week - we might
   have 2000 per day

  In most cases I will search for documents only on the last day core.
   Once every 10,000 queries I need documents from all cores.
  Question: Do I need to unload all of the old cores and then load them on
   demand (when I see I need data from these cores)?
   3. If the answer to the last question is no, how do I ensure that only
   cores that are loaded into memory are the ones I want?
 
  Thanks
  Yuval
 *
 *
 *Answers to Jan:*

 Hi,

 First you need to investigate WHY faceting and querying is too slow.
 What exactly do you mean by slow? Can you please tell us more about your
 setup?

 * How large documents and how many fields?
 small records ~200bytes, 20 fields avg most of them are not stored -
 attached schema and config file

  * What kind of queries? How many hits? How many facets? Have you studied
 debugQuery=true output?
 problem is not with queries being slow per se, it is with getting 50
 matches out of billions of matching docs

 * Do you use filter queries (fq) extensively?
  user-generated queries; fq would not reduce the dataset for some of our
  use cases

 * What data do you facet on? Many unique values per field? Text or ranges?
 What facet.method?
  problem is not just faceting, it’s with queries – let’s start there

 * What kind of hardware? RAM/CPU
  HP DL180 G6, 2x E5645 (12 cores)
 48 GB RAM
  * How have you configured your JVM? How much memory? GC?
 java -Xms512M -Xmx40960M -jar start.jar

 As you see, you will have to provide a lot more information on your use
 case and setup in order for us to judge correct action to take. You might
 need to adjust your config, or to optimize your queries or caches, slim
 your schema, buy some more RAM, or an SSD :)

 Normally, going multi core on one box will not necessarily help in itself,
 as there is overhead in sharding multi cores as well. However, it COULD be
 a solution since you say that most of the time you only need to consider
 1/7 of your data. I would perhaps consider one hot core for last 24h, and
 one archive core for older data. You could then tune these differently
 regarding caches etc.

 Can you get back with some more details?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com




Re: Partition Question

2012-05-08 Thread Lance Norskog
Lucene does not support more than 2^32 unique documents, so you need to
partition. In Solr this is done with Distributed Search:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch
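For reference, pre-SolrCloud Distributed Search is driven by the shards request parameter. A minimal SolrJ sketch (hosts and core names are made up for illustration):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuery {
    public static void main(String[] args) throws Exception {
        // Any one shard can act as the coordinator for the distributed request.
        HttpSolrServer solr = new HttpSolrServer("http://host1:8983/solr/shard1");

        SolrQuery q = new SolrQuery("type:article");
        // Comma-separated shard list; Solr fans the query out and merges the results.
        q.set("shards", "host1:8983/solr/shard1,host2:8983/solr/shard2");

        QueryResponse rsp = solr.query(q);
        System.out.println("Total hits across shards: " + rsp.getResults().getNumFound());
    }
}
```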

First, you have to decide a policy for which documents go to which
'shard'. It is common to make a hash code from the unique id, then
distribute the documents modulo this value. This gives a roughly equal
distribution of documents. If there is already a clear partition, like
the date of the document (as with newspaper articles), you could use that
instead.
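A tiny sketch of that routing policy (the class and method names are purely illustrative):

```java
// Decide which shard a document belongs to, either by hashing its unique id
// or, if there is a natural partition, by its date.
public class ShardPolicy {
    private final int numShards;

    public ShardPolicy(int numShards) {
        this.numShards = numShards;
    }

    // Hash-based routing: roughly even distribution across shards.
    // The mask keeps the value non-negative even for Integer.MIN_VALUE hashes.
    public int shardForId(String uniqueId) {
        return (uniqueId.hashCode() & 0x7fffffff) % numShards;
    }

    // Date-based routing: e.g. one shard per day of the week.
    public int shardForDayOfWeek(java.util.Calendar cal) {
        return (cal.get(java.util.Calendar.DAY_OF_WEEK) - 1) % numShards;
    }

    public static void main(String[] args) {
        ShardPolicy policy = new ShardPolicy(7);
        System.out.println(policy.shardForId("doc-42"));   // stable value in [0, 7)
        System.out.println(policy.shardForDayOfWeek(java.util.Calendar.getInstance()));
    }
}
```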

You have new documents and existing documents. For new documents you
need code for this policy to get all new documents to the right index.
This could be one master program that passes them out, or each indexer
could know which documents it gets.

If you want to split up your current index, that's different. I have
done this: for each shard, make a copy of the full index,
delete-by-query all of the documents that are NOT in that shard, and
optimize. We had to do this in sequence so it took a few days :) You
don't need a full optimize. Use 'maxSegments=50' or '100' to suppress
that final giant merge.
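A hedged SolrJ sketch of that split step for a single shard copy; the shard_id field, the query syntax, and the maxSegments value are assumptions layered on Lance's description rather than his actual scripts:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SplitOneShard {
    public static void main(String[] args) throws Exception {
        // Points at a COPY of the full index that is to become shard 3.
        SolrServer shardCopy = new HttpSolrServer("http://localhost:8983/solr/shard3");

        // Delete everything that does NOT belong to this shard.
        // Assumes a precomputed shard_id field; adapt to whatever routing you chose.
        shardCopy.deleteByQuery("*:* -shard_id:3");
        shardCopy.commit();

        // Merge segments down without forcing the final single-segment merge:
        // optimize(waitFlush, waitSearcher, maxSegments)
        shardCopy.optimize(true, true, 50);
    }
}
```

On an index of this size each delete-by-query pass is expensive, which is why Lance notes the whole sequence took days when run shard by shard.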


-- 
Lance Norskog
goks...@gmail.com


Partition Question

2012-05-06 Thread Yuval Dotan
Hi All
We have an index of ~2,000,000,000 Documents and the query and facet times
are too slow for us.
Before using the shards solution for improving performance, we thought
about using the multicore feature (our goal is to maximize performance for
a single machine).
Most of our queries will be limited by time, hence we want to partition the
data by date/time.
We want to partition the data because the index size is too big and doesn't
fit into memory (80 GB).

1. Is multi core the best way to implement my requirement?
2. I noticed there are some LOAD / UNLOAD actions on a core - should I use
these actions when managing my cores? If so, how can I LOAD a core that I
have unloaded?
For example:
I have 7 partitions / cores - one for each day of the week
In most cases I will search for documents only on the last day core.
Once every 10,000 queries I need documents from all cores.
Question: Do I need to unload all of the old cores and then load them on
demand (when I see I need data from these cores)?
3. If the answer to the last question is no, how do I ensure that only
cores that are loaded into memory are the ones I want? (See the sketch below.)

Thanks
Yuval
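Regarding question 3 above: one way to see which cores are actually registered (loaded) at any moment is the CoreAdmin STATUS action. A minimal SolrJ sketch, with host and client details assumed rather than taken from the thread:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;
import org.apache.solr.common.util.NamedList;

public class ListLoadedCores {
    public static void main(String[] args) throws Exception {
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

        CoreAdminRequest status = new CoreAdminRequest();
        status.setAction(CoreAdminAction.STATUS);
        CoreAdminResponse rsp = status.process(admin);

        // Every entry here is a core currently registered with this Solr instance.
        NamedList<NamedList<Object>> cores = rsp.getCoreStatus();
        for (int i = 0; i < cores.size(); i++) {
            System.out.println("loaded core: " + cores.getName(i));
        }
    }
}
```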


Re: Partition Question

2012-05-06 Thread Jan Høydahl
Hi,

First you need to investigate WHY faceting and querying is too slow.
What exactly do you mean by slow? Can you please tell us more about your setup?
* How large documents and how many fields?
* What kind of queries? How many hits? How many facets? Have you studied
debugQuery=true output?
* Do you use filter queries (fq) extensively?
* What data do you facet on? Many unique values per field? Text or ranges? What 
facet.method?
* What kind of hardware? RAM/CPU
* How have you configured your JVM? How much memory? GC?

As you see, you will have to provide a lot more information on your use case 
and setup in order for us to judge correct action to take. You might need to 
adjust your config, or to optimize your queries or caches, slim your schema, 
buy some more RAM, or an SSD :)

Normally, going multi core on one box will not necessarily help in itself, as 
there is overhead in sharding multi cores as well. However, it COULD be a 
solution since you say that most of the time you only need to consider 1/7 of 
your data. I would perhaps consider one hot core for last 24h, and one 
archive core for older data. You could then tune these differently regarding 
caches etc.
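As a sketch of how a client might route between such a hot core and an archive core (core names, URLs, and the 24-hour cutoff are assumptions based on the suggestion above, not part of it):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HotArchiveRouter {
    // Hypothetical cores: "hot" holds the last 24h of documents, "archive" everything older.
    private final SolrServer hot = new HttpSolrServer("http://localhost:8983/solr/hot");
    private final SolrServer archive = new HttpSolrServer("http://localhost:8983/solr/archive");

    // Route to the small, well-cached hot core when the requested window is recent;
    // a query spanning both windows would have to hit both cores (or use the shards param).
    public QueryResponse search(String userQuery, long fromMillis) throws Exception {
        long dayAgo = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
        SolrServer target = (fromMillis >= dayAgo) ? hot : archive;
        return target.query(new SolrQuery(userQuery));
    }
}
```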

Can you get back with some more details?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
