Re: Partition Question
No, this isn't what sharding is all about. Sharding takes a single logical index and splits it up amongst a number of physical units, often on individual machines. Loading and unloading partitions dynamically doesn't make any sense when talking about shards. So let's back up. You could create your own _cores_ that you load/unload and take over the distribution of the incoming queries manually. By that I mean that for your once-in-10,000-queries case, you go ahead and send your queries to the older cores and then unload them when you're done. You could even fire off the query to one core, unload it, fire off the query to the next core, unload it, etc. Of course your query would be very slow, but in such a rare case that may be acceptable. Or you could get some more memory/machines and just have some unused resources.

Best
Erick
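To make this concrete, here is a minimal SolrJ sketch of the sequential fan-out Erick describes, assuming a Solr 3.x-era setup on localhost:8983 with one core per day named day-0 (today) through day-6, and that an unloaded core can be re-registered via CoreAdmin CREATE pointing at its existing instanceDir. The core names, paths, and query string are hypothetical:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RareFanOutQuery {
    public static void main(String[] args) throws Exception {
        // Endpoint used for CoreAdmin actions (hypothetical host/port).
        SolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("body:error").setRows(50);

        for (int day = 1; day <= 6; day++) {
            String core = "day-" + day;

            // Re-register the unloaded core; CREATE with the existing
            // instanceDir loads the index already sitting on disk.
            CoreAdminRequest.createCore(core, "cores/" + core, admin);

            SolrServer old = new CommonsHttpSolrServer("http://localhost:8983/solr/" + core);
            QueryResponse rsp = old.query(query);
            System.out.println(core + ": " + rsp.getResults().getNumFound() + " hits");

            // Unload again so only the current day's core stays in memory.
            CoreAdminRequest.unloadCore(core, admin);
        }
    }
}

Querying one core at a time this way keeps at most one extra core in memory, which is exactly the slowness-for-memory trade-off Erick is pointing at.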
Re: Partition Question
On 08.05.2012 23:23, Lance Norskog wrote:
> Lucene does not support more than 2^32 unique documents, so you need to partition.

Just a small note: I doubt that Solr supports more than 2^31 unique documents, since, like most other Java applications, it uses int values.

Greetings,
Kuli
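For reference, Lucene addresses documents with Java ints, so the per-index ceiling Kuli refers to is Integer.MAX_VALUE:

public class DocIdLimit {
    public static void main(String[] args) {
        // Document numbers are Java ints, so a single index can hold
        // at most 2^31 - 1 documents, not 2^32.
        System.out.println(Integer.MAX_VALUE); // 2147483647
    }
}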
Re: Partition Question
Thanks Lance. There is already a clear partition - as you assumed, by date. My requirement is for the best setup for:
1. A *single machine*
2. A quickly changing index - so I need the option to load and unload partitions dynamically.
Do you think that the sharding model that Solr offers is the most suitable for this setup? What about the Solr multi-core model?
Re: Partition Question
Hi,
Can someone please guide me to the right way to partition the Solr index?

On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

Hi All,
Jan, thanks for the reply - answers to your questions are located below. Please update me if you have ideas that can solve my problems.

First, some corrections to my previous mail:

We have an index of ~2,000,000,000 documents and the query and facet times are too slow for us - our index in fact will be much larger. Most of our queries will be limited by time, hence we want to partition the data by date/time - even when unlimited (which is mostly what will happen), we have results in the recent records, so querying the whole dataset is redundant. We want to partition the data because the index size is too big and doesn't fit into memory (80 GB) - our data actually grows continuously over time, so it will never fit into memory, but it has to be available for queries in case results are found in older records or a full facet is required.

1. Is multi-core the best way to implement my requirement?
2. I noticed there are LOAD / UNLOAD actions on a core - should I use these actions when managing my cores? If so, how can I LOAD a core that I have unloaded? For example: I have 7 partitions/cores - one for each day of the week - we might have 2000 per day. In most cases I will search for documents only on the last day's core. Once every 10,000 queries I need documents from all cores. Question: do I need to unload all of the old cores and then load them on demand (when I see I need data from those cores)?
3. If the answer to the last question is no, how do I ensure that only the cores I want are loaded into memory?

Thanks,
Yuval

*Answers to Jan:*

* How large documents and how many fields? Small records, ~200 bytes, ~20 fields on average; most of them are not stored - schema and config file attached.
* What kind of queries? How many hits? How many facets? Have you studied debugQuery=true output? The problem is not with queries being slow per se; it is with getting 50 matches out of billions of matching docs.
* Do you use filter queries (fq) extensively? User-generated queries; fq would not reduce the dataset for some of our use cases.
* What data do you facet on? Many unique values per field? Text or ranges? What facet.method? The problem is not just faceting, it's with queries - let's start there.
* What kind of hardware? RAM/CPU? HP DL180 G6, 2x E5645 (12 cores), 48 GB RAM.
* How have you configured your JVM? How much memory? GC? java -Xms512M -Xmx40960M -jar start.jar
Re: Partition Question
Lucene does not support more than 2^32 unique documents, so you need to partition. In Solr this is done with Distributed Search: http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch

First, you have to decide a policy for which documents go to which 'shard'. It is common to make a hash code of the unique id, then distribute the documents modulo this value. This gives a roughly equal distribution of documents. If there is already a clear partition, like the date of the document (as with newspaper articles), you could use that instead.

You have new documents and existing documents. For new documents you need code implementing this policy that gets each new document to the right index. This could be one master program that passes them out, or each indexer could know which documents it gets.

If you want to split up your current index, that's different. I have done this: for each shard, make a copy of the full index, delete-by-query all of the documents that are NOT in that shard, and optimize. We had to do this in sequence, so it took a few days :) You don't need a full optimize - use 'maxSegments=50' or '100' to suppress that last final giant merge.

--
Lance Norskog
goks...@gmail.com
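A sketch of both steps Lance describes: the hash-mod routing policy for new documents, and the delete-by-query plus capped optimize used to carve a shard out of a copy of an existing index. The shard count, core URLs, and the id/timestamp field names are assumptions for illustration:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ShardSplit {
    static final int NUM_SHARDS = 4; // assumed shard count

    // Hash the unique id and take it modulo the shard count; masking the
    // sign bit keeps the index non-negative for negative hash codes.
    static int shardFor(String uniqueId) {
        return (uniqueId.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS;
    }

    public static void main(String[] args) throws Exception {
        // Route a new document to its shard.
        SolrServer[] shards = new SolrServer[NUM_SHARDS];
        for (int i = 0; i < NUM_SHARDS; i++) {
            shards[i] = new CommonsHttpSolrServer("http://localhost:8983/solr/shard" + i);
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        shards[shardFor("doc-42")].add(doc);

        // Carve a date-based shard out of a full copy of the index:
        // delete everything outside the wanted range, then merge down
        // while suppressing the final giant merge (maxSegments > 1).
        SolrServer copy = new CommonsHttpSolrServer("http://localhost:8983/solr/recent-copy");
        copy.deleteByQuery("*:* -timestamp:[NOW-7DAYS TO *]");
        copy.commit();
        copy.optimize(true, true, 50); // waitFlush, waitSearcher, maxSegments=50
    }
}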
Partition Question
Hi All,
We have an index of ~2,000,000,000 documents and the query and facet times are too slow for us. Before using the shards solution to improve performance, we thought about using the multi-core feature (our goal is to maximize performance on a single machine). Most of our queries will be limited by time, hence we want to partition the data by date/time. We want to partition the data because the index size is too big and doesn't fit into memory (80 GB).

1. Is multi-core the best way to implement my requirement?
2. I noticed there are LOAD / UNLOAD actions on a core - should I use these actions when managing my cores? If so, how can I LOAD a core that I have unloaded? For example: I have 7 partitions/cores - one for each day of the week. In most cases I will search for documents only on the last day's core. Once every 10,000 queries I need documents from all cores. Question: do I need to unload all of the old cores and then load them on demand (when I see I need data from those cores)?
3. If the answer to the last question is no, how do I ensure that only the cores I want are loaded into memory?

Thanks,
Yuval
Re: Partition Question
Hi,
First you need to investigate WHY faceting and querying are too slow. What exactly do you mean by slow? Can you please tell us more about your setup?

* How large are the documents and how many fields?
* What kind of queries? How many hits? How many facets? Have you studied debugQuery=true output?
* Do you use filter queries (fq) extensively?
* What data do you facet on? Many unique values per field? Text or ranges? What facet.method?
* What kind of hardware? RAM/CPU?
* How have you configured your JVM? How much memory? GC?

As you see, you will have to provide a lot more information on your use case and setup in order for us to judge the correct action to take. You might need to adjust your config, optimize your queries or caches, slim your schema, buy some more RAM, or an SSD :)

Normally, going multi-core on one box will not necessarily help in itself, as there is overhead in sharding across multiple cores as well. However, it COULD be a solution, since you say that most of the time you only need to consider 1/7 of your data. I would perhaps consider one hot core for the last 24h and one archive core for older data. You could then tune these differently regarding caches etc. Can you get back with some more details?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
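If the hot/archive split Jan suggests were in place, day-to-day traffic could hit only the hot core, while the rare full search fans out over both cores using the standard distributed-search shards parameter. A minimal SolrJ sketch, where the core names and query string are hypothetical:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class HotArchiveSearch {
    public static void main(String[] args) throws Exception {
        // The common case: search only the small, cache-friendly hot core.
        SolrServer hot = new CommonsHttpSolrServer("http://localhost:8983/solr/hot");
        SolrQuery q = new SolrQuery("body:error").setRows(50);
        System.out.println("hot only: " + hot.query(q).getResults().getNumFound());

        // The rare full search: fan out over both cores via distributed search.
        q.set("shards", "localhost:8983/solr/hot,localhost:8983/solr/archive");
        System.out.println("all data: " + hot.query(q).getResults().getNumFound());
    }
}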