Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Hello all,

I have a similar situation for grouping, where I want to group my products into top categories for an ecommerce application. The number of groups here is less than 10 and the total number of docs in the index is 10 million. Will Solr grouping be an issue here? We have seen OOM issues when we tried grouping books by similar editions against the same index. However, if we are grouping by category, where the number of groups is less than 10, will it still be a problem? Any thoughts on this would be greatly appreciated.

--
View this message in context: http://lucene.472066.n3.nabble.com/Scalability-of-Solr-Result-Grouping-Field-Collapsing-Millions-Billions-of-documents-tp4002524p4017945.html
Sent from the Solr - User mailing list archive at Nabble.com.
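As discussed later in the thread, grouping memory scales roughly with the number of unique values in the grouped field, so a low-cardinality category field (under 10 groups) is a very different case from grouping on a near-unique field. A minimal sketch of what such a request's parameters might look like (the field name `top_category` is an assumption for illustration, not from this thread):

```python
from urllib.parse import urlencode

# Hypothetical low-cardinality grouping request. Because group.field here
# has fewer than 10 distinct values, the per-group bookkeeping stays tiny,
# unlike grouping on a near-unique field such as a file path.
params = {
    "q": "*:*",
    "group": "true",
    "group.field": "top_category",  # assumed low-cardinality string field
    "group.limit": 5,               # top 5 docs returned per category
    "rows": 10,
}
query_string = "/select?" + urlencode(params)
print(query_string)
```

The request shape is the same either way; only the cardinality of `group.field` changes the memory cost.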
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Tom,

Feel free to look at my benchmark results for two alternative joining approaches: http://blog.griddynamics.com/2012/08/block-join-query-performs.html

Regards

On Thu, Aug 23, 2012 at 4:40 PM, Erick Erickson erickerick...@gmail.com wrote:

Tom: I think my comments were that grouping on a field where there was a unique value _per document_ chewed up a lot of resources. Conceptually, there's a bucket for each unique group value. And grouping on a file path is just asking for trouble. But the memory used for grouping should max out as a function of the number of unique values in the grouped field.

Best
Erick

On Wed, Aug 22, 2012 at 11:32 PM, Lance Norskog goks...@gmail.com wrote:

[quoted text trimmed]

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
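For context on the block-join alternative Mikhail benchmarks: parent and child documents are indexed as one contiguous block, and the join happens at query time instead of collapsing a flat result set. A sketch of what such a request might look like, assuming the `{!parent}` query parser syntax that later Solr releases expose (the field names `type` and `page_text` are assumptions, not from this thread):

```python
from urllib.parse import urlencode

# Hypothetical block-join request: parent docs carry type:book, child docs
# carry page_text. The {!parent} parser (from later Solr releases) returns
# the parent book for every child page matching the inner query, so no
# per-group buckets are needed at query time.
params = {
    "q": '{!parent which="type:book"}page_text:nature',
    "fl": "id,title",
    "rows": 10,
}
query = "/select?" + urlencode(params)
print(query)
```

The trade-off, as the linked benchmark discusses, is that block join requires indexing parents and children together, so updates touch the whole block.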
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
How do you separate the documents among the shards? Can you set up the shards such that one collapse group is only on a single shard? So that you never have to do distributed grouping?

On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

This won't work, see my thread on Solr 3.6 field collapsing.

Thanks,
Tirthankar

-----Original Message-----
From: Tom Burton-West tburt...@umich.edu
Date: Tue, 21 Aug 2012 18:39:25
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org
Cc: William Dueber dueb...@umich.edu; Phillip Farber pfar...@umich.edu
Subject: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

[quoted text trimmed]

--
Lance Norskog
goks...@gmail.com
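Lance's suggestion above amounts to document routing: send every document of a group to the same shard, so each shard can group its own results exactly and no cross-shard merge is needed. A sketch of such a routing function, assuming the 12-shard setup described in the thread (the hash scheme is an illustration; Solr 3.6 does not do this for you):

```python
# Route every document of a group to one shard so a collapse group never
# spans shards. Uses a stable byte-sum hash rather than Python's built-in
# hash(), which is randomized per process.
NUM_SHARDS = 12  # matches the 12-shard setup described in the thread

def shard_for(group_key: str) -> int:
    # Same book/journal id always maps to the same shard.
    return sum(group_key.encode("utf-8")) % NUM_SHARDS

# All pages/copies of the same book land together:
assert shard_for("book-123") == shard_for("book-123")
```

With this layout, each shard's grouped response is already correct on its own, which sidesteps the distributed-grouping cost Lance warns about.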
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
You can collapse in each shard as a separate query.

Lance Norskog goks...@gmail.com wrote:

How do you separate the documents among the shards? Can you set up the shards such that one collapse group is only on a single shard? So that you never have to do distributed grouping?

[quoted text trimmed]
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Hi Lance,

I don't understand enough of how field collapsing is implemented, but I thought it worked with distributed search. Are you saying it only works if everything that needs collapsing is on the same shard?

Tom

On Wed, Aug 22, 2012 at 2:41 AM, Lance Norskog goks...@gmail.com wrote:

How do you separate the documents among the shards? Can you set up the shards such that one collapse group is only on a single shard? So that you never have to do distributed grouping?

[quoted text trimmed]

--
Lance Norskog
goks...@gmail.com
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Hi Tirthankar,

Can you give me a quick summary of what won't work and why? I couldn't figure it out from looking at your thread. You seem to have a different issue, but maybe I'm missing something here.

Tom

On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

This won't work, see my thread on Solr 3.6 field collapsing.

Thanks,
Tirthankar
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Hi Lance and Tirthankar,

We are currently using Solr 3.6. I tried a search across our current 12 shards grouping by book id (record_no in our schema) and it seems to work fine (the query, with the actual urls for the shards changed, is appended below). I then searched for the record_no of the second group in the results to confirm that the number of records being folded is correct. In both cases numFound is 505, so it seems as though the record counts for the group are correct. Then I tried the same search but changed the shards parameter to limit the search to half of the shards and got numFound = 325. This shows that the items in the group are distributed between different shards.

What am I missing here? What is it that you are saying does not work?

Tom

Field collapse query (IP address changed, newlines added, and shard urls simplified for readability):

http://solr-myhost.edu/serve-9/select?indent=on&version=2.2
&shards=shard1,shard2,shard3,shard4,shard5,shard6,...shard12
&q=title:nature&fq=&start=0&rows=10
&fl=id,author,title,volume_enumcron,score
&group=true&group.field=record_no&group.limit=2
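Tom's sanity check above can be scripted: count the docs for one record_no against all shards and against each half, and compare the numFound totals. A sketch under assumed shard names (the real shard urls are elided in the thread):

```python
from urllib.parse import urlencode

# Hypothetical shard names standing in for the simplified urls above.
ALL_SHARDS = [f"shard{i}" for i in range(1, 13)]

def count_query(record_no: str, shards: list) -> str:
    # Ungrouped count of docs with this record_no, limited to the given shards.
    params = {
        "q": f"record_no:{record_no}",
        "rows": 0,                    # only numFound is needed
        "shards": ",".join(shards),
    }
    return "/select?" + urlencode(params)

# If numFound(first half) + numFound(second half) == numFound(all shards),
# the group's members really are spread across shards -- e.g. 325 of the
# 505 docs in Tom's test came from one half of the shards.
first_half = count_query("rec-505", ALL_SHARDS[:6])
print(first_half)
```

This checks the shard distribution directly, without relying on the grouped response's counts.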
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Thanks Tirthankar,

So the issue is memory use for sorting. I'm not sure I understand how sorting of grouping fields is involved with the defaults and field collapsing, since the default sorts by relevance, not by grouping field. On the other hand, I don't know much about how field collapsing is implemented. So far the few tests I've made haven't revealed any memory problems. We are using very small string fields for grouping, and I think that we probably only have a couple of cases where we are grouping more than a few thousand docs. I will try to find a query with a lot of docs per group and take a look at the memory use using JConsole.

Tom

On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi Tom,

We had an issue where we were keeping millions of docs in a single node and we were trying to group them on a string field which is nothing but a full file path... that caused Solr to go out of memory.

Erick has explained nicely in the thread why it won't work, and I had to find another way of architecting it.

How do you think this is different in your case? If you want to group by a string field with thousands of similar entries, I am guessing you will face the same issue.

Thanks,
Tirthankar

***Legal Disclaimer***
This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you.
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Yes, distributed grouping works, but grouping takes a lot of resources. If you can avoid it in distributed mode, so much the better.

On Wed, Aug 22, 2012 at 3:35 PM, Tom Burton-West tburt...@umich.edu wrote:

[quoted text trimmed]

--
Lance Norskog
goks...@gmail.com
Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Hello all,

We are thinking about using Solr field collapsing on a rather large scale and wonder if anyone has experience with performance when doing field collapsing on millions or billions of documents (details below). Are there performance issues with grouping large result sets?

Details:

We have a collection of the full text of 10 million books/journals. This is spread across 12 shards, with each shard holding about 800,000 documents. When a query matches a journal article, we would like to group all the matching articles from the same journal together (there is a unique id field identifying the journal). Similarly, when there is a match in multiple copies of the same book, we would like to group all results for the same book together (again, we have a unique id field we can group on). Sometimes a short query against the OCR field will result in over one million hits. Are there known performance issues when field collapsing result sets containing a million hits?

We currently index the entire book as one Solr document. We would like to investigate the feasibility of indexing each page as a Solr document with a field indicating the book id. We could then offer our users the choice of a list of the most relevant pages, or a list of the books containing the most relevant pages. We have approximately 3 billion pages. Does anyone have experience using field collapsing on this sort of scale?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search
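A grouped request of the kind described above might look like the following sketch. The `record_no` grouping field and 12-shard layout are taken from elsewhere in the thread; the query field name `ocr` and the shard names are assumptions for illustration:

```python
from urllib.parse import urlencode

# Sketch of the grouped request described above: collapse all matching
# journal articles / book copies onto their shared id across 12 shards.
params = {
    "q": "ocr:dickens",              # assumed full-text field name
    "group": "true",
    "group.field": "record_no",      # unique id shared by copies of a book
    "group.limit": 2,                # top 2 results shown per book/journal
    "rows": 10,
    "shards": ",".join(f"shard{i}" for i in range(1, 13)),
}
query = "/select?" + urlencode(params)
print(query)
```

The page-per-document design would use the same request shape, just with page-level documents carrying the book id in `record_no`.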
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
This won't work, see my thread on Solr 3.6 field collapsing.

Thanks,
Tirthankar

-----Original Message-----
From: Tom Burton-West tburt...@umich.edu
Date: Tue, 21 Aug 2012 18:39:25
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org
Cc: William Dueber dueb...@umich.edu; Phillip Farber pfar...@umich.edu
Subject: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

[quoted text trimmed]