Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-11-03 Thread ilay
Hello all,

  I have a similar situation for grouping: I want to group my products
into top categories for an ecommerce application. The number of groups here
is less than 10, and the total number of docs in the index is 10 million.
Will Solr grouping be an issue here? We have seen an OOM issue when we tried
grouping books by similar editions against the same index. However, if we
are grouping by category, where the number of groups is less than 10, will
it still be a problem? Any thoughts on this would be greatly appreciated.
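For what it's worth, a grouped query for the category case might look as follows. This is only a sketch: the host, core name, and the `category` field are hypothetical, not taken from the poster's schema.

```python
from urllib.parse import urlencode

# Hypothetical grouped query: collapse products onto their top-level
# category. With fewer than 10 distinct categories, the grouping step
# only has to track a handful of group heads, regardless of index size.
params = {
    "q": "shirt",                # example user query
    "group": "true",
    "group.field": "category",   # hypothetical string field, <10 distinct values
    "group.limit": 3,            # top 3 products shown per category
    "rows": 10,
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/products/select?" + query_string
print(url)
```

The point of the sketch is just that `group.field` points at a low-cardinality field, which is the case being asked about.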



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Scalability-of-Solr-Result-Grouping-Field-Collapsing-Millions-Billions-of-documents-tp4002524p4017945.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-23 Thread Mikhail Khludnev
Tom,
Feel free to find my benchmark results for two alternative joining
approaches.
http://blog.griddynamics.com/2012/08/block-join-query-performs.html

Regards

On Thu, Aug 23, 2012 at 4:40 PM, Erick Erickson erickerick...@gmail.com wrote:

 Tom:

 I think my comments were that grouping on a field where there was
 a unique value _per document_ chewed up a lot of resources.
 Conceptually, there's a bucket for each unique group value, and
 grouping on a file path is just asking for trouble.

 But the memory used for grouping should scale with the number of
 unique values in the grouped field.

 Best
 Erick
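Erick's bucket intuition can be sketched in a few lines. This is illustrative only, not Solr's actual memory model: it just shows that the number of buckets is bounded by the cardinality of the grouped field, not by the number of matching documents.

```python
# Illustrative only: the number of group "buckets" tracked during
# collapsing is bounded by the number of unique values in the grouped
# field, not by the number of documents matched.
def count_buckets(doc_group_values):
    buckets = set()
    for value in doc_group_values:
        buckets.add(value)
    return len(buckets)

# Grouping 100,000 docs on a category-like field with 10 distinct values:
categories = [f"cat-{i % 10}" for i in range(100_000)]
# Grouping the same docs on a unique-per-document field (e.g. a file path):
paths = [f"/data/file-{i}" for i in range(100_000)]

print(count_buckets(categories))  # bounded: 10 buckets
print(count_buckets(paths))       # one bucket per doc: 100,000
```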

 On Wed, Aug 22, 2012 at 11:32 PM, Lance Norskog goks...@gmail.com wrote:
  Yes, distributed grouping works, but grouping takes a lot of
  resources. If you can avoid it in distributed mode, so much the better.
 
  On Wed, Aug 22, 2012 at 3:35 PM, Tom Burton-West tburt...@umich.edu
 wrote:
  Thanks Tirthankar,
 
  So the issue is memory use for sorting.  I'm not sure I understand how
  sorting of grouping fields is involved with the defaults and field
  collapsing, since the default sorts by relevance, not grouping field.  On
  the other hand I don't know much about how field collapsing is
 implemented.
 
  So far the few tests I've made haven't revealed any memory problems.  We
  are using very small string fields for grouping and I think that we
  probably only have a couple of cases where we are grouping more than a
 few
  thousand docs.   I will try to find a query with a lot of docs per group
  and take a look at the memory use using JConsole.
 
  Tom
 
 
  On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee 
  tchatter...@commvault.com wrote:
 
   Hi Tom,
 
  We had an issue where we were keeping millions of docs in a single node
 and we were trying to group them on a string field which is nothing but the
 full file path… that caused Solr to go out of memory…

  Erick has explained nicely in the thread why it won’t work, and I had
 to find another way of architecting it.

  How do you think this is different in your case? If you want to group
 by a string field with thousands of similar entries, I am guessing you will
 face the same issue.
 
  Thanks,
 
  Tirthankar
  ***Legal Disclaimer***
  This communication may contain confidential and privileged material
 for
  the
  sole use of the intended recipient. Any unauthorized review, use or
  distribution
  by others is strictly prohibited. If you have received the message in
  error,
  please advise the sender by reply email and delete the message. Thank
 you.
  **
 
 
 
 
  --
  Lance Norskog
  goks...@gmail.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Lance Norskog
How do you separate the documents among the shards? Can you set up the
shards such that one collapse group is only on a single shard, so that
you never have to do distributed grouping?
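Lance's co-location idea can be sketched as routing documents to shards by a hash of the group key, so every member of a collapse group lands on the same shard. This is a client-side sketch for the indexing pipeline (Solr 3.x does not do this routing for you); the key names and shard count are illustrative.

```python
import hashlib

NUM_SHARDS = 12  # matches the 12-shard setup described in the thread

def shard_for(group_key: str) -> int:
    """Route a document to a shard by hashing its group key, so every
    document in the same collapse group lands on the same shard and
    grouping never has to span shards."""
    digest = hashlib.md5(group_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All copies/pages of the same book go to one shard:
print(shard_for("book-12345") == shard_for("book-12345"))  # True
```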

On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
tchatter...@commvault.com wrote:
 This won't work; see my thread on Solr 3.6 field collapsing
 Thanks,
 Tirthankar

 -Original Message-
 From: Tom Burton-West tburt...@umich.edu
 Date: Tue, 21 Aug 2012 18:39:25
 To: solr-user@lucene.apache.org
 Reply-To: solr-user@lucene.apache.org
 Cc: William Dueber dueb...@umich.edu; Phillip Farber pfar...@umich.edu
 Subject: Scalability of Solr Result Grouping/Field Collapsing:
  Millions/Billions of documents?

 Hello all,

 We are thinking about using Solr Field Collapsing on a rather large scale
 and wonder if anyone has experience with performance when doing Field
 Collapsing on millions or billions of documents (details below).  Are
 there performance issues with grouping large result sets?

 Details:
 We have a collection of the full text of 10 million books/journals.  This
 is spread across 12 shards with each shard holding about 800,000
 documents.  When a query matches a journal article, we would like to group
 all the matching articles from the same journal together. (there is a
 unique id field identifying the journal).  Similarly when there is a match
 in multiple copies of the same book we would like to group all results for
 the same book together (again we have a unique id field we can group on).
 Sometimes a short query against the OCR field will result in over one
 million hits.  Are there known performance issues when field collapsing
 result sets containing a million hits?

 We currently index the entire book as one Solr document.  We would like to
 investigate the feasibility of indexing each page as a Solr document with a
 field indicating the book id.  We could then offer our users the choice of
 a list of the most relevant pages, or a list of the books containing the
 most relevant pages.  We have approximately 3 billion pages.   Does anyone
 have experience using field collapsing on this sort of scale?

 Tom

 Tom Burton-West
 Information Retrieval Programmer
 Digital Library Production Service
 University of Michigan Library
 http://www.hathitrust.org/blogs/large-scale-search



-- 
Lance Norskog
goks...@gmail.com


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tirthankar Chatterjee
You can collapse in each shard with a separate query
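One way to read this suggestion: send each shard its own non-distributed grouped query and merge the per-shard group heads client-side. The merge policy below (rank groups by their best document score) is an assumption for illustration, not how Solr's distributed grouping works internally.

```python
# Sketch: collapse on each shard with a separate (non-distributed) query,
# then merge the per-shard group heads client-side. The response shape
# here (group value -> best score per shard) is an illustrative stand-in.
def merge_shard_groups(per_shard_groups, limit=10):
    """per_shard_groups: list of {group_value: best_score} dicts, one per shard."""
    best = {}
    for shard in per_shard_groups:
        for group_value, score in shard.items():
            if group_value not in best or score > best[group_value]:
                best[group_value] = score
    # Rank merged groups by the score of their best document.
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]

shard1 = {"journal-A": 3.2, "journal-B": 1.1}
shard2 = {"journal-A": 2.7, "journal-C": 2.9}
print(merge_shard_groups([shard1, shard2], limit=2))
```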

Lance Norskog goks...@gmail.com wrote:


How do you separate the documents among the shards? Can you set up the
shards such that one collapse group is only on a single shard, so that
you never have to do distributed grouping?



Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance,

I don't understand enough about how field collapsing is implemented, but I
thought it worked with distributed search.  Are you saying it only works if
everything that needs collapsing is on the same shard?

Tom

On Wed, Aug 22, 2012 at 2:41 AM, Lance Norskog goks...@gmail.com wrote:

 How do you separate the documents among the shards? Can you set up the
 shards such that one collapse group is only on a single shard, so that
 you never have to do distributed grouping?




Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Tirthankar,

Can you give me a quick summary of what won't work and why?
I couldn't figure it out from looking at your thread.  You seem to have a
different issue, but maybe I'm missing something here.

Tom

On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee 
tchatter...@commvault.com wrote:

 This won't work; see my thread on Solr 3.6 field collapsing
 Thanks,
 Tirthankar




Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance and Tirthankar,

We are currently using Solr 3.6.  I tried a search across our current 12
shards, grouping by book id (record_no in our schema), and it seems to work
fine (the query, with the actual shard URLs changed, is appended below).

I then searched for the record_no of the second group in the results to
confirm that the number of records being folded is correct. In both cases
the numFound is 505 so it seems as though the record counts for the group
are correct.  Then I tried the same search but changed the shards parameter
to limit the search to 1/2 of the shards and got numFound = 325.  This
shows that the items in the group are distributed between different shards.

What am I missing here?   What is it that you are saying does not work?

Tom
Field Collapse query (IP address changed; newlines added and shard
URLs simplified for readability):


http://solr-myhost.edu/serve-9/select?indent=on&version=2.2
&shards=shard1,shard2,shard3,shard4,shard5,shard6,...shard12
&q=title:nature&fq=&start=0&rows=10&fl=id,author,title,volume_enumcron,score
&group=true&group.field=record_no&group.limit=2
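To sanity-check the per-group numFound values described above, one can read the counts out of the grouped response. The JSON shape below follows Solr's standard grouped-response format (`grouped.<field>.groups[].doclist.numFound`), but the values are made up for illustration.

```python
# Illustrative grouped response in Solr's standard shape (values made up,
# loosely echoing the 505/325 counts mentioned in the mail).
response = {
    "grouped": {
        "record_no": {
            "matches": 830,
            "groups": [
                {"groupValue": "rec-001",
                 "doclist": {"numFound": 505, "docs": [{"id": "a"}, {"id": "b"}]}},
                {"groupValue": "rec-002",
                 "doclist": {"numFound": 325, "docs": [{"id": "c"}]}},
            ],
        }
    }
}

def group_counts(resp, field):
    """Map each groupValue to the total hits collapsed into that group."""
    groups = resp["grouped"][field]["groups"]
    return {g["groupValue"]: g["doclist"]["numFound"] for g in groups}

print(group_counts(response, "record_no"))  # {'rec-001': 505, 'rec-002': 325}
```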


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Thanks Tirthankar,

So the issue is memory use for sorting.  I'm not sure I understand how
sorting of grouping fields is involved with the defaults and field
collapsing, since the default sorts by relevance, not grouping field.  On
the other hand I don't know much about how field collapsing is implemented.

So far the few tests I've made haven't revealed any memory problems.  We
are using very small string fields for grouping and I think that we
probably only have a couple of cases where we are grouping more than a few
thousand docs.   I will try to find a query with a lot of docs per group
and take a look at the memory use using JConsole.

Tom


On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee 
tchatter...@commvault.com wrote:

  Hi Tom,

 We had an issue where we were keeping millions of docs in a single node and
 we were trying to group them on a string field which is nothing but the full
 file path… that caused Solr to go out of memory…

 Erick has explained nicely in the thread why it won’t work, and I had
 to find another way of architecting it.

 How do you think this is different in your case? If you want to group by a
 string field with thousands of similar entries, I am guessing you will face
 the same issue.

 Thanks,

 Tirthankar



Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Lance Norskog
Yes, distributed grouping works, but grouping takes a lot of
resources. If you can avoid it in distributed mode, so much the better.





-- 
Lance Norskog
goks...@gmail.com


Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-21 Thread Tom Burton-West
Hello all,

We are thinking about using Solr Field Collapsing on a rather large scale
and wonder if anyone has experience with performance when doing Field
Collapsing on millions or billions of documents (details below).  Are
there performance issues with grouping large result sets?

Details:
We have a collection of the full text of 10 million books/journals.  This
is spread across 12 shards with each shard holding about 800,000
documents.  When a query matches a journal article, we would like to group
all the matching articles from the same journal together. (there is a
unique id field identifying the journal).  Similarly when there is a match
in multiple copies of the same book we would like to group all results for
the same book together (again we have a unique id field we can group on).
Sometimes a short query against the OCR field will result in over one
million hits.  Are there known performance issues when field collapsing
result sets containing a million hits?

We currently index the entire book as one Solr document.  We would like to
investigate the feasibility of indexing each page as a Solr document with a
field indicating the book id.  We could then offer our users the choice of
a list of the most relevant pages, or a list of the books containing the
most relevant pages.  We have approximately 3 billion pages.   Does anyone
have experience using field collapsing on this sort of scale?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search
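The page-level design in the question amounts to collapsing page hits onto their book id and keeping only the best page per book (group.limit=1 behaviour). A toy sketch of that collapse, with hypothetical field names:

```python
# Toy sketch of collapsing page-level hits onto their book id, keeping
# only the best-scoring page per book (group.limit=1 behaviour).
# The book_id/page_no fields are hypothetical.
def collapse_pages(page_hits):
    """page_hits: list of (book_id, page_no, score) tuples."""
    best = {}
    for book_id, page_no, score in page_hits:
        if book_id not in best or score > best[book_id][1]:
            best[book_id] = (page_no, score)
    # Return books ranked by their most relevant page.
    return sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)

hits = [("book-1", 12, 0.9), ("book-1", 40, 0.7), ("book-2", 3, 0.8)]
print(collapse_pages(hits))  # [('book-1', (12, 0.9)), ('book-2', (3, 0.8))]
```

At 3 billion pages, the open question in the mail is exactly how much state a real grouping implementation must hold to do this per query.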


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-21 Thread Tirthankar Chatterjee
This won't work; see my thread on Solr 3.6 field collapsing
Thanks,
Tirthankar
