Re: solr as nosql - pulling all docs vs deep paging limitations
You can do range queries without an upper bound and just limit the number of results. Then you look at the last result to obtain the new lower bound. -- Jens On 17/12/13 20:23, Petersen, Robert wrote: My use case is basically to do a dump of all contents of the index with no ordering needed. It's actually to be a product data export for third parties. Unique key is product sku. I could take the min sku and range query up to the max sku, but the skus are not contiguous because some get turned off and only some are valid for export, so each range would return a different number of products (which may or may not be acceptable, and I might be able to kind of hide that with some code). -Original Message- From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] Sent: Tuesday, December 17, 2013 10:41 AM To: solr-user Subject: Re: solr as nosql - pulling all docs vs deep paging limitations Hoss, What about misusing Solr for SELECT * FROM ... WHERE ...-style queries? I'm sure you've been asked many times for that. What if the client doesn't need to rank results at all, but just requests an unordered filtered result, like they are used to in an RDBMS? Do you feel it will never be considered a reasonable use case for Solr, or is there a well-known approach for dealing with it? On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Then I remembered we currently don't allow deep paging in our current : search indexes as performance declines the deeper you go. Is this still : the case? Coincidentally, i'm working on a new cursor-based API to make this much more feasible as we speak.. https://issues.apache.org/jira/browse/SOLR-5463 I did some simple perf testing of the strawman approach and posted the results last week... http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/ ...current iterations on the patch are to eliminate the strawman code to improve performance even more and beef up the test cases. : If so, is there another approach to make all the data in a collection : easily available for retrieval? The only thing I can think of is to ... : Then I was thinking we could have a field with an incrementing numeric : value which could be used to perform range queries as a substitute for : paging through everything. Ie queries like 'IncrementalField:[1 TO : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to : maintain as we update the index unless we reindex the entire collection : every time we update any docs at all. As i mentioned in the blog above, as long as you have a uniqueKey field that supports range queries, bulk exporting of all documents is fairly trivial by sorting on your uniqueKey field and using an fq that also filters on your uniqueKey field; modify the fq each time to change the lower bound to match the highest ID you got on the previous page. This approach works really well in simple cases where you want to fetch all documents matching a query and then process/sort them by some other criteria on the client -- but it's not viable if it's important to you that the documents come back from solr in score order before your client gets them, because you want to stop fetching once some criteria is met in your client.
Example: you have billions of documents matching a query, you want to fetch all of them sorted by score desc and crunch them on your client to compute some stats, and once your client-side stat crunching tells you you have enough results (which might be after the 1000th result, or might be after the millionth result) then you want to stop. SOLR-5463 will help even in that latter case. The bulk of the patch should be easy to use in the next day or so (having other people try it out and test it in their applications would be *very* helpful) and hopefully show up in Solr 4.7 -Hoss http://www.lucidworks.com/ -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
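For anyone who wants to try the fq-walking export Hoss describes above, a minimal sketch in Python follows. It is only an illustration of the technique, not code from this thread: the core URL, the uniqueKey field name (id), the batch size, and the dump_all helper are all assumptions, and it presumes a Solr 4.x /select handler returning JSON, called via the requests library.

# Minimal sketch (assumptions noted above): export every document matching a
# query by sorting on the uniqueKey and raising an fq lower bound each page.
import requests

SOLR = "http://localhost:8983/solr/collection1/select"   # illustrative URL
ROWS = 1000                                               # batch size per request

def dump_all(query="*:*", key="id"):
    last_id = None
    while True:
        params = {"q": query, "sort": key + " asc", "start": 0,
                  "rows": ROWS, "wt": "json"}
        if last_id is not None:
            # exclusive lower bound: only docs whose id sorts after the last one seen;
            # string ids containing special characters would need escaping/quoting here
            params["fq"] = "%s:{%s TO *]" % (key, last_id)
        docs = requests.get(SOLR, params=params).json()["response"]["docs"]
        if not docs:
            break
        for doc in docs:
            yield doc
        last_id = docs[-1][key]

for doc in dump_all():
    pass  # write each document to the export here

Because start stays at 0 and only the fq lower bound moves, every request is as cheap as a first page, which is why this sidesteps the deep-paging penalty that plain start/rows paging runs into.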
Re: solr as nosql - pulling all docs vs deep paging limitations
Aha! SOLR-5244 is a particular case of what I'm asking about. I wonder who else considers it useful? (I'm sorry if I hijacked the thread) On 18.12.2013 at 5:41, Joel Bernstein joels...@gmail.com wrote: They are for different use cases. Hoss's approach, I believe, focuses on deep paging of ranked search results. SOLR-5244 focuses on the batch export of an entire unranked search result in binary format. It's basically a very efficient bulk extract for Solr. On Tue, Dec 17, 2013 at 6:51 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Joel - can you please elaborate a bit on how this compares with Hoss' approach? Complementary? Thanks, Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5244 is also working in this direction. This focuses on efficient binary extract of entire search results. On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hoss is working on it. Search for deep paging or cursor in JIRA. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 17, 2013 12:30 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi solr users, We have a new use case where we need to make a pile of data available as XML to a client, and I was thinking we could easily put all this data into a solr collection and the client could just do a star search and page through all the results to obtain the data we need to give them. Then I remembered we currently don't allow deep paging in our current search indexes as performance declines the deeper you go. Is this still the case? If so, is there another approach to make all the data in a collection easily available for retrieval? The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a very good approach because the DB might have been updated with new data which hasn't been indexed yet and so all the ids might not be in there (which may or may not matter I suppose). Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything. Ie queries like 'IncrementalField:[1 TO 100]' 'IncrementalField:[101 TO 200]', but this would be difficult to maintain as we update the index unless we reindex the entire collection every time we update any docs at all. Is this perhaps not a good use case for solr? Should I use something else, or is there another approach that would work here to allow a client to pull groups of docs in a collection through the rest api until the client has gotten them all? Thanks Robi -- Joel Bernstein Search Engineer at Heliosearch -- Joel Bernstein Search Engineer at Heliosearch
Re: solr as nosql - pulling all docs vs deep paging limitations
: : What about misusing Solr for SELECT * FROM ... WHERE ...-style queries? I'm sure you've been : asked many times for that. : What if the client doesn't need to rank results at all, but just requests an : unordered filtered result, like they are used to in an RDBMS? : Do you feel it will never be considered a reasonable use case for Solr, or : is there a well-known approach for dealing with it? If you don't care about ordering, then the approach i described (either using SOLR-5463, or just using a sort by uniqueKey with increasing range filters on the id) should work fine -- the fact that they come back sorted by id is just an implementation detail that makes it possible to batch the records (the same way most SQL databases will likely give you back the docs based on whatever primary key index you have). I think the key difference between approaches like SOLR-5244 vs the cursor work in SOLR-5463 is that SOLR-5244 is really targeted at dumping all data about all docs from a core (matching the query) in a single request/response -- for something like SolrCloud, the client would manually need to hit each shard (but as i understand it from the description, that's kind of the point; it's aiming to be a very low-level bulk export). With the cursor approach in SOLR-5463, we do aggregation across all shards, and we support arbitrary sorts, and you can control the batch size from the client and iterate over multiple request/responses of that size. If there are any network hiccups, you can re-do a request. If you process half the docs that match (in a particular order) and then decide I've got all the docs i need for my purposes, you can stop requesting the continuation of that cursor. -Hoss http://www.lucidworks.com/
Re: solr as nosql - pulling all docs vs deep paging limitations
: You can do range queries without an upper bound and just limit the number of : results. Then you look at the last result to obtain the new lower bound. exactly. instead of this:
First: q=foo&start=0&rows=$ROWS
After: q=foo&start=$X&rows=$ROWS
...where $ROWS is how big a batch of docs you can handle at one time, and you increase the value of $X by the value of $ROWS on each successive request, you can just do this...
First: q=foo&start=0&rows=$ROWS&sort=id+asc
After: q=foo&start=0&rows=$ROWS&sort=id+asc&fq=id:{$X TO *]
...where $X is whatever the last id you got on the previous page was. Or: you try out the patch in SOLR-5463 and do something like this...
First: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=*
After: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=$X
...where $X is whatever nextCursorMark you got from the previous page. -Hoss http://www.lucidworks.com/
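As a companion to the fq-walking sketch earlier in the thread, here is a minimal Python version of the cursorMark loop from Hoss's second example. SOLR-5463 was still a patch when this was written, so treat the details as assumptions: the cursorMark request parameter and the nextCursorMark response key come from Hoss's example, the stop condition (the mark no longer changing between requests) is an assumption about the patch's eventual behavior, and the URL, field name, and cursor_all helper are illustrative.

# Minimal sketch (assumptions noted above): iterate a cursor until it stops
# advancing, yielding documents as each batch arrives.
import requests

SOLR = "http://localhost:8983/solr/collection1/select"   # illustrative URL
ROWS = 1000

def cursor_all(query="foo", key="id"):
    cursor = "*"                       # "*" starts a new cursor, per the example above
    while True:
        resp = requests.get(SOLR, params={
            "q": query,
            "sort": key + " asc",      # the sort ends on the uniqueKey field
            "rows": ROWS,
            "cursorMark": cursor,
            "wt": "json",
        }).json()
        for doc in resp["response"]["docs"]:
            yield doc
        nxt = resp.get("nextCursorMark")
        if nxt is None or nxt == cursor:   # cursor stopped moving: no more results
            break
        cursor = nxt

Because each batch is an ordinary request/response, a client can simply re-issue a request after a network hiccup, or stop iterating early once its own criteria are met, as Hoss notes earlier in the thread.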
Re: solr as nosql - pulling all docs vs deep paging limitations
Us too. That's going to be huge for us! Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. The Science of Influence Marketing 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Wed, Dec 18, 2013 at 9:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Aha! SOLR-5244 is a particular case of what I'm asking about. I wonder who else considers it useful? (I'm sorry if I hijacked the thread) On 18.12.2013 at 5:41, Joel Bernstein joels...@gmail.com wrote: They are for different use cases. Hoss's approach, I believe, focuses on deep paging of ranked search results. SOLR-5244 focuses on the batch export of an entire unranked search result in binary format. It's basically a very efficient bulk extract for Solr. On Tue, Dec 17, 2013 at 6:51 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Joel - can you please elaborate a bit on how this compares with Hoss' approach? Complementary? Thanks, Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5244 is also working in this direction. This focuses on efficient binary extract of entire search results. On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hoss is working on it. Search for deep paging or cursor in JIRA. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 17, 2013 12:30 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi solr users, We have a new use case where we need to make a pile of data available as XML to a client, and I was thinking we could easily put all this data into a solr collection and the client could just do a star search and page through all the results to obtain the data we need to give them. Then I remembered we currently don't allow deep paging in our current search indexes as performance declines the deeper you go. Is this still the case? If so, is there another approach to make all the data in a collection easily available for retrieval? The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a very good approach because the DB might have been updated with new data which hasn't been indexed yet and so all the ids might not be in there (which may or may not matter I suppose). Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything. Ie queries like 'IncrementalField:[1 TO 100]' 'IncrementalField:[101 TO 200]', but this would be difficult to maintain as we update the index unless we reindex the entire collection every time we update any docs at all. Is this perhaps not a good use case for solr? Should I use something else, or is there another approach that would work here to allow a client to pull groups of docs in a collection through the rest api until the client has gotten them all? Thanks Robi -- Joel Bernstein Search Engineer at Heliosearch -- Joel Bernstein Search Engineer at Heliosearch
Re: solr as nosql - pulling all docs vs deep paging limitations
On 12/17/13 1:16 PM, Chris Hostetter wrote: As i mentioned in the blog above, as long as you have a uniqueKey field that supports range queries, bulk exporting of all documents is fairly trivial by sorting on your uniqueKey field and using an fq that also filters on your uniqueKey field; modify the fq each time to change the lower bound to match the highest ID you got on the previous page. Aha, very nice suggestion, I hadn't thought of this when trying myself to figure out decent ways to 'fetch all documents matching a query' for some bulk offline processing. One question that I was never sure about when trying to do things like this -- is this going to end up blowing the query and/or document caches if used on a live Solr, by filling up those caches with the results of the 'bulk' export? If so, is there any way to avoid that? Or does it probably not really matter? Jonathan
Re: solr as nosql - pulling all docs vs deep paging limitations
: One question that I was never sure about when trying to do things like this -- : is this going to end up blowing the query and/or document caches if used on a : live Solr? By filling up those caches with the results of the 'bulk' export? : If so, is there any way to avoid that? Or does it probably not really matter? q={!cache=false}... -Hoss http://www.lucidworks.com/
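Applied to the bulk-export sketches earlier in the thread, Hoss's answer just means prefixing the main query string with that local param. A hedged illustration follows; the field names, query, and batch size are still the illustrative values from those sketches.

# Same request parameters as the earlier sketches, but with the main query
# prefixed by the cache=false local param so the bulk export does not churn
# the query result cache; the rest of the export loop is unchanged.
params = {
    "q": "{!cache=false}*:*",
    "sort": "id asc",
    "rows": 1000,
    "wt": "json",
}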
Re: solr as nosql - pulling all docs vs deep paging limitations
On Wed, Dec 18, 2013 at 8:03 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : : What about misusing Solr for SELECT * FROM ... WHERE ...-style queries? I'm sure you've been : asked many times for that. : What if the client doesn't need to rank results at all, but just requests an : unordered filtered result, like they are used to in an RDBMS? : Do you feel it will never be considered a reasonable use case for Solr, or : is there a well-known approach for dealing with it? If you don't care about ordering, then the approach i described (either using SOLR-5463, or just using a sort by uniqueKey with increasing range filters on the id) should work fine -- the fact that they come back sorted by id is just an implementation detail that makes it possible to batch the records From the functional standpoint that's true, but performance might matter in such edge cases; e.g. I wonder why the priority queue is needed even if we request sort=_docid_. (the same way most SQL databases will likely give you back the docs based on whatever primary key index you have) I think the key difference between approaches like SOLR-5244 vs the cursor work in SOLR-5463 is that SOLR-5244 is really targeted at dumping all data about all docs from a core (matching the query) in a single request/response -- for something like SolrCloud, the client would manually need to hit each shard (but as i understand it from the description, that's kind of the point; it's aiming to be a very low-level bulk export). With the cursor approach in SOLR-5463, we do aggregation across all shards, and we support arbitrary sorts, and you can control the batch size from the client and iterate over multiple request/responses of that size. If there are any network hiccups, you can re-do a request. If you process half the docs that match (in a particular order) and then decide I've got all the docs i need for my purposes, you can stop requesting the continuation of that cursor. -Hoss http://www.lucidworks.com/ -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: solr as nosql - pulling all docs vs deep paging limitations
: Then I remembered we currently don't allow deep paging in our current : search indexes as performance declines the deeper you go. Is this still : the case? Coincidentally, i'm working on a new cursor-based API to make this much more feasible as we speak.. https://issues.apache.org/jira/browse/SOLR-5463 I did some simple perf testing of the strawman approach and posted the results last week... http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/ ...current iterations on the patch are to eliminate the strawman code to improve performance even more and beef up the test cases. : If so, is there another approach to make all the data in a collection : easily available for retrieval? The only thing I can think of is to ... : Then I was thinking we could have a field with an incrementing numeric : value which could be used to perform range queries as a substitute for : paging through everything. Ie queries like 'IncrementalField:[1 TO : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to : maintain as we update the index unless we reindex the entire collection : every time we update any docs at all. As i mentioned in the blog above, as long as you have a uniqueKey field that supports range queries, bulk exporting of all documents is fairly trivial by sorting on your uniqueKey field and using an fq that also filters on your uniqueKey field; modify the fq each time to change the lower bound to match the highest ID you got on the previous page. This approach works really well in simple cases where you want to fetch all documents matching a query and then process/sort them by some other criteria on the client -- but it's not viable if it's important to you that the documents come back from solr in score order before your client gets them, because you want to stop fetching once some criteria is met in your client. Example: you have billions of documents matching a query, you want to fetch all of them sorted by score desc and crunch them on your client to compute some stats, and once your client-side stat crunching tells you you have enough results (which might be after the 1000th result, or might be after the millionth result) then you want to stop. SOLR-5463 will help even in that latter case. The bulk of the patch should be easy to use in the next day or so (having other people try it out and test it in their applications would be *very* helpful) and hopefully show up in Solr 4.7 -Hoss http://www.lucidworks.com/
Re: solr as nosql - pulling all docs vs deep paging limitations
Hoss is working on it. Search for deep paging or cursor in JIRA. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 17, 2013 12:30 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi solr users, We have a new use case where need to make a pile of data available as XML to a client and I was thinking we could easily put all this data into a solr collection and the client could just do a star search and page through all the results to obtain the data we need to give them. Then I remembered we currently don't allow deep paging in our current search indexes as performance declines the deeper you go. Is this still the case? If so, is there another approach to make all the data in a collection easily available for retrieval? The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull out the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)' which doesn't seem like a very good approach because the DB might have been updated with new data which hasn't been indexed yet and so all the ids might not be in there (which may or may not matter I suppose). Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything. Ie queries like 'IncrementalField:[1 TO 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to maintain as we update the index unless we reindex the entire collection every time we update any docs at all. Is this perhaps not a good use case for solr? Should I use something else or is there another approach that would work here to allow a client to pull groups of docs in a collection through the rest api until the client has gotten them all? Thanks Robi
RE: solr as nosql - pulling all docs vs deep paging limitations
My use case is basically to do a dump of all contents of the index with no ordering needed. It's actually to be a product data export for third parties. Unique key is product sku. I could take the min sku and range query up to the max sku but the skus are not contiguous because some get turned off and only some are valid for export so each range would return a different number of products (which may or may not be acceptable and I might be able to kind of hide that with some code). -Original Message- From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] Sent: Tuesday, December 17, 2013 10:41 AM To: solr-user Subject: Re: solr as nosql - pulling all docs vs deep paging limitations Hoss, What about SELECT * FROM WHERE ... like misusing Solr? I'm sure you've been asked many times for that. What if client don't need to rank results somehow, but just requesting unordered filtering result like they are used to in RDBMS? Do you feel it will never considered as a resonable usecase for Solr? or there is a well known approach for dealing with? On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: : Then I remembered we currently don't allow deep paging in our current : search indexes as performance declines the deeper you go. Is this still : the case? Coincidently, i'm working on a new cursor based API to make this much more feasible as we speak.. https://issues.apache.org/jira/browse/SOLR-5463 I did some simple perf testing of the strawman approach and posted the results last week... http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iterat ion-of-large-result-sets/ ...current iterations on the patch are to eliminate the strawman code to improve performance even more and beef up the test cases. : If so, is there another approach to make all the data in a collection : easily available for retrieval? The only thing I can think of is to ... : Then I was thinking we could have a field with an incrementing numeric : value which could be used to perform range queries as a substitute for : paging through everything. Ie queries like 'IncrementalField:[1 TO : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to : maintain as we update the index unless we reindex the entire collection : every time we update any docs at all. As i mentioned in the blog above, as long as you have a uniqueKey field that supports range queries, bulk exporting of all documents is fairly trivial by sorting on your uniqueKey field and using an fq that also filters on your uniqueKey field modify the fq each time to change the lower bound to match the highest ID you got on the previous page. This approach works really well in simple cases where you wnat to fetch all documents matching a query and then process/sort them by some other criteria on the client -- but it's not viable if it's important to you that the documents come back from solr in score order before your client gets them because you want to stop fetching once some criteria is met in your client. Example: you have billions of documents matching a query, you want to fetch all sorted by score desc and crunch them on your client to compute some stats, and once your client side stat crunching tells you you have enough results (which might be after the 1000th result, or might be after the millionth result) then you want to stop. SOLR-5463 will help even in that later case. 
The bulk of the patch should be easy to use in the next day or so (having other people try it out and test it in their applications would be *very* helpful) and hopefully show up in Solr 4.7 -Hoss http://www.lucidworks.com/ -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: solr as nosql - pulling all docs vs deep paging limitations
Hoss, What about misusing Solr for SELECT * FROM ... WHERE ...-style queries? I'm sure you've been asked many times for that. What if the client doesn't need to rank results at all, but just requests an unordered filtered result, like they are used to in an RDBMS? Do you feel it will never be considered a reasonable use case for Solr, or is there a well-known approach for dealing with it? On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Then I remembered we currently don't allow deep paging in our current : search indexes as performance declines the deeper you go. Is this still : the case? Coincidentally, i'm working on a new cursor-based API to make this much more feasible as we speak.. https://issues.apache.org/jira/browse/SOLR-5463 I did some simple perf testing of the strawman approach and posted the results last week... http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/ ...current iterations on the patch are to eliminate the strawman code to improve performance even more and beef up the test cases. : If so, is there another approach to make all the data in a collection : easily available for retrieval? The only thing I can think of is to ... : Then I was thinking we could have a field with an incrementing numeric : value which could be used to perform range queries as a substitute for : paging through everything. Ie queries like 'IncrementalField:[1 TO : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to : maintain as we update the index unless we reindex the entire collection : every time we update any docs at all. As i mentioned in the blog above, as long as you have a uniqueKey field that supports range queries, bulk exporting of all documents is fairly trivial by sorting on your uniqueKey field and using an fq that also filters on your uniqueKey field; modify the fq each time to change the lower bound to match the highest ID you got on the previous page. This approach works really well in simple cases where you want to fetch all documents matching a query and then process/sort them by some other criteria on the client -- but it's not viable if it's important to you that the documents come back from solr in score order before your client gets them, because you want to stop fetching once some criteria is met in your client. Example: you have billions of documents matching a query, you want to fetch all of them sorted by score desc and crunch them on your client to compute some stats, and once your client-side stat crunching tells you you have enough results (which might be after the 1000th result, or might be after the millionth result) then you want to stop. SOLR-5463 will help even in that latter case. The bulk of the patch should be easy to use in the next day or so (having other people try it out and test it in their applications would be *very* helpful) and hopefully show up in Solr 4.7 -Hoss http://www.lucidworks.com/ -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: solr as nosql - pulling all docs vs deep paging limitations
SOLR-5244 is also working in this direction. This focuses on efficient binary extract of entire search results. On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hoss is working on it. Search for deep paging or cursor in JIRA. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 17, 2013 12:30 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi solr users, We have a new use case where need to make a pile of data available as XML to a client and I was thinking we could easily put all this data into a solr collection and the client could just do a star search and page through all the results to obtain the data we need to give them. Then I remembered we currently don't allow deep paging in our current search indexes as performance declines the deeper you go. Is this still the case? If so, is there another approach to make all the data in a collection easily available for retrieval? The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull out the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)' which doesn't seem like a very good approach because the DB might have been updated with new data which hasn't been indexed yet and so all the ids might not be in there (which may or may not matter I suppose). Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything. Ie queries like 'IncrementalField:[1 TO 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to maintain as we update the index unless we reindex the entire collection every time we update any docs at all. Is this perhaps not a good use case for solr? Should I use something else or is there another approach that would work here to allow a client to pull groups of docs in a collection through the rest api until the client has gotten them all? Thanks Robi -- Joel Bernstein Search Engineer at Heliosearch
Re: solr as nosql - pulling all docs vs deep paging limitations
Joel - can you please elaborate a bit on how this compares with Hoss' approach? Complementary? Thanks, Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5244 is also working in this direction. This focuses on efficient binary extract of entire search results. On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hoss is working on it. Search for deep paging or cursor in JIRA. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 17, 2013 12:30 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi solr users, We have a new use case where need to make a pile of data available as XML to a client and I was thinking we could easily put all this data into a solr collection and the client could just do a star search and page through all the results to obtain the data we need to give them. Then I remembered we currently don't allow deep paging in our current search indexes as performance declines the deeper you go. Is this still the case? If so, is there another approach to make all the data in a collection easily available for retrieval? The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull out the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)' which doesn't seem like a very good approach because the DB might have been updated with new data which hasn't been indexed yet and so all the ids might not be in there (which may or may not matter I suppose). Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything. Ie queries like 'IncrementalField:[1 TO 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to maintain as we update the index unless we reindex the entire collection every time we update any docs at all. Is this perhaps not a good use case for solr? Should I use something else or is there another approach that would work here to allow a client to pull groups of docs in a collection through the rest api until the client has gotten them all? Thanks Robi -- Joel Bernstein Search Engineer at Heliosearch
Re: solr as nosql - pulling all docs vs deep paging limitations
They are for different use cases. Hoss's approach, I believe, focuses on deep paging of ranked search results. SOLR-5244 focuses on the batch export of an entire unranked search result in binary format. It's basically a very efficient bulk extract for Solr. On Tue, Dec 17, 2013 at 6:51 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Joel - can you please elaborate a bit on how this compares with Hoss' approach? Complementary? Thanks, Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5244 is also working in this direction. This focuses on efficient binary extract of entire search results. On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hoss is working on it. Search for deep paging or cursor in JIRA. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 17, 2013 12:30 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi solr users, We have a new use case where need to make a pile of data available as XML to a client and I was thinking we could easily put all this data into a solr collection and the client could just do a star search and page through all the results to obtain the data we need to give them. Then I remembered we currently don't allow deep paging in our current search indexes as performance declines the deeper you go. Is this still the case? If so, is there another approach to make all the data in a collection easily available for retrieval? The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull out the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)' which doesn't seem like a very good approach because the DB might have been updated with new data which hasn't been indexed yet and so all the ids might not be in there (which may or may not matter I suppose). Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything. Ie queries like 'IncrementalField:[1 TO 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to maintain as we update the index unless we reindex the entire collection every time we update any docs at all. Is this perhaps not a good use case for solr? Should I use something else or is there another approach that would work here to allow a client to pull groups of docs in a collection through the rest api until the client has gotten them all? Thanks Robi -- Joel Bernstein Search Engineer at Heliosearch -- Joel Bernstein Search Engineer at Heliosearch