My use case is basically to do a dump of all contents of the index, with no
ordering needed.  It's actually going to be a product data export for third
parties.  The unique key is the product sku.  I could take the min sku and
range-query up to the max sku, but the skus are not contiguous (some get
turned off and only some are valid for export), so each range would return a
different number of products.  That may or may not be acceptable, and I might
be able to hide it with some code.
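
For what it's worth, here is a rough sketch of the "sort by uniqueKey, raise
an fq lower bound each page" approach Hoss describes below, applied to the sku
export.  The core URL, the 'sku' field name, and the 'active:true' "valid for
export" filter are assumptions for illustration only, not anything from the
thread:

    import requests

    SOLR = "http://localhost:8983/solr/products/select"   # assumed core URL
    ROWS = 1000

    def export_all_skus():
        """Walk the whole index in sku order, raising the fq lower bound each page."""
        last_sku = None
        while True:
            if last_sku is None:
                fq = "sku:[* TO *]"              # first page: no lower bound
            else:
                # exclusive lower bound; assumes sku values need no query escaping
                fq = "sku:{%s TO *]" % last_sku
            params = {
                "q": "active:true",              # assumed "valid for export" filter
                "fq": fq,
                "sort": "sku asc",
                "rows": ROWS,
                "wt": "json",
            }
            docs = requests.get(SOLR, params=params).json()["response"]["docs"]
            if not docs:
                break
            for doc in docs:
                yield doc
            last_sku = docs[-1]["sku"]           # highest sku on this "page"

Each page's last sku becomes the next lower bound, so the gaps from turned-off
or non-exportable skus just get skipped rather than producing empty ranges.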

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,

What about SELECT * FROM ... WHERE ... style (mis)use of Solr? I'm sure you've
been asked about that many times.
What if the client doesn't need to rank results at all, but just wants an
unordered filtered result set, like they are used to in an RDBMS?
Do you feel that will never be considered a reasonable use case for Solr, or
is there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
<hossman_luc...@fucit.org> wrote:

>
> : Then I remembered we currently don't allow deep paging in our current
> : search indexes as performance declines the deeper you go.  Is this
> : still the case?
>
> Coincidentally, i'm working on a new cursor-based API to make this much
> more feasible as we speak...
>
> https://issues.apache.org/jira/browse/SOLR-5463
>
> I did some simple perf testing of the strawman approach and posted the 
> results last week...
>
>
> http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>
> ...current iterations on the patch are to eliminate the strawman code 
> to improve performance even more and beef up the test cases.
>
> : If so, is there another approach to make all the data in a collection
> : easily available for retrieval?  The only thing I can think of is to
>         ...
> : Then I was thinking we could have a field with an incrementing numeric
> : value which could be used to perform range queries as a substitute for
> : paging through everything.  Ie queries like 'IncrementalField:[1 TO
> : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
> : maintain as we update the index unless we reindex the entire collection
> : every time we update any docs at all.
>
> As i mentioned in the blog above, as long as you have a uniqueKey
> field that supports range queries, bulk exporting of all documents is
> fairly trivial: sort on your uniqueKey field and use an fq that also
> filters on your uniqueKey field, modifying the fq each time to change
> the lower bound to match the highest ID you got on the previous "page".
>
> This approach works really well in simple cases where you want to
> "fetch all" documents matching a query and then process/sort them by 
> some other criteria on the client -- but it's not viable if it's 
> important to you that the documents come back from solr in score order 
> before your client gets them because you want to "stop fetching" once 
> some criteria is met in your client.  Example: you have billions of 
> documents matching a query, you want to fetch all sorted by score desc 
> and crunch them on your client to compute some stats, and once your 
> client side stat crunching tells you you have enough results (which 
> might be after the 1000th result, or might be after the millionth result) 
> then you want to stop.
>
> SOLR-5463 will help even in that latter case.  The bulk of the patch
> should be easy to use in the next day or so (having other people try it
> out and test it in their applications would be *very* helpful) and it
> will hopefully show up in Solr 4.7.
>
> -Hoss
> http://www.lucidworks.com/
>



--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
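
For reference, a rough sketch of what the SOLR-5463 cursor approach looks like
from the client side, assuming the parameter names that eventually shipped in
Solr 4.7 (cursorMark in the request, nextCursorMark in the response); the core
URL and the 'id' uniqueKey field are assumptions:

    import requests

    SOLR = "http://localhost:8983/solr/products/select"   # assumed core URL

    def cursor_export(q="*:*"):
        """Iterate every match using cursorMark instead of deep start/rows paging."""
        cursor = "*"                             # the initial cursor mark
        while True:
            params = {
                "q": q,
                "sort": "id asc",                # sort must include the uniqueKey as a tie-breaker
                "rows": 1000,
                "cursorMark": cursor,
                "wt": "json",
            }
            rsp = requests.get(SOLR, params=params).json()
            for doc in rsp["response"]["docs"]:
                yield doc
            next_cursor = rsp["nextCursorMark"]
            if next_cursor == cursor:            # cursor stopped advancing: no more results
                break
            cursor = next_cursor

Unlike start/rows paging, the cursor does not re-walk everything before the
current page, so the cost per page stays flat no matter how deep you go.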
