You can do range queries without an upper bound and just limit the number of results. Then you look at the last result to obtain the new lower bound.
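The pattern above can be sketched as follows. This is a minimal illustration using an in-memory sorted list to stand in for Solr responses; the field name `sku` and the tiny page size are assumptions for the example, not anything from this thread.

```python
# Sketch of "no upper bound + limit" paging: re-query with the last key
# seen as the new exclusive lower bound until a page comes back empty.

def fetch_page(docs, lower_bound, rows):
    """Return up to `rows` docs with key strictly greater than lower_bound,
    in ascending key order -- what a sorted Solr range query would return."""
    return [d for d in docs if d["sku"] > lower_bound][:rows]

def export_all(docs, rows=2):
    """Walk the whole collection by repeatedly re-querying with the last
    key of the previous page as the new lower bound."""
    docs = sorted(docs, key=lambda d: d["sku"])
    lower, out = "", []
    while True:
        page = fetch_page(docs, lower, rows)
        if not page:
            break
        out.extend(page)
        lower = page[-1]["sku"]   # last result -> new lower bound
    return out

collection = [{"sku": s} for s in ["A1", "B7", "C3", "D9", "E2"]]
print([d["sku"] for d in export_all(collection)])
# -> ['A1', 'B7', 'C3', 'D9', 'E2']
```

Because each page repeats the query with a tighter filter, no deep paging (large `start` offsets) is ever needed.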

-- Jens


On 17/12/13 20:23, Petersen, Robert wrote:
My use case is basically to do a dump of all contents of the index with no 
ordering needed.  It's actually to be a product data export for third parties.  
Unique key is product sku.  I could take the min sku and range query up to the 
max sku, but the skus are not contiguous: some get turned off and only some 
are valid for export, so each range would return a different number of 
products (which may or may not be acceptable, and I might be able to hide 
that with some code).

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,

What about misusing Solr for SELECT * FROM WHERE ... style queries? I'm sure 
you've been asked about that many times.
What if the client doesn't need to rank results at all, but just wants an 
unordered filtered result set, as they are used to in an RDBMS?
Do you feel that will never be considered a reasonable use case for Solr, or 
is there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
<hossman_luc...@fucit.org>wrote:


: Then I remembered we currently don't allow deep paging in our current
: search indexes as performance declines the deeper you go.  Is this still
: the case?

Coincidentally, I'm working on a new cursor-based API to make this much
more feasible as we speak...

https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the
results last week...


http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are to eliminate the strawman code
to improve performance even more and beef up the test cases.
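For readers following along later: the work in SOLR-5463 shipped as the `cursorMark` parameter in Solr 4.7. A hedged sketch of the client-side loop it implies, with `fetch` standing in for the actual HTTP call to `/select` (the fake fetch below only simulates the cursor contract, it is not Solr):

```python
# cursorMark loop: start with cursorMark=*, echo back the nextCursorMark
# from each response, and stop when it no longer advances.

def iterate_all(fetch, rows=100):
    docs, cursor = [], "*"
    while True:
        resp = fetch({"q": "*:*", "sort": "id asc", "rows": rows,
                      "cursorMark": cursor})
        docs.extend(resp["docs"])
        if resp["nextCursorMark"] == cursor:   # cursor stopped moving: done
            break
        cursor = resp["nextCursorMark"]
    return docs

def make_fake_fetch(all_docs):
    """Stand-in for Solr that honors the cursor contract: an unchanged
    nextCursorMark signals the end of the result set."""
    def fetch(params):
        start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
        page = all_docs[start:start + params["rows"]]
        nxt = str(start + len(page)) if page else params["cursorMark"]
        return {"docs": page, "nextCursorMark": nxt}
    return fetch

print(iterate_all(make_fake_fetch(list(range(5))), rows=2))
# -> [0, 1, 2, 3, 4]
```

Note that the sort must include the uniqueKey field as a tiebreaker for cursors to work, and detecting the end costs one extra (empty) request.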

: If so, is there another approach to make all the data in a collection
: easily available for retrieval?  The only thing I can think of is to
         ...
: Then I was thinking we could have a field with an incrementing numeric
: value which could be used to perform range queries as a substitute for
: paging through everything.  Ie queries like 'IncrementalField:[1 TO
: 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
: maintain as we update the index unless we reindex the entire collection
: every time we update any docs at all.

As I mentioned in the blog above, as long as you have a uniqueKey
field that supports range queries, bulk exporting of all documents is
fairly trivial: sort on your uniqueKey field, use an fq that also
filters on your uniqueKey field, and modify the fq each time to change
the lower bound to match the highest ID you got on the previous "page".
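That fq-based export can be sketched as a small parameter builder. The field name `sku`, the `fl` list, and the page size are assumptions for illustration; the `{X TO *]` syntax is Solr's exclusive-lower/open-upper range form, so the document with the last-seen key is not fetched twice.

```python
# Build the Solr query parameters for one "page" of a uniqueKey-sorted
# bulk export, moving the exclusive lower bound forward each time.

def next_page_params(last_sku=None, rows=100):
    params = {
        "q": "*:*",
        "sort": "sku asc",      # uniqueKey field, ascending
        "rows": rows,
        "fl": "sku,name",
    }
    if last_sku is not None:
        # Exclusive lower bound: everything strictly after the last key seen.
        params["fq"] = "sku:{%s TO *]" % last_sku
    return params

print(next_page_params())
print(next_page_params(last_sku="SKU-0457"))
```

The client loops: send the params, read the last `sku` in the response, and call `next_page_params` with it until a page comes back with fewer than `rows` documents.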

This approach works really well in simple cases where you want to
"fetch all" documents matching a query and then process/sort them by
some other criteria on the client -- but it's not viable if it's
important to you that the documents come back from solr in score order
before your client gets them because you want to "stop fetching" once
some criteria is met in your client.  Example: you have billions of
documents matching a query, you want to fetch all sorted by score desc
and crunch them on your client to compute some stats, and once your
client side stat crunching tells you you have enough results (which
might be after the 1000th result, or might be after the millionth result) then 
you want to stop.

SOLR-5463 will help even in that latter case.  The bulk of the patch
should be easy to use in the next day or so (having other people try it
out and test it in their applications would be *very* helpful) and it will
hopefully show up in Solr 4.7.

-Hoss
http://www.lucidworks.com/




--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
  <mkhlud...@griddynamics.com>



