Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Jens Grivolla
You can do range queries without an upper bound and just limit the 
number of results. Then you look at the last result to obtain the new 
lower bound.
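
For example, a sketch assuming the uniqueKey is the product sku mentioned
below, with an arbitrary batch size:

   First: q=*:*&rows=500&sort=sku+asc
   Next:  q=*:*&rows=500&sort=sku+asc&fq=sku:{$LAST_SKU TO *]

...where $LAST_SKU is the sku of the last document in the previous batch;
the curly brace makes the lower bound exclusive.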


-- Jens


On 17/12/13 20:23, Petersen, Robert wrote:

My use case is basically to do a dump of all contents of the index with no 
ordering needed.  It's actually to be a product data export for third parties.  
Unique key is product sku.  I could take the min sku and range query up to the 
max sku, but the skus are not contiguous because some get turned off and only 
some are valid for export, so each range would return a different number of 
products (which may or may not be acceptable, and I might be able to kind of 
hide that with some code).

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,

What about 'SELECT * FROM ... WHERE ...'-style (mis)use of Solr? I'm sure 
you've been asked about that many times.
What if the client doesn't need results ranked somehow, but is just requesting 
an unordered filtering result, as they are used to in an RDBMS?
Do you feel it will never be considered a reasonable use case for Solr, or is 
there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:



: Then I remembered we currently don't allow deep paging in our current
: search indexes as performance declines the deeper you go.  Is this
: still the case?

Coincidentally, I'm working on a new cursor based API to make this much
more feasible as we speak...

https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the
results last week...


http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are to eliminate the strawman code
to improve performance even more and beef up the test cases.

: If so, is there another approach to make all the data in a collection
: easily available for retrieval?  The only thing I can think of is to
 ...
: Then I was thinking we could have a field with an incrementing numeric
: value which could be used to perform range queries as a substitute for
: paging through everything.  I.e. queries like 'IncrementalField:[1 TO
: 100]' 'IncrementalField:[101 TO 200]', but this would be difficult to
: maintain as we update the index unless we reindex the entire collection
: every time we update any docs at all.

As I mentioned in the blog above, as long as you have a uniqueKey
field that supports range queries, bulk exporting of all documents is
fairly trivial by sorting on your uniqueKey field and using an fq that
also filters on your uniqueKey field; modify the fq each time to change
the lower bound to match the highest ID you got on the previous page.

This approach works really well in simple cases where you want to
fetch all documents matching a query and then process/sort them by
some other criteria on the client -- but it's not viable if it's
important to you that the documents come back from solr in score order
before your client gets them because you want to stop fetching once
some criteria is met in your client.  Example: you have billions of
documents matching a query, you want to fetch all sorted by score desc
and crunch them on your client to compute some stats, and once your
client side stat crunching tells you you have enough results (which
might be after the 1000th result, or might be after the millionth result) then 
you want to stop.

SOLR-5463 will help even in the latter case.  The bulk of the patch
should be easy to use in the next day or so (having other people try out
and test in their applications would be *very* helpful) and hopefully
show up in Solr 4.7.

-Hoss
http://www.lucidworks.com/





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
  mkhlud...@griddynamics.com







Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Mikhail Khludnev
Aha! SOLR-5244 is a particular case of what I'm asking about. I wonder who
else considers it useful?
(I'm sorry if I hijacked the thread)
On 18.12.2013 at 5:41, Joel Bernstein joels...@gmail.com wrote:

 They are for different use cases. Hoss's approach, I believe, focuses on
 deep paging of ranked search results. SOLR-5244 focuses on the batch export
 of an entire unranked search result in binary format. It's basically a very
 efficient bulk extract for Solr.


 On Tue, Dec 17, 2013 at 6:51 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Joel - can you please elaborate a bit on how this compares with Hoss'
  approach?  Complementary?
 
  Thanks,
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com
  wrote:
 
   SOLR-5244 is also working in this direction. This focuses on efficient
   binary extract of entire search results.
  
  
   On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic 
   otis.gospodne...@gmail.com wrote:
  
Hoss is working on it. Search for deep paging or cursor in JIRA.
   
Otis
 Solr & ElasticSearch Support
http://sematext.com/
On Dec 17, 2013 12:30 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
   
 Hi solr users,

 We have a new use case where we need to make a pile of data available
 as XML to a client, and I was thinking we could easily put all this
 data into a solr collection and the client could just do a star search
 and page through all the results to obtain the data we need to give
 them.  Then I remembered we currently don't allow deep paging in our
 current search indexes as performance declines the deeper you go.  Is
 this still the case?

 If so, is there another approach to make all the data in a collection
 easily available for retrieval?  The only thing I can think of is to
 query our DB for all the unique IDs of all the documents in the
 collection and then pull the documents out in small groups with
 successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)'
 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a
 very good approach because the DB might have been updated with new
 data which hasn't been indexed yet, and so all the ids might not be in
 there (which may or may not matter, I suppose).

 Then I was thinking we could have a field with an incrementing numeric
 value which could be used to perform range queries as a substitute for
 paging through everything.  I.e. queries like 'IncrementalField:[1 TO
 100]' 'IncrementalField:[101 TO 200]', but this would be difficult to
 maintain as we update the index unless we reindex the entire
 collection every time we update any docs at all.

 Is this perhaps not a good use case for solr?  Should I use something
 else, or is there another approach that would work here to allow a
 client to pull groups of docs in a collection through the rest api
 until the client has gotten them all?

 Thanks
 Robi


   
  
  
  
   --
   Joel Bernstein
   Search Engineer at Heliosearch
  
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch



Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Chris Hostetter
: 
: What about 'SELECT * FROM ... WHERE ...'-style (mis)use of Solr? I'm sure
: you've been asked about that many times.
: What if the client doesn't need results ranked somehow, but is just
: requesting an unordered filtering result, as they are used to in an RDBMS?
: Do you feel it will never be considered a reasonable use case for Solr, or
: is there a well-known approach for dealing with it?

If you don't care about ordering, then the approach I described (either 
using SOLR-5463, or just using a sort by uniqueKey with increasing 
range filters on the id) should work fine -- the fact that they come back 
sorted by id is just an implementation detail that makes it possible to 
batch the records (the same way most SQL databases will likely give you 
back the docs based on whatever primary key index you have).

I think the key difference between approaches like SOLR-5244 vs the cursor 
work in SOLR-5463 is that SOLR-5244 is really targeted at dumping all 
data about all docs from a core (matching the query) in a single 
request/response -- for something like SolrCloud, the client would 
manually need to hit each shard (but as I understand it from the 
description, that's kind of the point; it's aiming to be a very low level 
bulk export).  With the cursor approach in SOLR-5463, we do 
aggregation across all shards, and we support arbitrary sorts, and you can 
control the batch size from the client and iterate over multiple 
request/responses of that size.  If there are any network hiccups, you can 
re-do a request.  If you process half the docs that match (in a 
particular order) and then decide I've got all the docs I need for my 
purposes, you can stop requesting the continuation of that cursor.



-Hoss
http://www.lucidworks.com/


Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Chris Hostetter

: You can do range queries without an upper bound and just limit the number of
: results. Then you look at the last result to obtain the new lower bound.

Exactly.  Instead of this:

   First: q=foo&start=0&rows=$ROWS
   After: q=foo&start=$X&rows=$ROWS

...where $ROWS is how big a batch of docs you can handle at one time, 
and you increase the value of $X by the value of $ROWS on each successive 
request, you can just do this...

   First: q=foo&start=0&rows=$ROWS&sort=id+asc
   After: q=foo&start=0&rows=$ROWS&sort=id+asc&fq=id:{$X TO *]

...where $X is whatever the last id you got on the previous page.

Or: you try out the patch in SOLR-5463 and do something like this...

   First: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=*
   After: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=$X

...where $X is whatever nextCursorMark you got from the previous page.
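
...and, for the record, a minimal SolrJ sketch of that cursor loop
(assuming the cursorMark parameter names from the SOLR-5463 patch and the
SolrJ 4.7+ client API; the URL, query, and batch size are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.params.CursorMarkParams;

  public class CursorDump {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/collection1");
      SolrQuery q = new SolrQuery("foo");      // q=foo
      q.setRows(500);                          // rows=$ROWS
      q.setSort("id", SolrQuery.ORDER.asc);    // sort must include the uniqueKey
      String cursorMark = CursorMarkParams.CURSOR_MARK_START;  // "*" at first
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          // process each doc here
        }
        String next = rsp.getNextCursorMark();
        if (cursorMark.equals(next)) {
          break;                               // same mark twice means done
        }
        cursorMark = next;
      }
    }
  }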



-Hoss
http://www.lucidworks.com/


Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Michael Della Bitta
Us too. That's going to be huge for us!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

The Science of Influence Marketing

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Wed, Dec 18, 2013 at 9:55 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Aha! SOLR-5244 is a particular case of what I'm asking about. I wonder who
 else considers it useful?
 (I'm sorry if I hijacked the thread)
 On 18.12.2013 at 5:41, Joel Bernstein joels...@gmail.com wrote:

  They are for different use cases. Hoss's approach, I believe, focuses on
  deep paging of ranked search results. SOLR-5244 focuses on the batch export
  of an entire unranked search result in binary format. It's basically a very
  efficient bulk extract for Solr.
 
 
  On Tue, Dec 17, 2013 at 6:51 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
   Joel - can you please elaborate a bit on how this compares with Hoss'
   approach?  Complementary?
  
   Thanks,
   Otis
   --
   Performance Monitoring * Log Analytics * Search Analytics
   Solr & Elasticsearch Support * http://sematext.com/
  
  
   On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com
   wrote:
  
    SOLR-5244 is also working in this direction. This focuses on efficient
    binary extract of entire search results.
   
   
On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:
   
 Hoss is working on it. Search for deep paging or cursor in JIRA.

 Otis
  Solr & ElasticSearch Support
 http://sematext.com/
 On Dec 17, 2013 12:30 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:

   Hi solr users,

   We have a new use case where we need to make a pile of data available
   as XML to a client, and I was thinking we could easily put all this
   data into a solr collection and the client could just do a star search
   and page through all the results to obtain the data we need to give
   them.  Then I remembered we currently don't allow deep paging in our
   current search indexes as performance declines the deeper you go.  Is
   this still the case?

   If so, is there another approach to make all the data in a collection
   easily available for retrieval?  The only thing I can think of is to
   query our DB for all the unique IDs of all the documents in the
   collection and then pull the documents out in small groups with
   successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)'
   'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a
   very good approach because the DB might have been updated with new
   data which hasn't been indexed yet, and so all the ids might not be in
   there (which may or may not matter, I suppose).

   Then I was thinking we could have a field with an incrementing numeric
   value which could be used to perform range queries as a substitute for
   paging through everything.  I.e. queries like 'IncrementalField:[1 TO
   100]' 'IncrementalField:[101 TO 200]', but this would be difficult to
   maintain as we update the index unless we reindex the entire
   collection every time we update any docs at all.

   Is this perhaps not a good use case for solr?  Should I use something
   else, or is there another approach that would work here to allow a
   client to pull groups of docs in a collection through the rest api
   until the client has gotten them all?

   Thanks
   Robi
 
 

   
   
   
--
Joel Bernstein
Search Engineer at Heliosearch
   
  
 
 
 
  --
  Joel Bernstein
  Search Engineer at Heliosearch
 



Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Jonathan Rochkind

On 12/17/13 1:16 PM, Chris Hostetter wrote:

As I mentioned in the blog above, as long as you have a uniqueKey field
that supports range queries, bulk exporting of all documents is fairly
trivial by sorting on your uniqueKey field and using an fq that also
filters on your uniqueKey field; modify the fq each time to change the
lower bound to match the highest ID you got on the previous page.


Aha, very nice suggestion; I hadn't thought of this when trying, myself, 
to figure out decent ways to 'fetch all documents matching a query' for 
some bulk offline processing.


One question that I was never sure about when trying to do things like 
this -- is this going to end up blowing the query and/or document caches 
if used on a live Solr?  By filling up those caches with the results of 
the 'bulk' export?  If so, is there any way to avoid that? Or does it 
probably not really matter?


Jonathan


Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Chris Hostetter

: One question that I was never sure about when trying to do things like this --
: is this going to end up blowing the query and/or document caches if used on a
: live Solr?  By filling up those caches with the results of the 'bulk' export?
: If so, is there any way to avoid that? Or does it probably not really matter?

  q={!cache=false}...
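
...i.e. add the cache=false local param to the bulk query (and, if you are
using the id-range fq trick, to that fq as well).  A sketch:

   q={!cache=false}*:*&rows=500&sort=id+asc&fq={!cache=false}id:{$X TO *]

That should keep a one-off export from churning the queryResultCache and
filterCache on a live Solr.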


-Hoss
http://www.lucidworks.com/


Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Mikhail Khludnev
On Wed, Dec 18, 2013 at 8:03 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :
 : What about 'SELECT * FROM ... WHERE ...'-style (mis)use of Solr? I'm
 : sure you've been asked about that many times.
 : What if the client doesn't need results ranked somehow, but is just
 : requesting an unordered filtering result, as they are used to in an
 : RDBMS?
 : Do you feel it will never be considered a reasonable use case for
 : Solr, or is there a well-known approach for dealing with it?

 If you don't care about ordering, then the approach I described (either
 using SOLR-5463, or just using a sort by uniqueKey with increasing
 range filters on the id) should work fine -- the fact that they come back
 sorted by id is just an implementation detail that makes it possible to
 batch the records

From the functional standpoint that's true, but performance might matter in
those side cases; e.g. I wonder why the priority queue is needed even if we
request sort=_docid_.

 (the same way most SQL databases will likely give you
 back the docs based on whatever primary key index you have)

 I think the key difference between approaches like SOLR-5244 vs the cursor
 work in SOLR-5463 is that SOLR-5244 is really targeted at dumping all
 data about all docs from a core (matching the query) in a single
 request/response -- for something like SolrCloud, the client would
 manually need to hit each shard (but as I understand it from the
 description, that's kind of the point; it's aiming to be a very low level
 bulk export).  With the cursor approach in SOLR-5463, we do
 aggregation across all shards, and we support arbitrary sorts, and you can
 control the batch size from the client and iterate over multiple
 request/responses of that size.  If there are any network hiccups, you can
 re-do a request.  If you process half the docs that match (in a
 particular order) and then decide I've got all the docs I need for my
 purposes, you can stop requesting the continuation of that cursor.



 -Hoss
 http://www.lucidworks.com/




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Petersen, Robert
Hi solr users,

We have a new use case where we need to make a pile of data available as XML to a 
client and I was thinking we could easily put all this data into a solr 
collection and the client could just do a star search and page through all the 
results to obtain the data we need to give them.  Then I remembered we 
currently don't allow deep paging in our current search indexes as performance 
declines the deeper you go.  Is this still the case?

If so, is there another approach to make all the data in a collection easily 
available for retrieval?  The only thing I can think of is to query our DB for 
all the unique IDs of all the documents in the collection and then pull the 
documents out in small groups with successive queries like 'UniqueIdField:(id1 
OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)' which 
doesn't seem like a very good approach because the DB might have been updated 
with new data which hasn't been indexed yet and so all the ids might not be in 
there (which may or may not matter I suppose).

Then I was thinking we could have a field with an incrementing numeric value 
which could be used to perform range queries as a substitute for paging through 
everything.  I.e. queries like 'IncrementalField:[1 TO 100]' 
'IncrementalField:[101 TO 200]' but this would be difficult to maintain as we 
update the index unless we reindex the entire collection every time we update 
any docs at all.

Is this perhaps not a good use case for solr?  Should I use something else or 
is there another approach that would work here to allow a client to pull groups 
of docs in a collection through the rest api until the client has gotten them 
all?

Thanks
Robi



Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Chris Hostetter

: Then I remembered we currently don't allow deep paging in our current 
: search indexes as performance declines the deeper you go.  Is this still 
: the case?

Coincidentally, I'm working on a new cursor based API to make this much more 
feasible as we speak...

https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the 
results last week...

http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are to eliminate the 
strawman code to improve performance even more and beef up the test 
cases.

: If so, is there another approach to make all the data in a collection 
: easily available for retrieval?  The only thing I can think of is to 
...
: Then I was thinking we could have a field with an incrementing numeric 
: value which could be used to perform range queries as a substitute for 
: paging through everything.  I.e. queries like 'IncrementalField:[1 TO 
: 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to 
: maintain as we update the index unless we reindex the entire collection 
: every time we update any docs at all.

As I mentioned in the blog above, as long as you have a uniqueKey field 
that supports range queries, bulk exporting of all documents is fairly 
trivial by sorting on your uniqueKey field and using an fq that also 
filters on your uniqueKey field; modify the fq each time to change the 
lower bound to match the highest ID you got on the previous page.
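
For example, here's a minimal SolrJ sketch of that loop (a sketch only: it 
assumes SolrJ 4.x, a string uniqueKey field named "id", and placeholder URL 
and batch size; ids containing reserved query characters would need escaping):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  public class BulkExport {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/collection1");
      String lastId = null;
      while (true) {
        SolrQuery q = new SolrQuery("*:*");    // or any query you want to export
        q.setRows(500);                        // batch size
        q.setSort("id", SolrQuery.ORDER.asc);  // sort on the uniqueKey
        if (lastId != null) {
          // exclusive lower bound: only ids after the last one we saw
          q.setFilterQueries("id:{" + lastId + " TO *]");
        }
        SolrDocumentList page = solr.query(q).getResults();
        if (page.isEmpty()) {
          break;                               // nothing left to export
        }
        for (SolrDocument doc : page) {
          // process/write each doc here
        }
        lastId = (String) page.get(page.size() - 1).getFieldValue("id");
      }
    }
  }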

This approach works really well in simple cases where you want to fetch 
all documents matching a query and then process/sort them by some other 
criteria on the client -- but it's not viable if it's important to you 
that the documents come back from solr in score order before your client 
gets them because you want to stop fetching once some criteria is met in 
your client.  Example: you have billions of documents matching a query, 
you want to fetch all sorted by score desc and crunch them on your client 
to compute some stats, and once your client side stat crunching tells you 
you have enough results (which might be after the 1000th result, or might 
be after the millionth result) then you want to stop.

SOLR-5463 will help even in the latter case.  The bulk of the patch should 
be easy to use in the next day or so (having other people try out and 
test in their applications would be *very* helpful) and hopefully show up 
in Solr 4.7.

-Hoss
http://www.lucidworks.com/


Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Otis Gospodnetic
Hoss is working on it. Search for deep paging or cursor in JIRA.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Dec 17, 2013 12:30 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:

 Hi solr users,

 We have a new use case where we need to make a pile of data available as XML
 to a client and I was thinking we could easily put all this data into a
 solr collection and the client could just do a star search and page through
 all the results to obtain the data we need to give them.  Then I remembered
 we currently don't allow deep paging in our current search indexes as
 performance declines the deeper you go.  Is this still the case?

 If so, is there another approach to make all the data in a collection
 easily available for retrieval?  The only thing I can think of is to query
 our DB for all the unique IDs of all the documents in the collection and
 then pull the documents out in small groups with successive queries
 like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR
 idn+2 OR ... etc)' which doesn't seem like a very good approach because the
 DB might have been updated with new data which hasn't been indexed yet and
 so all the ids might not be in there (which may or may not matter I
 suppose).

 Then I was thinking we could have a field with an incrementing numeric
 value which could be used to perform range queries as a substitute for
 paging through everything.  I.e. queries like 'IncrementalField:[1 TO 100]'
 'IncrementalField:[101 TO 200]' but this would be difficult to maintain as
 we update the index unless we reindex the entire collection every time we
 update any docs at all.

 Is this perhaps not a good use case for solr?  Should I use something else
 or is there another approach that would work here to allow a client to pull
 groups of docs in a collection through the rest api until the client has
 gotten them all?

 Thanks
 Robi




RE: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Petersen, Robert
My use case is basically to do a dump of all contents of the index with no 
ordering needed.  It's actually to be a product data export for third parties.  
Unique key is product sku.  I could take the min sku and range query up to the 
max sku, but the skus are not contiguous because some get turned off and only 
some are valid for export, so each range would return a different number of 
products (which may or may not be acceptable, and I might be able to kind of 
hide that with some code).

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,

What about 'SELECT * FROM ... WHERE ...'-style (mis)use of Solr? I'm sure 
you've been asked about that many times.
What if the client doesn't need results ranked somehow, but is just requesting 
an unordered filtering result, as they are used to in an RDBMS?
Do you feel it will never be considered a reasonable use case for Solr, or is 
there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Then I remembered we currently don't allow deep paging in our current
 : search indexes as performance declines the deeper you go.  Is this
 : still the case?

 Coincidentally, I'm working on a new cursor based API to make this much 
 more feasible as we speak...

 https://issues.apache.org/jira/browse/SOLR-5463

 I did some simple perf testing of the strawman approach and posted the 
 results last week...


 http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

 ...current iterations on the patch are to eliminate the strawman code 
 to improve performance even more and beef up the test cases.

 : If so, is there another approach to make all the data in a collection
 : easily available for retrieval?  The only thing I can think of is to
  ...
 : Then I was thinking we could have a field with an incrementing numeric
 : value which could be used to perform range queries as a substitute for
 : paging through everything.  I.e. queries like 'IncrementalField:[1 TO
 : 100]' 'IncrementalField:[101 TO 200]', but this would be difficult to
 : maintain as we update the index unless we reindex the entire collection
 : every time we update any docs at all.

 As I mentioned in the blog above, as long as you have a uniqueKey 
 field that supports range queries, bulk exporting of all documents is 
 fairly trivial by sorting on your uniqueKey field and using an fq that 
 also filters on your uniqueKey field; modify the fq each time to change 
 the lower bound to match the highest ID you got on the previous page.

 This approach works really well in simple cases where you want to 
 fetch all documents matching a query and then process/sort them by 
 some other criteria on the client -- but it's not viable if it's 
 important to you that the documents come back from solr in score order 
 before your client gets them because you want to stop fetching once 
 some criteria is met in your client.  Example: you have billions of 
 documents matching a query, you want to fetch all sorted by score desc 
 and crunch them on your client to compute some stats, and once your 
 client side stat crunching tells you you have enough results (which 
 might be after the 1000th result, or might be after the millionth result) 
 then you want to stop.

 SOLR-5463 will help even in the latter case.  The bulk of the patch 
 should be easy to use in the next day or so (having other people try out 
 and test in their applications would be *very* helpful) and hopefully 
 show up in Solr 4.7.

 -Hoss
 http://www.lucidworks.com/




--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Mikhail Khludnev
Hoss,

What about 'SELECT * FROM ... WHERE ...'-style (mis)use of Solr? I'm sure
you've been asked about that many times.
What if the client doesn't need results ranked somehow, but is just
requesting an unordered filtering result, as they are used to in an RDBMS?
Do you feel it will never be considered a reasonable use case for Solr, or
is there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Then I remembered we currently don't allow deep paging in our current
 : search indexes as performance declines the deeper you go.  Is this still
 : the case?

 Coincidentally, I'm working on a new cursor based API to make this much more
 feasible as we speak...

 https://issues.apache.org/jira/browse/SOLR-5463

 I did some simple perf testing of the strawman approach and posted the
 results last week...


 http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

 ...current iterations on the patch are to eliminate the
 strawman code to improve performance even more and beef up the test
 cases.

 : If so, is there another approach to make all the data in a collection
 : easily available for retrieval?  The only thing I can think of is to
 ...
 : Then I was thinking we could have a field with an incrementing numeric
 : value which could be used to perform range queries as a substitute for
 : paging through everything.  I.e. queries like 'IncrementalField:[1 TO
 : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
 : maintain as we update the index unless we reindex the entire collection
 : every time we update any docs at all.

 As I mentioned in the blog above, as long as you have a uniqueKey field
 that supports range queries, bulk exporting of all documents is fairly
 trivial by sorting on your uniqueKey field and using an fq that also
 filters on your uniqueKey field; modify the fq each time to change the
 lower bound to match the highest ID you got on the previous page.

 This approach works really well in simple cases where you want to fetch
 all documents matching a query and then process/sort them by some other
 criteria on the client -- but it's not viable if it's important to you
 that the documents come back from solr in score order before your client
 gets them because you want to stop fetching once some criteria is met in
 your client.  Example: you have billions of documents matching a query,
 you want to fetch all sorted by score desc and crunch them on your client
 to compute some stats, and once your client side stat crunching tells you
 you have enough results (which might be after the 1000th result, or might
 be after the millionth result) then you want to stop.

 SOLR-5463 will help even in the latter case.  The bulk of the patch should
 be easy to use in the next day or so (having other people try out and
 test in their applications would be *very* helpful) and hopefully show up
 in Solr 4.7.

 -Hoss
 http://www.lucidworks.com/




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Joel Bernstein
SOLR-5244 is also working in this direction. This focuses on efficient
binary extract of entire search results.


On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hoss is working on it. Search for deep paging or cursor in JIRA.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Dec 17, 2013 12:30 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:

   Hi solr users,

   We have a new use case where we need to make a pile of data available
   as XML to a client, and I was thinking we could easily put all this
   data into a solr collection and the client could just do a star search
   and page through all the results to obtain the data we need to give
   them.  Then I remembered we currently don't allow deep paging in our
   current search indexes as performance declines the deeper you go.  Is
   this still the case?

   If so, is there another approach to make all the data in a collection
   easily available for retrieval?  The only thing I can think of is to
   query our DB for all the unique IDs of all the documents in the
   collection and then pull the documents out in small groups with
   successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)'
   'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a
   very good approach because the DB might have been updated with new
   data which hasn't been indexed yet, and so all the ids might not be in
   there (which may or may not matter, I suppose).

   Then I was thinking we could have a field with an incrementing numeric
   value which could be used to perform range queries as a substitute for
   paging through everything.  I.e. queries like 'IncrementalField:[1 TO
   100]' 'IncrementalField:[101 TO 200]', but this would be difficult to
   maintain as we update the index unless we reindex the entire
   collection every time we update any docs at all.

   Is this perhaps not a good use case for solr?  Should I use something
   else, or is there another approach that would work here to allow a
   client to pull groups of docs in a collection through the rest api
   until the client has gotten them all?

   Thanks
   Robi
 
 




-- 
Joel Bernstein
Search Engineer at Heliosearch


Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Otis Gospodnetic
Joel - can you please elaborate a bit on how this compares with Hoss'
approach?  Complementary?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com wrote:

 SOLR-5244 is also working in this direction. This focuses on efficient
 binary extract of entire search results.


 On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Hoss is working on it. Search for deep paging or cursor in JIRA.
 
  Otis
  Solr & ElasticSearch Support
  http://sematext.com/
  On Dec 17, 2013 12:30 PM, Petersen, Robert 
  robert.peter...@mail.rakuten.com wrote:
 
    Hi solr users,

    We have a new use case where we need to make a pile of data available
    as XML to a client, and I was thinking we could easily put all this
    data into a solr collection and the client could just do a star search
    and page through all the results to obtain the data we need to give
    them.  Then I remembered we currently don't allow deep paging in our
    current search indexes as performance declines the deeper you go.  Is
    this still the case?

    If so, is there another approach to make all the data in a collection
    easily available for retrieval?  The only thing I can think of is to
    query our DB for all the unique IDs of all the documents in the
    collection and then pull the documents out in small groups with
    successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)'
    'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a
    very good approach because the DB might have been updated with new
    data which hasn't been indexed yet, and so all the ids might not be in
    there (which may or may not matter, I suppose).

    Then I was thinking we could have a field with an incrementing numeric
    value which could be used to perform range queries as a substitute for
    paging through everything.  I.e. queries like 'IncrementalField:[1 TO
    100]' 'IncrementalField:[101 TO 200]', but this would be difficult to
    maintain as we update the index unless we reindex the entire
    collection every time we update any docs at all.

    Is this perhaps not a good use case for solr?  Should I use something
    else, or is there another approach that would work here to allow a
    client to pull groups of docs in a collection through the rest api
    until the client has gotten them all?

    Thanks
    Robi
  
  
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch



Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Joel Bernstein
They are for different use cases. Hoss's approach, I believe, focuses on
deep paging of ranked search results. SOLR-5244 focuses on the batch export
of an entire unranked search result in binary format. It's basically a very
efficient bulk extract for Solr.


On Tue, Dec 17, 2013 at 6:51 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Joel - can you please elaborate a bit on how this compares with Hoss'
 approach?  Complementary?

 Thanks,
 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/


 On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein joels...@gmail.com
 wrote:

  SOLR-5244 is also working in this direction. This focuses on efficient
  binary extract of entire search results.
 
 
  On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
   Hoss is working on it. Search for deep paging or cursor in JIRA.
  
   Otis
    Solr & ElasticSearch Support
   http://sematext.com/
   On Dec 17, 2013 12:30 PM, Petersen, Robert 
   robert.peter...@mail.rakuten.com wrote:
  
     Hi solr users,

     We have a new use case where we need to make a pile of data available
     as XML to a client, and I was thinking we could easily put all this
     data into a solr collection and the client could just do a star search
     and page through all the results to obtain the data we need to give
     them.  Then I remembered we currently don't allow deep paging in our
     current search indexes as performance declines the deeper you go.  Is
     this still the case?

     If so, is there another approach to make all the data in a collection
     easily available for retrieval?  The only thing I can think of is to
     query our DB for all the unique IDs of all the documents in the
     collection and then pull the documents out in small groups with
     successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)'
     'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a
     very good approach because the DB might have been updated with new
     data which hasn't been indexed yet, and so all the ids might not be in
     there (which may or may not matter, I suppose).

     Then I was thinking we could have a field with an incrementing numeric
     value which could be used to perform range queries as a substitute for
     paging through everything.  I.e. queries like 'IncrementalField:[1 TO
     100]' 'IncrementalField:[101 TO 200]', but this would be difficult to
     maintain as we update the index unless we reindex the entire
     collection every time we update any docs at all.

     Is this perhaps not a good use case for solr?  Should I use something
     else, or is there another approach that would work here to allow a
     client to pull groups of docs in a collection through the rest api
     until the client has gotten them all?

     Thanks
     Robi
   
   
  
 
 
 
  --
  Joel Bernstein
  Search Engineer at Heliosearch
 




-- 
Joel Bernstein
Search Engineer at Heliosearch