Using cursorMark with '_yz_rk'

2016-09-21 Thread Vipin Sharma
Hi all,

In our system we have a default implementation of querying data from Riak 
using regular “pagination”.
For some queries with a huge number of resulting records (on the order of 
10,000+), this is becoming an issue, so we wanted to change it to use 
“cursorMark” as suggested here: 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results

While using cursorMark:

-  It requires a uniqueKey field in the sort. We didn’t have a unique 
key of our own, so we wanted to use “_yz_rk”, but it gives the error below.

-  The query is accepted when the sort parameter is changed to use “_yz_id” 
instead, but it returns redundant / duplicate records. This is probably a known 
issue, as mentioned 
here ( 
Pagination Warning). The recommended solution is to use { _yz_rt asc, _yz_rb asc, 
_yz_rk asc } instead, but for each of these the query returns the following 
error:

"error":{"msg":"Cursor functionality requires a sort containing 
a uniqueKey field tie breaker","code":400}

Can somebody please share some suggestions on this?
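For reference, a minimal sketch of how such a cursorMark request could be built against Riak's HTTP search endpoint. The host, port, and index name here are assumptions for illustration, not taken from the thread:

```python
from urllib.parse import urlencode

def build_cursor_query(index, query, cursor="*", rows=100,
                       host="http://localhost:8098"):
    # cursorMark requires the sort to include the index's uniqueKey.
    # In Yokozuna's default schema the uniqueKey is _yz_id (not _yz_rk),
    # which is why sorting on _yz_rk alone triggers the 400 error
    # quoted above.
    params = urlencode({
        "q": query,
        "rows": rows,
        "sort": "_yz_id asc",
        "cursorMark": cursor,  # "*" means "start from the beginning"
        "wt": "json",
    })
    return "{}/search/query/{}?{}".format(host, index, params)
```

Each response carries a nextCursorMark, which is passed back as the cursor parameter of the next request.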

Thanks
Vipin




___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Using cursorMark with '_yz_rk'

2016-09-21 Thread Fred Dushin
Okay, I probably spoke too soon.

While Solr 4.7 supports cursor marks, we do have an issue in Riak (or Yokozuna) 
whereby it is actually impractical to use cursor marks for queries.  The problem 
is that while Yokozuna uses coverage plans to generate a filter query that 
guarantees we get no replicas in a result set, these coverage plans change 
every few seconds, in order to ensure we are not constantly querying a subset 
of the cluster (thus possibly creating hot zones in the cluster, especially for 
query-heavy workloads).

Theoretically you could change the interval at which these coverage plans are 
updated (by setting the yokozuna cover_tick configuration setting in 
advanced.config [1]), which would be okay in a development or test environment, 
but unsuitable in production.

The solution is to pin a query to a coverage plan, so that subsequent 
iterations of the query with the next cursor will use the same filter, and 
hence will give you proper result sets.  We do not currently have this 
implemented in Yokozuna.

-Fred

[1] https://github.com/basho/yokozuna/blob/2.0.4/src/yz_cover.erl#L285
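Until query pinning exists, one possible client-side mitigation (my own suggestion, not an official fix from the thread) is to de-duplicate the merged pages by _yz_rk, accepting that a coverage-plan rotation mid-iteration can still cause misses that this cannot recover:

```python
def dedupe_pages(pages):
    """Merge pages of Solr docs, dropping duplicates by _yz_rk.

    Because the coverage plan can rotate between cursor fetches, the
    same Riak object may appear in more than one page; keying on
    _yz_rk keeps the first copy seen.  Note this can only drop
    repeats, not restore objects a plan change caused to be skipped.
    """
    seen = set()
    out = []
    for page in pages:
        for doc in page:
            key = doc["_yz_rk"]
            if key not in seen:
                seen.add(key)
                out.append(doc)
    return out
```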



Re: Using cursorMark with '_yz_rk'

2016-09-21 Thread Guillaume Boddaert
I'm very curious about your cursorMark implementation; I'm in deep need of 
that feature.


In my experience I wasn't even able to trigger such a query with my Riak 
version, as it was not yet supported by the Solr version bundled with it. But I 
might have missed something.


I'm using 2.1.2.

Guillaume



Re: Solr search performance

2016-09-21 Thread sean mcevoy
Hi Fred,

Thanks for the pointer! 'cursorMark' is a lot more performant alright,
though apparently it doesn't suit our use case.

I've written a loop function using OTP's httpc that reads each page, takes the
nextCursorMark and repeats, and it returns all 147 pages with consistent
times in the 40-60ms bracket, which is an excellent improvement!
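Sean's loop is in Erlang (OTP's httpc) and isn't shown; purely as an illustration, the same cursor-following pattern can be sketched in Python, with the HTTP call abstracted behind an injected `fetch` callable (an assumption made so the loop is self-contained):

```python
def iterate_cursor(fetch, rows=100):
    """Follow Solr cursorMarks until the cursor stops advancing.

    `fetch(cursor, rows)` is expected to return a dict shaped like a
    Solr JSON response:
        {"response": {"docs": [...]}, "nextCursorMark": "..."}
    """
    cursor = "*"
    while True:
        resp = fetch(cursor, rows)
        yield from resp["response"]["docs"]
        next_cursor = resp["nextCursorMark"]
        # Solr signals the end of the result set by returning the
        # same cursorMark that was sent in.
        if next_cursor == cursor:
            break
        cursor = next_cursor
```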

I would have been asking about the effort involved in making the protocol
buffers client support this, but instead our GUI guys insist that they need
to request a page number as sometimes they want to start in the middle of a
set of data.

So I'm almost back to square one.
Can you shed any light on the internal workings of SOLR that produce the
slow-down in my original question?
I'm hoping I can find a way to restructure my index data without having to
change the higher-level API's that I support.

Cheers,
//Sean.


On Mon, Sep 19, 2016 at 10:00 PM, Fred Dushin  wrote:

> All great questions, Sean.
>
> A few things.  First off, for result sets that are that large, you are
> probably going to want to use Solr cursor marks [1], which are supported in
> the current version of Solr we ship.  Riak allows queries using cursor
> marks through the HTTP interface.  At present, it does not support cursors
> using the protobuf API, due to some internal limitations of the server-side
> protobuf library, but we do hope to fix that in the future.
>
> Secondly, we have found sorting with distributed queries to be far more
> performant using Solr 4.10.4.  Currently released versions of Riak use Solr
> 4.7, but as you can see on github [2], Solr 4.10.4 support has been merged
> into the develop-2.2 branch, and is in the pipeline for release.  I can't
> say when the next version of Riak is that will ship with this version
> because of indeterminacy around bug triage, but it should not be too long.
>
> I would start to look at using cursor marks and measure their relative
> performance in your scenario.  My guess is that you should see some
> improvement there.
>
> -Fred
>
> [1] https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> [2] https://github.com/basho/yokozuna/commit/f64e19cef107d982082f5b95ed598da96fb419b0
>
>
> > On Sep 19, 2016, at 4:48 PM, sean mcevoy  wrote:
> >
> > Hi All,
> >
> > We have an index with ~548,000 entries, ~14,000 of which match one of
> our queries.
> > We read these in a paginated search and the first page (of 100 hits)
> returns quickly in ~70ms.
> > This response time seems to increase exponentially as we walk through
> the pages:
> > the 4th page takes ~200ms,
> > the 8th page takes ~1200ms
> > the 12th page takes ~2100ms
> > the 16th page takes ~6100ms
> > the 20th page takes ~24000ms
> >
> > And by the time we're searching for the 22nd page it regularly times out
> at the default 60 seconds.
> >
> > I have a good understanding of Riak KV internals but absolutely nothing
> of Lucene, which I think is what's most relevant here. If anyone in the know
> can point me towards any relevant resource or can explain what's happening
> I'd be much obliged :-)
> > As I would also be if anyone with experience of using Riak/Lucene can
> tell me:
> > - Is 500K a crazy number of entries to put into one index?
> > - Is 14K a crazy number of entries to expect to be returned?
> > - Are there any methods we can use to make the search time more constant
> across the full search?
> > I read one blog post on inlining but it was a bit old & not very obvious
> how to implement using riakc_pb_socket calls.
> >
> > And out of curiosity, do we not traverse the full range of hits for each
> page? I naively thought that because I'm sorting the returned values we'd
> have to get them all first and then sort, but the response times suggests
> otherwise. Does Lucene store the data sorted by each field just in case a
> query asks for it? Or what other magic is going on?
> >
> >
> > For the technical details, we use the "_yz_default" schema and all the
> fields stored are strings:
> > - entry_id_s: unique within the DB, the aim of the query is to gather a
> list of these
> > - type_s: has one of 2 values
> > - sub_category_id_s: in the query described above all 14K hits will
> match on this, in the DB of ~500K entries there are ~43K different values
> for this field, with each category typically having 2-6 sub categories
> > - category_id_s: not matched in this query, in the DB of ~500K entries
> there are ~13K different values for this field
> > - status_s: has one of 2 values, in the query described above all hits
> will have the value "active"
> > - user_id_s: unique within the DB but not matched in this query
> > - first_name_s: almost unique within the DB, this query will sort by
> this field
> > - last_name_s: almost unique within the DB, this query will sort by this
> field
> >
> > This search query looks like:
> > <<"sub_category_id_s:test_1 AND status_s:active AND
> type_s:sub_category">>
> >
> > Our options