Re: Efficient model for a sorting

Dorian Hoxha Tue, 04 Oct 2016 06:06:40 -0700

On lucene you can query+filter+sort on a single shard, so it should be
better than MV/sasi. The index building is a little async though.


On Tue, Oct 4, 2016 at 2:29 PM, Benjamin Roth <benjamin.r...@jaumo.com>
wrote:

> Thanks guys!
>
> Good to know, that my approach is basically right, but I will check that
> lucene indices by time.
>
> 2016-10-04 14:22 GMT+02:00 DuyHai Doan <doanduy...@gmail.com>:
>
>> "What scatter/gather? "
>>
>> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-
>> full-text-search-ride-voxxed-daybelgrade-2016/23
>>
>> "If you partition your data by user_id then you query only 1 shard to
>> get sorted by time visitors for a user"
>>
>> Exact, but in this case, you're using a 2nd index only for sorting right
>> ? For SASI it's not even possible. Maybe it can work with Statrio Lucene
>> impl
>>
>> On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha <dorian.ho...@gmail.com>
>> wrote:
>>
>>> @DuyHai
>>>
>>> What scatter/gather? If you partition your data by user_id then you
>>> query only 1 shard to get sorted by time visitors for a user.
>>>
>>> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan <doanduy...@gmail.com>
>>> wrote:
>>>
>>>> MV is right now your best choice for this kind of sorting behavior.
>>>>
>>>> Secondary index (whatever the impl, SASI or Lucene) has a cost of
>>>> scatter-gather if your cluster scale out. With MV you're at least
>>>> guaranteed to hit a single node everytime
>>>>
>>>> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha <dorian.ho...@gmail.com>
>>>> wrote:
>>>>
>>>>> Can you use the lucene index https://github.com/Stratio/cas
>>>>> sandra-lucene-index ?
>>>>>
>>>>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth <benjamin.r...@jaumo.com
>>>>> > wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I have a frequently used pattern which seems to be quite costly in
>>>>>> CS. The pattern is always the same: I have a unique key and a sorting by 
>>>>>> a
>>>>>> different field.
>>>>>>
>>>>>> To give an example, here a real life example from our model:
>>>>>> CREATE TABLE visits.visits_in (
>>>>>>     user_id int,
>>>>>>     user_id_visitor int,
>>>>>>     created timestamp,
>>>>>>     PRIMARY KEY (user_id, user_id_visitor)
>>>>>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>>>>>>
>>>>>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>>>>>>     SELECT user_id, created, user_id_visitor
>>>>>>     FROM visits.visits_in
>>>>>>     WHERE user_id IS NOT NULL AND created IS NOT NULL AND
>>>>>> user_id_visitor IS NOT NULL
>>>>>>     PRIMARY KEY (user_id, created, user_id_visitor)
>>>>>>     WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>>>>>>
>>>>>> This simply represents people, that visited my profile sorted by date
>>>>>> desc but only one entry per visitor.
>>>>>> Other examples with the same pattern could be a whats-app-like inbox
>>>>>> where the last message of each sender is shown by date desc. There are 
>>>>>> lots
>>>>>> of examples for that pattern.
>>>>>>
>>>>>> E.g. in redis I'd just use a sorted set, where the key could be like
>>>>>> "visits_${user_id}", set key would be user_id_visitor and score
>>>>>> the created timestamp.
>>>>>> In MySQL I'd create the table with PK on user_id + user_id_visitor
>>>>>> and create an index on user_id + created
>>>>>> In C* i use an MV.
>>>>>>
>>>>>> Is this the most efficient approach?
>>>>>> I also could have done this without an MV but then the situation in
>>>>>> our app would be far more complex.
>>>>>> I know that denormalization is a common pattern in C* and I don't
>>>>>> hesitate to use it but in this case, it is not as simple as it's not an
>>>>>> append-only case but updates have to be handled correctly.
>>>>>> If it is the first visit of a user, it's that simple, just 2 inserts
>>>>>> in base table + denormalized table. But on a 2nd or 3rd visit, the 1st or
>>>>>> 2nd visit has to be deleted from the denormalized table before. Otherwise
>>>>>> the visit would not be unique any more.
>>>>>> Handling this case without an MV requires a lot more effort, I guess
>>>>>> even more effort than just using an MV.
>>>>>> 1. You need kind of app-side locking to deal with race conditions
>>>>>> 2. Read before write is required to determine if an old record has to
>>>>>> be deleted
>>>>>> 3. At least CL_QUORUM is required to make sure that read before write
>>>>>> is always consistent
>>>>>> 4. Old record has to be deleted on update
>>>>>>
>>>>>> I guess, using an MV here is more efficient as there is less
>>>>>> roundtrip between C* and the app to do all that and the MV does not 
>>>>>> require
>>>>>> strong consistency as MV updates are always local and are eventual
>>>>>> consistent when the base table is. So there is also no need for 
>>>>>> distributed
>>>>>> locks.
>>>>>>
>>>>>> I ask all this as we now use CS 3.x and have been advised that 3.x is
>>>>>> still not considered really production ready.
>>>>>>
>>>>>> I guess in a perfect world, this wouldn't even require an MV if SASI
>>>>>> indexes could be created over more than 1 column. E.g. in MySQL this case
>>>>>> is nothing else than a BTree. AFAIK SASI indices are also BTrees, 
>>>>>> filtering
>>>>>> by Partition Key (which should to be done anyway) and sorting by a field
>>>>>> would perfectly do the trick. But from the docs, this is not possible 
>>>>>> right
>>>>>> now.
>>>>>>
>>>>>> Does anyone see a better solution or are all my assumptions correct?
>>>>>>
>>>>>> --
>>>>>> Benjamin Roth
>>>>>> Prokurist
>>>>>>
>>>>>> Jaumo GmbH · www.jaumo.com
>>>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>

Re: Efficient model for a sorting

Reply via email to