Re: Best way to do a multi_get using CQL

Jeremy Jongsma Fri, 20 Jun 2014 08:23:18 -0700

I've found that if you have any amount of latency between your client and
nodes, and you are executing a large batch of queries, you'll usually want
to send them together to one node unless execution time is of no concern.
The tradeoff is higher resource usage on the connected node in exchange for
less time to complete all the queries, because you'll need fewer
client -> node network round trips.


With large numbers of queries you will still want to make sure you split
them into manageable batches before sending them, to control memory usage
on the executing node. I've been limiting queries to batches of 100 keys in
scenarios like this.
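
A minimal sketch of that batching, assuming the values are already in a
list and reusing the entity_lookup query quoted further down this thread.
The names chunked, build_in_queries and BATCH_SIZE are illustrative only;
they are not part of any driver:

```python
from typing import Iterator, List, Sequence, Tuple

BATCH_SIZE = 100  # the batch size mentioned above; tune for your own cluster


def chunked(keys: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `keys`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield list(keys[start:start + size])


def build_in_queries(name: str, values: Sequence[str],
                     size: int = BATCH_SIZE) -> List[Tuple[str, List[str]]]:
    """Return one (query, params) pair per batch of values."""
    queries = []
    for batch in chunked(values, size):
        # One %s placeholder per value in this batch.
        placeholders = ", ".join(["%s"] * len(batch))
        query = ("SELECT entity_id FROM entity_lookup "
                 "WHERE name = %s AND value IN (" + placeholders + ")")
        queries.append((query, [name] + batch))
    return queries


if __name__ == "__main__":
    values = ["v%d" % i for i in range(250)]
    for query, params in build_in_queries("email", values):
        # With the DataStax Python driver, each pair could be submitted with
        # session.execute_async(query, params); here we only print batch sizes.
        print(len(params) - 1)
```

Each pair costs one client -> node round trip, so 250 values become 3
requests instead of 250, while no single request carries more than 100 keys.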


On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <michael.la...@nytimes.com>
wrote:

> However my extensive benchmarking this week of the python driver from
> master shows a performance *decrease* when using 'token_aware'.
>
> This is on a 12-node, 2-datacenter, RF=3 cluster in AWS.
>
> Also, why do the work the coordinator will do for you: send all the
> queries, wait for everything to come back in whatever order, and sort the
> result?
>
> I would rather keep my app code simple.
>
> But the real point is that you should benchmark in your own environment.
>
> ml
>
>
> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
> marc...@s1mbi0se.com.br> wrote:
>
>> Yes, I am using the CQL datastax drivers.
>> That was good advice, thanks a lot, Jonathan.
>> []s
>>
>>
>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>
>>> The only case in which it might be better to use an IN clause is if
>>> the entire query can be satisfied from that machine.  Otherwise, go
>>> async.
>>>
>>> The native driver reuses connections and intelligently manages the
>>> pool for you.  It can also multiplex queries over a single connection.
>>>
>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>
>>> Jon
>>>
>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>> <marc...@s1mbi0se.com.br> wrote:
>>> > This is interesting, I didn't know that!
>>> > It might make sense then to use select = + async + token aware, I will
>>> > try to change my code.
>>> >
>>> > But would it be a "recommended solution" for these cases? Any other
>>> > options?
>>> >
>>> > I still wonder if this is the right use case for Cassandra, to look for
>>> > random keys in a huge cluster. After all, the number of connections to
>>> > Cassandra will still be huge, right? Wouldn't that be a problem?
>>> > Or does the driver reuse the connection when you use async?
>>> >
>>> > []s
>>> >
>>> >
>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>> >
>>> >> If you use async and your driver is token aware, it will go to the
>>> >> proper node, rather than requiring the coordinator to do so.
>>> >>
>>> >> Realistically you're going to have a connection open to every server
>>> >> anyways.  It's the difference between you querying for the data
>>> >> directly and using a coordinator as a proxy.  It's faster to just ask
>>> >> the node with the data.
>>> >>
>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>> >> <marc...@s1mbi0se.com.br> wrote:
>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>> >> > The justification in the docs is that I could query many nodes, but I
>>> >> > would still do it.
>>> >> >
>>> >> > Today, I use both async queries AND SELECT IN:
>>> >> >
>>> >> > SELECT_ENTITY_LOOKUP = ("SELECT entity_id FROM " + ENTITY_LOOKUP +
>>> >> >                         " WHERE name=%s and value in(%s)")
>>> >> >
>>> >> > for name, values in identifiers.items():
>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>> >> >    args = [name] + values
>>> >> >    query_msg = query % tuple(args)
>>> >> >    futures.append((query_msg, self.session.execute_async(query, args)))
>>> >> >
>>> >> > for query_msg, future in futures:
>>> >> >    try:
>>> >> >       rows = future.result(timeout=100000)
>>> >> >       for row in rows:
>>> >> >         entity_ids.add(row.entity_id)
>>> >> >    except:
>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>> >> >       raise
>>> >> >
>>> >> > Using async just with select = would mean instead of 1 async query
>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of
>>> >> > "values" array above.
>>> >> > In my head, this would mean more connections to Cassandra and the same
>>> >> > amount of work, right? What would be the advantage?
>>> >> >
>>> >> > []s
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>> >> >
>>> >> >> Your other option is to fire off async queries.  It's pretty
>>> >> >> straightforward w/ the java or python drivers.
>>> >> >>
>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>> >> >> > I was taking a look at the Cassandra anti-patterns list:
>>> >> >> >
>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>> >> >> >
>>> >> >> > Among them is:
>>> >> >> >
>>> >> >> > "SELECT ... IN or index lookups
>>> >> >> >
>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes) should be
>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>>> >> >> > and When not to use an index in Indexing in CQL for Cassandra 2.0"
>>> >> >> >
>>> >> >> > And looking at the SELECT doc, I saw:
>>> >> >> >
>>> >> >> > "When not to use IN
>>> >> >> >
>>> >> >> > The recommendations about when not to use an index apply to using IN in
>>> >> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>>> >> >> > not recommended. Using IN can degrade performance because usually many
>>> >> >> > nodes must be queried. For example, in a single, local data center
>>> >> >> > cluster having 30 nodes, a replication factor of 3, and a consistency
>>> >> >> > level of LOCAL_QUORUM, a single key query goes out to two nodes, but if
>>> >> >> > the query uses the IN condition, the number of nodes being queried are
>>> >> >> > most likely even higher, up to 20 nodes depending on where the keys fall
>>> >> >> > in the token range."
>>> >> >> >
>>> >> >> > In my system, I have a column family called "entity_lookup":
>>> >> >> >
>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>> >> >> >   'DC1' : 3 };
>>> >> >> > USE Identification1;
>>> >> >> >
>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>> >> >> >   name varchar,
>>> >> >> >   value varchar,
>>> >> >> >   entity_id uuid,
>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>> >> >> >
>>> >> >> > And I use the following select to query it:
>>> >> >> >
>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>> >> >> >
>>> >> >> > Is this an anti-pattern?
>>> >> >> >
>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>> >> >> > like that? I have several values I would like to search for in Cassandra,
>>> >> >> > and they might not be in the same partition, as above.
>>> >> >> >
>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>> >> >> >
>>> >> >> > Best regards,
>>> >> >> > Marcelo Valle.
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Jon Haddad
>>> >> >> http://www.rustyrazorblade.com
>>> >> >> skype: rustyrazorblade
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jon Haddad
>>> >> http://www.rustyrazorblade.com
>>> >> skype: rustyrazorblade
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jon Haddad
>>> http://www.rustyrazorblade.com
>>> skype: rustyrazorblade
>>>
>>
>>
>
