Re: Best way to do a multi_get using CQL

Marcelo Elias Del Valle Fri, 20 Jun 2014 10:33:06 -0700

A question, not sure if you guys know the answer:
Suppose I query 1000 rows asynchronously with a token-aware policy on a
10-node cluster, and suppose each node ends up receiving 100 of those row
queries.
How does async work in this case? Would each row query go to its node over
a different connection? As a different message?
I guess if there were a way to use a batch with async, then once you commit
the batch of 1000 queries, it would open 1 connection to each host and
query 100 rows in a single message per host.
This would decrease resource usage, or am I wrong?
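
For illustration, the pattern discussed in this thread (one single-partition
"select =" per key, sent with execute_async, in chunks of 100 keys as Jeremy
suggests below) could be sketched roughly as follows. This is only a sketch
under assumptions: `multi_get`, `chunked` and `chunk_size` are hypothetical
names, and `session` is assumed to behave like a Python driver Session, where
execute_async(statement, parameters) returns a future with a result() method.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def multi_get(session, statement, name, values, chunk_size=100):
    """Collect entity_ids for many (name, value) partitions.

    Assumption: `statement` is something like a prepared
    "SELECT entity_id FROM entity_lookup WHERE name=? AND value=?".
    """
    entity_ids = set()
    for chunk in chunked(values, chunk_size):
        # Send the whole chunk without waiting for any reply; which host and
        # connection each request uses is left to the driver.
        futures = [session.execute_async(statement, (name, v)) for v in chunk]
        # Drain this chunk before sending the next one, so that at most
        # `chunk_size` requests are in flight at a time.
        for future in futures:
            for row in future.result():
                entity_ids.add(row.entity_id)
    return entity_ids
```

With the real driver, `session` would come from a Cluster configured with a
token-aware load balancing policy. As Jon notes below, the driver can
multiplex queries over a single connection, so the 100 requests that land on
one node should not need 100 connections, although each one is still a
separate message.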


[]s


2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:

> I've found that if you have any amount of latency between your client and
> nodes, and you are executing a large batch of queries, you'll usually want
> to send them together to one node unless execution time is of no concern.
> The tradeoff is resource usage on the connected node vs. time to complete
> all the queries, because you'll need fewer client -> node network round
> trips.
>
> With large numbers of queries you will still want to make sure you split
> them into manageable batches before sending them, to control memory usage
> on the executing node. I've been limiting queries to batches of 100 keys in
> scenarios like this.
>
>
> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <michael.la...@nytimes.com
> > wrote:
>
>> However my extensive benchmarking this week of the python driver from
>> master shows a performance *decrease* when using 'token_aware'.
>>
>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>
>> Also why do the work the coordinator will do for you: send all the
>> queries, wait for everything to come back in whatever order, and sort the
>> result.
>>
>> I would rather keep my app code simple.
>>
>> But the real point is that you should benchmark in your own environment.
>>
>> ml
>>
>>
>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>> marc...@s1mbi0se.com.br> wrote:
>>
>>> Yes, I am using the CQL datastax drivers.
>>> It was good advice, thanks a lot Jonathan.
>>> []s
>>>
>>>
>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>
>>> The only case in which it might be better to use an IN clause is if
>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>> async.
>>>>
>>>> The native driver reuses connections and intelligently manages the
>>>> pool for you.  It can also multiplex queries over a single connection.
>>>>
>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>
>>>> Jon
>>>>
>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>> <marc...@s1mbi0se.com.br> wrote:
>>>> > This is interesting, I didn't know that!
>>>> > It might make sense then to use select = + async + token aware, I will
>>>> > try to change my code.
>>>> >
>>>> > But would it be a "recommended solution" for these cases? Any other
>>>> > options?
>>>> >
>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>> > random keys in a huge cluster. After all, the number of connections to
>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>> > Or when you use async the driver reuses the connection?
>>>> >
>>>> > []s
>>>> >
>>>> >
>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>> >
>>>> >> If you use async and your driver is token aware, it will go to the
>>>> >> proper node, rather than requiring the coordinator to do so.
>>>> >>
>>>> >> Realistically you're going to have a connection open to every server
>>>> >> anyways.  It's the difference between you querying for the data
>>>> >> directly and using a coordinator as a proxy.  It's faster to just ask
>>>> >> the node with the data.
>>>> >>
>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>>> >> > The justification in the docs is that I could end up querying many
>>>> >> > nodes, but I would still do that with async.
>>>> >> >
>>>> >> > Today, I use both async queries AND SELECT IN:
>>>> >> >
>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>>>> >> >
>>>> >> > for name, values in identifiers.items():
>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>>> >> >    args = [name] + values
>>>> >> >    query_msg = query % tuple(args)
>>>> >> >    futures.append((query_msg, self.session.execute_async(query, args)))
>>>> >> >
>>>> >> > for query_msg, future in futures:
>>>> >> >    try:
>>>> >> >       rows = future.result(timeout=100000)
>>>> >> >       for row in rows:
>>>> >> >         entity_ids.add(row.entity_id)
>>>> >> >    except:
>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>> >> >       raise
>>>> >> >
>>>> >> > Using async just with select = would mean that instead of 1 async query
>>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of the
>>>> >> > "values" array above.
>>>> >> > In my head, this would mean more connections to Cassandra and the same
>>>> >> > amount of work, right? What would be the advantage?
>>>> >> >
>>>> >> > []s
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>> >> >
>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>> >> >> straightforward w/ the java or python drivers.
>>>> >> >>
>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>> >> >> >
>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>> >> >> >
>>>> >> >> > Among them is
>>>> >> >> >
>>>> >> >> > SELECT ... IN or index lookups
>>>> >> >> >
>>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes) should be
>>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>>>> >> >> > and When not to use an index in Indexing in CQL for Cassandra 2.0"
>>>> >> >> >
>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>> >> >> >
>>>> >> >> > When not to use IN
>>>> >> >> >
>>>> >> >> > The recommendations about when not to use an index apply to using IN in
>>>> >> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>>>> >> >> > not recommended. Using IN can degrade performance because usually many
>>>> >> >> > nodes must be queried. For example, in a single, local data center
>>>> >> >> > cluster having 30 nodes, a replication factor of 3, and a consistency
>>>> >> >> > level of LOCAL_QUORUM, a single key query goes out to two nodes, but if
>>>> >> >> > the query uses the IN condition, the number of nodes being queried are
>>>> >> >> > most likely even higher, up to 20 nodes depending on where the keys fall
>>>> >> >> > in the token range."
>>>> >> >> >
>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>> >> >> >
>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>> >> >> >   'DC1' : 3 };
>>>> >> >> > USE Identification1;
>>>> >> >> >
>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>> >> >> >   name varchar,
>>>> >> >> >   value varchar,
>>>> >> >> >   entity_id uuid,
>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>> >> >> >
>>>> >> >> > And I use the following select to query it:
>>>> >> >> >
>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value
>>>> in(%s)
>>>> >> >> >
>>>> >> >> > Is this an anti-pattern?
>>>> >> >> >
>>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>>> >> >> > like that? I have several values I would like to search for in Cassandra,
>>>> >> >> > and they might not be in the same partition, as above.
>>>> >> >> >
>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>> >> >> >
>>>> >> >> > Best regards,
>>>> >> >> > Marcelo Valle.
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> Jon Haddad
>>>> >> >> http://www.rustyrazorblade.com
>>>> >> >> skype: rustyrazorblade
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Jon Haddad
>>>> >> http://www.rustyrazorblade.com
>>>> >> skype: rustyrazorblade
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Jon Haddad
>>>> http://www.rustyrazorblade.com
>>>> skype: rustyrazorblade
>>>>
>>>
>>>
>>
>
