A question, in case anyone knows the answer: suppose I query 1000 rows asynchronously with a token-aware policy on a 10-node cluster, and suppose the keys are spread so that each node receives 100 of the row queries. How does async work in this case? Does the driver send each row query to its node over a separate connection, or as a separate message? I guess that if there were a way to combine batching with async, then once you submitted the batch of 1000 queries, the driver would open 1 connection to each host and send the 100 row queries for that host in a single message. That would decrease resource usage, or am I wrong?
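To make the scenario concrete, here is a rough sketch of how I picture it. It is plain Python with no driver involved: the md5-modulo `owner` function and the `node0`..`node9` host names are made up for illustration and are not how the real partitioner or driver work. As far as I understand from what was said earlier in the thread about multiplexing, each async query would travel as its own request on a pooled connection to the owning node, tagged with a stream id, rather than being merged into one message per host:

```python
import hashlib
from collections import Counter

# Hypothetical host names standing in for a 10-node cluster.
NODES = [f"node{i}" for i in range(10)]


def owner(key, nodes=NODES):
    """Toy stand-in for token-aware routing: hash the key, pick a node.

    A real cluster uses its partitioner and token ring; this modulo
    mapping is only an assumption to keep the sketch self-contained.
    """
    token = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return nodes[token % len(nodes)]


def plan_requests(keys):
    """Return one (node, stream_id, key) triple per key.

    Stream ids are numbered per node, modelling many in-flight requests
    sharing a single connection to that node.
    """
    next_stream = Counter()
    plan = []
    for key in keys:
        node = owner(key)
        plan.append((node, next_stream[node], key))
        next_stream[node] += 1
    return plan


keys = [f"key-{i}" for i in range(1000)]
plan = plan_requests(keys)
per_node = Counter(node for node, _, _ in plan)

# 1000 separate requests in total, spread over at most 10 connections.
print(len(plan), "requests over", len(per_node), "connections")
```

In this model the 1000 queries stay 1000 separate requests, but they share at most 10 connections, so the per-node count is only roughly 100 and depends on how the keys hash.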
[]s

2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:

> I've found that if you have any amount of latency between your client and
> nodes, and you are executing a large batch of queries, you'll usually want
> to send them together to one node unless execution time is of no concern.
> The tradeoff is resource usage on the connected node vs. time to complete
> all the queries, because you'll need fewer client -> node network round
> trips.
>
> With large numbers of queries you will still want to make sure you split
> them into manageable batches before sending them, to control memory usage
> on the executing node. I've been limiting queries to batches of 100 keys in
> scenarios like this.
>
>
> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <michael.la...@nytimes.com>
> wrote:
>
>> However my extensive benchmarking this week of the python driver from
>> master shows a performance *decrease* when using 'token_aware'.
>>
>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>
>> Also why do the work the coordinator will do for you: send all the
>> queries, wait for everything to come back in whatever order, and sort the
>> result.
>>
>> I would rather keep my app code simple.
>>
>> But the real point is that you should benchmark in your own environment.
>>
>> ml
>>
>>
>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>> marc...@s1mbi0se.com.br> wrote:
>>
>>> Yes, I am using the CQL datastax drivers.
>>> It was good advice, thanks a lot Jonathan.
>>> []s
>>>
>>>
>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>
>>>> The only case in which it might be better to use an IN clause is if
>>>> the entire query can be satisfied from that machine. Otherwise, go
>>>> async.
>>>>
>>>> The native driver reuses connections and intelligently manages the
>>>> pool for you. It can also multiplex queries over a single connection.
>>>>
>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>
>>>> Jon
>>>>
>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>> <marc...@s1mbi0se.com.br> wrote:
>>>> > This is interesting, I didn't know that!
>>>> > It might make sense then to use select = + async + token aware, I will try
>>>> > to change my code.
>>>> >
>>>> > But would it be a "recommended solution" for these cases? Any other options?
>>>> >
>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>> > random keys in a huge cluster. After all, the amount of connections to
>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>> > Or when you use async the driver reuses the connection?
>>>> >
>>>> > []s
>>>> >
>>>> >
>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>> >
>>>> >> If you use async and your driver is token aware, it will go to the
>>>> >> proper node, rather than requiring the coordinator to do so.
>>>> >>
>>>> >> Realistically you're going to have a connection open to every server
>>>> >> anyways. It's the difference between you querying for the data
>>>> >> directly and using a coordinator as a proxy. It's faster to just ask
>>>> >> the node with the data.
>>>> >>
>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>>> >> > The justification in the docs is I could query many nodes, but I would
>>>> >> > still do it.
>>>> >> >
>>>> >> > Today, I use both async queries AND SELECT IN:
>>>> >> >
>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>>>> >> >
>>>> >> > for name, values in identifiers.items():
>>>> >> >     query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>>> >> >     args = [name] + values
>>>> >> >     query_msg = query % tuple(args)
>>>> >> >     futures.append((query_msg, self.session.execute_async(query, args)))
>>>> >> >
>>>> >> > for query_msg, future in futures:
>>>> >> >     try:
>>>> >> >         rows = future.result(timeout=100000)
>>>> >> >         for row in rows:
>>>> >> >             entity_ids.add(row.entity_id)
>>>> >> >     except:
>>>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>>>> >> >         raise
>>>> >> >
>>>> >> > Using async just with select = would mean instead of 1 async query
>>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of the
>>>> >> > "values" array above.
>>>> >> > In my head, this would mean more connections to Cassandra and the same
>>>> >> > amount of work, right? What would be the advantage?
>>>> >> >
>>>> >> > []s
>>>> >> >
>>>> >> >
>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>> >> >
>>>> >> >> Your other option is to fire off async queries. It's pretty
>>>> >> >> straightforward w/ the java or python drivers.
>>>> >> >>
>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>> >> >> >
>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>> >> >> >
>>>> >> >> > Among them is
>>>> >> >> >
>>>> >> >> > SELECT ... IN or index lookups
>>>> >> >> >
>>>> >> >> > "SELECT ... IN and index lookups (formerly secondary indexes) should be
>>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>>>> >> >> > and When not to use an index in Indexing in CQL for Cassandra 2.0"
>>>> >> >> >
>>>> >> >> > And looking at the SELECT doc, I saw:
>>>> >> >> >
>>>> >> >> > When not to use IN
>>>> >> >> >
>>>> >> >> > "The recommendations about when not to use an index apply to using IN in
>>>> >> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>>>> >> >> > not recommended. Using IN can degrade performance because usually many
>>>> >> >> > nodes must be queried. For example, in a single, local data center cluster
>>>> >> >> > having 30 nodes, a replication factor of 3, and a consistency level of
>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
>>>> >> >> > uses the IN condition, the number of nodes being queried are most likely
>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall in the token
>>>> >> >> > range."
>>>> >> >> >
>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>> >> >> >
>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>> >> >> >   'DC1' : 3 };
>>>> >> >> > USE Identification1;
>>>> >> >> >
>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>> >> >> >   name varchar,
>>>> >> >> >   value varchar,
>>>> >> >> >   entity_id uuid,
>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>> >> >> >
>>>> >> >> > And I use the following select to query it:
>>>> >> >> >
>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>> >> >> >
>>>> >> >> > Is this an anti-pattern?
>>>> >> >> >
>>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>>> >> >> > like that? I have several values I would like to search in Cassandra and
>>>> >> >> > they might not be in the same partition, as above.
>>>> >> >> >
>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>> >> >> >
>>>> >> >> > Best regards,
>>>> >> >> > Marcelo Valle.
>>>> >> >>
>>>> >> >> --
>>>> >> >> Jon Haddad
>>>> >> >> http://www.rustyrazorblade.com
>>>> >> >> skype: rustyrazorblade