Re: [web2py] DAL speed - an idea
There's a tree structure among the records, upon which the aggregation is computed. Some databases (e.g. Oracle) have extensions for tree-like structures (CONNECT BY etc.), but those are not standard, and I need to support both sqlite and postgres in this app.
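For what it's worth, both sqlite (3.8.3+) and postgres support WITH RECURSIVE as a portable alternative to Oracle's CONNECT BY, which can push at least the subtree-walking part of a tree aggregation into SQL. A minimal sketch using sqlite3 and a hypothetical self-referencing `node` table (the table and column names are made up for illustration):

```python
import sqlite3

# Hypothetical tree table: each node points at its parent via parent_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, amount INTEGER);
    INSERT INTO node VALUES (1, NULL, 10), (2, 1, 20), (3, 1, 30), (4, 2, 40);
""")

# Sum amounts over the whole subtree rooted at node 1 using a recursive
# common table expression -- works on both sqlite and postgres.
total = conn.execute("""
    WITH RECURSIVE subtree(id, amount) AS (
        SELECT id, amount FROM node WHERE id = 1
        UNION ALL
        SELECT n.id, n.amount FROM node n
        JOIN subtree s ON n.parent_id = s.id
    )
    SELECT SUM(amount) FROM subtree
""").fetchone()[0]
print(total)  # 10 + 20 + 30 + 40 = 100
```

Whether this covers the aggregation in question depends on whether it can be phrased as per-subtree SQL aggregates, of course.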
[web2py] DAL speed - an idea
One of my controllers needs to go through a lot of records to provide a meaningful answer -- as in, 60k records. Just loading them from the database takes about 100ms (db.executesql('select * from table order by id;')); doing the same through the DAL takes over 6 seconds. I realize that the DAL does a lot of additional work, which in general is helpful -- but I can do without all the parsing / Rows() generation for this. What do people here think about adding a db.rawselect(...), a slim wrapper for db.executesql(db._select()) that wraps everything in a named tuple? It solves most of the speed problem when it is needed, but still maintains a lot of the features of the DAL's SQL processing.
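To illustrate why a named tuple is an attractive row container for this: it gives the attribute access you get from Rows while costing no more per row than a bare tuple. A small standalone sketch (pure Python, independent of web2py; the `Row` type and its fields are hypothetical):

```python
import sys
from collections import namedtuple

# Hypothetical row type with two columns, as db.rawselect(...) might build
# from the cursor description.
Row = namedtuple('Row', ['id', 'b'])

tuple_row = (1, 'x')
nt_row = Row(1, 'x')
dict_row = {'id': 1, 'b': 'x'}  # what a dict-per-row result would hold

print(nt_row.b)                  # attribute access works: 'x'
print(sys.getsizeof(nt_row))     # named tuples use __slots__ = (), so they
print(sys.getsizeof(dict_row))   # stay tuple-sized; a dict per row is larger
```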
Re: [web2py] DAL speed - an idea
You mean using the generated SQL command to fire executesql and return the results as a named tuple? So db(db.query).raw_select() would be the same as doing:

command = db(db.query)._select()
result = db.executesql(command)

On Thu, Feb 9, 2012 at 4:51 PM, nick name <i.like.privacy@gmail.com> wrote:
[quoted message snipped]

--
Bruno Rocha [http://rochacbruno.com.br]
Re: [web2py] DAL speed - an idea
Yes, that is the basis of what I am suggesting. There is not currently such a thing; there is something called 'select_raw' implemented in the GoogleDataStore adapter, but not in anything else, and it isn't exactly what I am proposing.

To elaborate: assume the tables are defined as follows:

reftable = db.define_table('reftable', Field('a', 'string'))
table = db.define_table('table', Field('b', 'reference reftable'))

In my case, I need to pull all the records (60,000) from the database to compute some aggregation which I cannot compute using SQL. There are two alternatives here:

r1 = db().select(table.ALL)                # takes 6 seconds
r2 = db.executesql(db._select(table.ALL))  # takes ~0.1 sec

The records returned in the first instance are much richer; they have record chasing (e.g. I can do r1[0].b.a to select through the foreign key), they have methods like r1[0].update_record() and r1[0].delete_record(), and other nice stuff. However, for this use, I don't need the additional features, and I do need the speed, so I would rather use r2. However, r2 is not a direct replacement -- it doesn't have the column names. If I use

r3 = db.executesql(db._select(table.ALL), as_dict=True)  # still takes ~0.1 sec

I can do r3[0]['b'] but I cannot do r3[0].b; and it takes a lot more memory than r2.

A suggestion: add another parameter, processor=..., which, if given, will be called with the cursor's description, returning a function through which each row will be passed. Example:

def named_tuple_process(name, description):
    from collections import namedtuple
    fields = ' '.join([x[0] for x in description])
    return namedtuple(name, fields)._make  # _make accepts a whole row tuple

r4 = db.executesql(db._select(table.ALL),
                   processor=lambda d: named_tuple_process('tablerec', d))

r4[0].b  # will now work; not a full replacement, but good enough for many uses.
In fact, you can do that externally today:

r4 = db.executesql(db._select(table.ALL))
f = named_tuple_process('tablerec', db._adapter.cursor.description)
r4 = [f(x) for x in r4]

But this requires reaching into the internals of the db adapter. Finally, I propose defining x.raw_select(*args) to do db.executesql(x._select(*args)), which would make this a relatively clean replacement.
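To make the proposal concrete outside of web2py, here is a self-contained sketch of the processor hook against plain sqlite3. `executesql_with_processor` and the `t` table are hypothetical stand-ins for the DAL pieces, not the actual web2py API:

```python
import sqlite3
from collections import namedtuple

def named_tuple_process(name, description):
    # Build a row factory from a DB-API cursor.description
    # (each entry is a 7-tuple whose first item is the column name).
    fields = ' '.join(col[0] for col in description)
    return namedtuple(name, fields)._make  # _make accepts a whole row tuple

def executesql_with_processor(conn, sql, processor=None):
    # Stand-in for the proposed db.executesql(..., processor=...) hook:
    # processor gets the cursor description and returns a per-row function.
    cur = conn.execute(sql)
    rows = cur.fetchall()
    if processor is None:
        return rows
    make = processor(cur.description)
    return [make(row) for row in rows]

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (id INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 20)])

rows = executesql_with_processor(
    conn, "SELECT id, b FROM t ORDER BY id",
    processor=lambda d: named_tuple_process('tablerec', d))
print(rows[0].b)  # 10
```

The rows stay plain tuples under the hood, so the overhead over raw fetchall() is just one namedtuple construction per row.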
Re: [web2py] DAL speed - an idea
> In my case, I need to pull all the records (60,000) from the database to compute some aggregation which I cannot compute using SQL

Are you familiar with window functions in SQL? I've never met an aggregation need that couldn't be met with clever use of windows...
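For anyone unfamiliar with them, window functions let each row carry an aggregate computed over a partition of the table without collapsing the rows the way GROUP BY does. A small sketch with a made-up `sales` table (note this needs PostgreSQL 8.4+ or sqlite 3.25+):

```python
import sqlite3  # window functions require SQLite 3.25+ (or PostgreSQL 8.4+)

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE sales (grp TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('a', 10), ('a', 20), ('b', 5);
""")

# Every row keeps its identity but also carries its group's total --
# the kind of per-row + aggregate combination that otherwise forces
# pulling everything into Python.
rows = conn.execute("""
    SELECT grp, amount,
           SUM(amount) OVER (PARTITION BY grp) AS grp_total
    FROM sales
    ORDER BY grp, amount
""").fetchall()
print(rows)  # [('a', 10, 30), ('a', 20, 30), ('b', 5, 5)]
```

Whether this applies to the tree-structured aggregation in this thread is another question, but for flat partitions it often removes the need to post-process in Python at all.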
Re: [web2py] DAL speed - an idea
I've been thinking about something like this as well. Instead of a separate select_raw() method, maybe we can just add a raw=True|False argument to the existing select() method. I like the namedtuple idea as well (I think some adapters already provide that as an option -- e.g., psycopg2).

Anthony

On Thursday, February 9, 2012 3:04:41 PM UTC-5, nick name wrote:
[quoted message snipped]