Re: [web2py] DAL speed - an idea

2012-02-10 Thread nick name
There's a tree structure among the records, upon which the aggregation is 
computed.

Some dbs (e.g. oracle) have extensions for tree-like structures (CONNECT BY 
etc.), but those are not standard, and I need to support both sqlite and 
postgres in this app.
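
(The closest portable construct would be a recursive CTE -- postgres supports 
WITH RECURSIVE since 8.4, but sqlite currently doesn't, which is part of the 
problem. For illustration, a subtree sum over a hypothetical 
nodes(id, parent_id, value) table would look like this sketch:)

# hypothetical schema: nodes(id, parent_id, value), parent_id self-references
# WITH RECURSIVE is standard SQL; postgres 8.4+ has it, sqlite doesn't yet
rows = db.executesql("""
    WITH RECURSIVE subtree(id, value) AS (
        SELECT id, value FROM nodes WHERE id = 1  -- subtree root
        UNION ALL
        SELECT n.id, n.value
        FROM nodes n JOIN subtree s ON n.parent_id = s.id
    )
    SELECT SUM(value) FROM subtree;
""")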



[web2py] DAL speed - an idea

2012-02-09 Thread nick name
One of my controllers needs to go through a lot of records to provide a 
meaningful answer -- as in, 60k records.

Just loading them from the database takes about 100ms 
(db.executesql("select * from table order by id;")); doing the same through 
the DAL takes over 6 seconds. I realize that the DAL does do a lot of 
additional work, which in general is helpful -- but I can do without all 
the parsing / Rows() generation for this.
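
(For the curious, the comparison was along these lines -- a rough sketch, 
with db.table standing in for the real table; exact timings of course vary:)

import time

t0 = time.time()
raw = db.executesql("select * from table order by id;")
print("raw executesql: %.3fs" % (time.time() - t0))   # ~0.1s here

t0 = time.time()
rows = db().select(db.table.ALL)
print("DAL select:     %.3fs" % (time.time() - t0))   # ~6s here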

What do people here think about adding a db.rawselect(...), which is a slim 
wrapper for db.executesql(db()._select(...)) that wraps everything with a 
named tuple? It solves most of the speed problem when it is needed, but 
still maintains a lot of the features of the DAL's SQL processing.



Re: [web2py] DAL speed - an idea

2012-02-09 Thread Bruno Rocha
You mean using the generated SQL command to fire executesql and return the
results as a named tuple?

 db(db.query).raw_select()

it is the same as doing

command = db(db.query)._select()
result = db.executesql(command)


-- 

Bruno Rocha
[http://rochacbruno.com.br]


Re: [web2py] DAL speed - an idea

2012-02-09 Thread nick name
Yes, that is the basis of what I am suggesting.

There is not currently such a thing; there is something called 'select_raw' 
implemented in the GoogleDataStore adapter, but not in anything else, and 
it isn't exactly what I am proposing.

To elaborate:

Assume the table is defined as follows:
   
reftable = db.define_table('reftable', Field('a', 'string'))
table = db.define_table('table', Field('b', 'reference reftable'))

In my case, I need to pull all the records (60,000) from the database to 
compute some aggregation which I cannot compute using sql. There are two 
alternatives here:

r1 = db().select(table.ALL) # takes > 6 seconds

r2 = db.executesql(db()._select(table.ALL)) # takes ~0.1sec

The records returned in the first instance are much richer; they have 
record chasing (e.g. I can do r1[0].b.a to select through the foreign key), 
they have methods like r1[0].update_record() and r1[0].delete_record(), and 
other nice stuff.

However, for this use, I don't need the additional features, and I do need 
the speed, so I would rather use r2. But r2 is not a direct 
replacement -- it doesn't have the column names. If I use

r3 = db.executesql(db()._select(table.ALL), as_dict=True) # still takes 
~0.1sec

I can do r3[0]['b'] but I cannot do r3[0].b; and it takes a lot more memory 
than r2.

A suggestion: add another parameter, processor=..., which, if given, will be 
called with the description of db.connection.cursor and should return a 
function through which each row will be passed; example:

def named_tuple_process(name, description):
    # build a namedtuple class from a DB-API cursor.description
    from collections import namedtuple
    fields = ' '.join([x[0] for x in description])  # column names
    return namedtuple(name, fields)

r4 = db.executesql(db()._select(table.ALL),
                   processor=lambda x: named_tuple_process('tablerec', x))

r4[0].b # will now work; not a full replacement, but good enough for many 
uses.

In fact, you can do that externally:

r4 = db.executesql(db()._select(table.ALL))
f = named_tuple_process('tablerec', db._adapter.cursor.description)
r4 = [f(*x) for x in r4]

But this requires reaching into the internals of the db adapter.

Finally, I propose to define x.raw_select(*args) to do: 
db.executesql(x._select(*args))

which would make this a relatively clean replacement.
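
A minimal sketch of that (untested; it assumes web2py's Set keeps its .db 
reference, as in the current gluon/dal.py, and reuses the named_tuple_process 
helper above):

from gluon.dal import Set

def raw_select(self, *fields, **attributes):
    # generate the same SQL the DAL would run, but skip Rows() parsing
    sql = self._select(*fields, **attributes)
    rows = self.db.executesql(sql)
    # wrap each raw tuple in a namedtuple built from the cursor description
    rec = named_tuple_process('rec', self.db._adapter.cursor.description)
    return [rec(*row) for row in rows]

Set.raw_select = raw_select

r5 = db(table.id > 0).raw_select(table.ALL)
r5[0].b  # attribute access, at roughly executesql speed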


Re: [web2py] DAL speed - an idea

2012-02-09 Thread Willoughby
> In my case, I need to pull all the records (60,000) from the database to
> compute some aggregation which I cannot compute using sql

Are you familiar with window functions in SQL?  I've never met an 
aggregation need that couldn't be met with clever use of windows...
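
For instance, something like this (a sketch in postgres syntax, against the 
hypothetical nodes table from earlier in the thread) gets you a per-parent 
running total in a single pass, with no client-side loop:

rows = db.executesql("""
    SELECT id, parent_id, value,
           SUM(value) OVER (PARTITION BY parent_id ORDER BY id) AS running_total
    FROM nodes;
""")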




Re: [web2py] DAL speed - an idea

2012-02-09 Thread Anthony
I've been thinking about something like this as well. Instead of a separate 
select_raw() method, maybe we can just add a raw=True|False argument to the 
existing select() method. I like the namedtuple idea as well (I think some 
adapters already provide that as an option -- e.g., psycopg2).
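
For reference, the psycopg2 option is NamedTupleCursor (a sketch outside the 
DAL, with a hypothetical connection string and table name):

import psycopg2
from psycopg2.extras import NamedTupleCursor  # available since psycopg2 2.3

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
cur = conn.cursor(cursor_factory=NamedTupleCursor)
cur.execute("select * from mytable;")
for row in cur.fetchall():
    print(row.b)  # plain namedtuple attribute access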

Anthony
