Re: queryset caching note in docs

2015-03-18 Thread Asif Saifuddin
assigned myself to the ticket https://code.djangoproject.com/ticket/16614

will give akaariai's patch a try




Re: queryset caching note in docs

2011-11-03 Thread Marco Paolini

On 04/11/2011 03:05, Anssi Kääriäinen wrote:

On Nov 4, 3:38 am, Marco Paolini  wrote:

Postgresql:
.chunked(): 26716.0kB
.iterator(): 46652.0kB


what if you use .chunked().iterator() ?


Quick test shows that the actual memory used by the queryset is around
1.2Mb. Using a smaller fetch size than the default 2000 would result in
less memory used, but even this isn't bad at all. The test is:

i = 0
print memory()
for obj in User.objects.all().order_by('pk')[0:1].chunked():
    i += 1
    if i % 1000 == 0:
        print memory()

Each instance has a 1000 char wide text field.

Before the loop the usage is:
25584.0kB
In loop the memory use stays at:
26716.0kB

These are process total usages, and there is likely a lot of error in
the numbers. Anyways, the memory usage is most likely somewhere around
1-3MB.


tested your patch for memory usage, looks good
 
sqlite on a 1000 record queryset


76353536.0 .iterator()

74825728.0 .chunked()

sqlite on a 1 record qs:

99713024.0 .iterator()

79847424.0 .chunked()

~20% memory usage decrease

Marco




Re: queryset caching note in docs

2011-11-03 Thread Anssi Kääriäinen
On Nov 4, 3:38 am, Marco Paolini  wrote:
> > Postgresql:
> > .chunked(): 26716.0kB
> > .iterator(): 46652.0kB
>
> what if you use .chunked().iterator() ?

Quick test shows that the actual memory used by the queryset is around
1.2Mb. Using a smaller fetch size than the default 2000 would result in
less memory used, but even this isn't bad at all. The test is:

i = 0
print memory()
for obj in User.objects.all().order_by('pk')[0:1].chunked():
    i += 1
    if i % 1000 == 0:
        print memory()

Each instance has a 1000 char wide text field.

Before the loop the usage is:
25584.0kB
In loop the memory use stays at:
26716.0kB

These are process total usages, and there is likely a lot of error in
the numbers. Anyways, the memory usage is most likely somewhere around
1-3MB.
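For reference, the memory() helper used in these snippets is not part of Django;
a minimal sketch (an assumption, reading the resident set size from /proc on Linux)
would be:

def memory():
    # Hypothetical helper, not the exact one used in these tests:
    # return the current process' resident set size as reported by the kernel.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return line.split(':', 1)[1].strip()
    return 'unknown'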




Re: queryset caching note in docs

2011-11-03 Thread Anssi Kääriäinen
On Nov 4, 3:29 am, Marco Paolini  wrote:
> where/when do we close() cursors, or do we rely on the cursor's __del__() 
> implementation?
I guess we should rely on it going away when it happens to go away
(that is, __del__ way).

>
> postgres named cursors can't be used in autocommit mode [1]

I don't know if there is any other way than documentation + checking
this in postgresql backend's chunked_cursor method. I think there is
actually a more severe limitation: the cursor is not usable outside
the transaction. Or you will need to create it as a "with hold" cursor,
which means it can live even outside transactions, but then closing it
is really important. This is according to the psycopg2 documentation.
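A minimal sketch of the "with hold" variant mentioned above, using psycopg2
directly (connection string and table name are made up); the key point is that
the cursor survives commit but must be closed explicitly:

import psycopg2

conn = psycopg2.connect("dbname=test")                 # hypothetical DSN
cur = conn.cursor(name='big_read', withhold=True)      # DECLARE ... WITH HOLD
cur.itersize = 2000                                    # rows per server round trip
cur.execute("SELECT id, name FROM entry")              # hypothetical table
conn.commit()                                          # cursor stays usable after commit
try:
    for row in cur:
        pass  # process row here
finally:
    cur.close()   # closing is what releases the server-side resources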

I really think this is documentation / sanity checking
stuff. .chunked() is there because it can have weird requirements /
backend specific behavior. Making it behave exactly the same way on
every backend, without transaction etc requirements will make this
feature really, really hard to implement. There is a tradeoff here
between what can be achieved easily, and per backend requirements.

>
> psycopg fetches 2000 rows at a time

Luckily that is alterable; we can make it fetch 100 rows at a time if
we want. I would think that 2000 rows strikes a good balance between
memory usage and network round trips. Maybe chunked could take a
chunk_size kwarg, but that can wait for later.

 - Anssi




Re: queryset caching note in docs

2011-11-03 Thread Anssi Kääriäinen
On Nov 4, 3:38 am, Marco Paolini  wrote:
> what if you use .chunked().iterator() ?

You can't. .chunked() returns a generator. Note that the memory usage
is total memory usage for the process, not for the query. The memory
usage for the query is probably just a small part of the total memory
usage.

The .chunked will basically alter the qs.query to use a chunked_cursor
and avoid the fetching of all results when using sqlite3. Then it
calls .iterator().

 - Anssi




Re: queryset caching note in docs

2011-11-03 Thread Marco Paolini

On 04/11/2011 01:50, Anssi Kääriäinen wrote:

On Nov 4, 1:20 am, Marco Paolini  wrote:

time to write some patches, now!


Here is a proof of concept for one way to achieve chunked reads when
using psycopg2. This lacks tests and documentation. I think the
approach is sane, though. It allows different database backends to
decide how to do the chunked reads, and there is a qs method
to use when you want extreme caching avoidance - even if it means non-
safe behavior in sqlite3 / using named cursors in postgresql.

For SQLite, results for User.objects.all()[0:1]
.chunked(): 25852.0kB
.iterator(): 65508.0kB
Modifications will be seen by the iterator.

Postgresql:
.chunked(): 26716.0kB
.iterator(): 46652.0kB

what if you use .chunked().iterator() ?

Marco




Re: queryset caching note in docs

2011-11-03 Thread Marco Paolini

On 04/11/2011 01:50, Anssi Kääriäinen wrote:

On Nov 4, 1:20 am, Marco Paolini  wrote:

time to write some patches, now!


Here is a proof of concept for one way to achieve chunked reads when
using psycopg2. This lacks tests and documentation. I think the
approach is sane, though. It allows different database backends to
decide how to do the chunked reads, and there is a qs method
to use when you want extreme caching avoidance - even if it means non-
safe behavior in sqlite3 / using named cursors in postgresql.

For SQLite, results for User.objects.all()[0:1]
.chunked(): 25852.0kB
.iterator(): 65508.0kB
Modifications will be seen by the iterator.

Postgresql:
.chunked(): 26716.0kB
.iterator(): 46652.0kB

MySQL should not have any changes between chunked <-> iterator,
neither should Oracle.

I would write tests for this, but it seems a bit hard - how to test if
the backend fetched the objects in one go or not? You could test
memory usage, but that seems brittle, or you could test internals of
the backend, but that is ugly and brittle. Ideas?

The patch is at:
https://github.com/akaariai/django/commit/8990e20df50ce110fe6ddbbdfed7a98987bb5835

nice patch!

where/when do we close() cursors, or we rely on cursor __del__() implementation?

postgres named cursors can't be used in autocommit mode [1]

psycopg fetches 2000 rows at the time

[1] http://initd.org/psycopg/docs/usage.html#server-side-cursors
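In other words, the plain (non "with hold") server-side cursor only works inside
a transaction; a rough sketch with psycopg2 (connection details and table are
hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=test")      # hypothetical DSN
# A plain named cursor only lives inside a transaction, so autocommit
# must stay off (which is psycopg2's default).
cur = conn.cursor(name='stream')            # server-side (named) cursor
cur.execute("SELECT id, name FROM entry")   # hypothetical table
while True:
    rows = cur.fetchmany(2000)              # rows come over in chunks of 2000
    if not rows:
        break
    # process rows here
cur.close()
conn.commit()                               # the cursor is gone once the transaction ends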

Marco




Re: queryset caching note in docs

2011-11-03 Thread Anssi Kääriäinen
On Nov 4, 1:20 am, Marco Paolini  wrote:
> time to write some patches, now!

Here is a proof of concept for one way to achieve chunked reads when
using psycopg2. This lacks tests and documentation. I think the
approach is sane, though. It allows different database backends to
decide how to do the chunked reads, and there is a qs method
to use when you want extreme caching avoidance - even if it means non-
safe behavior in sqlite3 / using named cursors in postgresql.

For SQLite, results for User.objects.all()[0:1]
.chunked(): 25852.0kB
.iterator(): 65508.0kB
Modifications will be seen by the iterator.

Postgresql:
.chunked(): 26716.0kB
.iterator(): 46652.0kB

MySQL should not have any changes between chunked <-> iterator,
neither should Oracle.

I would write tests for this, but it seems a bit hard - how to test if
the backend fetched the objects in one go or not? You could test
memory usage, but that seems brittle, or you could test internals of
the backend, but that is ugly and brittle. Ideas?

The patch is at:
https://github.com/akaariai/django/commit/8990e20df50ce110fe6ddbbdfed7a98987bb5835
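The patch itself is at the URL above; purely as an illustration of the idea (not
the actual patch), a backend-agnostic chunked read at the DB-API level boils down
to a fetchmany() loop over whatever cursor the backend hands out for chunked reads:

def chunked_rows(cursor, sql, params=(), chunk_size=2000):
    # Illustration only: stream rows in fixed-size chunks. The cursor is
    # assumed to be one suited to chunked reads on the backend in use
    # (e.g. a named cursor on PostgreSQL).
    cursor.execute(sql, params)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row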

 - Anssi




Re: queryset caching note in docs

2011-11-03 Thread Marco Paolini

...

The SQLite3 shared cache mode seems to suffer from the same problem
as mysql:

"""
At any one time, a single table may have any number of active read-
locks or a single active write lock. To read data from a table, a
connection must first obtain a read-lock. To write to a table, a
connection must obtain a write-lock on that table. If a required table
lock cannot be obtained, the query fails and SQLITE_LOCKED is returned
to the caller.
"""

So, if one cursor (which is a separate connection) holds an open read on a
table, and you try to modify that table from another cursor while the read
is still in progress, it seems you will deadlock.

You could use the read uncommitted isolation level, but then you will
have the same problem you have now when using fetchmany() - updates
might be seen in the other cursor's objects. The SQLite transaction
isolation implementation is not MVCC. It is likely it simply does not
support multiple row versions, and this means you can't update a table
and expect to still get the old version from the server in another
cursor.

yeah, you're right about the locking issues we'd hit if using sqlite3 in
shared cache mode as described above.

so eventually a .chunked() queryset implementation for sqlite3
has to live with the fact that cursors are not isolated and
if you update a table you can get "freshly" updated objects in other querysets

time to write some patches, now!

Marco




Re: queryset caching note in docs

2011-11-03 Thread Anssi Kääriäinen


On Nov 3, 11:09 pm, Marco Paolini  wrote:
> > Now, calling the .iterator() directly is not safe on SQLite3. If you
> > do updates to objects not seen by the iterator yet, you will see those
> > changes. On MySQL, all the results are fetched into Python in one go,
> > and the only saving is from not populating the _results_cache. I guess
> > Oracle will just work with the .iterator() approach. Without
> > the .iterator() call, it will consume the memory, as ._results_cache
> > will be populated.
>
> it *is* safe to call .iterator with sqlite because our ORM fetches all
> rows into memory (see below and django/django/db/models/sql/compiler.py)

Doh, a little miss in my reading of the code; at the end of
compiler.py's execute_sql there is this little snippet:
"""
if not self.connection.features.can_use_chunked_reads:
# If we are using non-chunked reads, we return the same
data
# structure as normally, but ensure it is all read into
memory
# before going any further.
return list(result)
"""

That might explain something... :)

The test data, which shows a big memory saving when using fetchmany()
as opposed to fetchall() in compiler.py for PostgreSQL, is probably
because even if the data is fetched from the server in one go, it is
transformed into Python objects in batches by fetchmany(). I am not sure
of this. This confused me, sorry for that.

So, what you said before is correct:
  - mysql supports chunked fetch but will lock the table while
fetching is in progress (likely causing deadlocks)
  - postgresql does not seem to suffer this issue and chunked fetch
seems doable (not trivial) using named cursor
  - oracle does chunked fetch already (someone confirm this, please)
  - sqlite3 COULD do chunked fetch by using one connection per cursor
(otherwise cursors will not be isolated)

The SQLite3 shared cache mode seems to suffer from the same problem
as mysql:

"""
At any one time, a single table may have any number of active read-
locks or a single active write lock. To read data from a table, a
connection must first obtain a read-lock. To write to a table, a
connection must obtain a write-lock on that table. If a required table
lock cannot be obtained, the query fails and SQLITE_LOCKED is returned
to the caller.
"""

So, if one cursor (which is a separate connection) holds an open read on a
table, and you try to modify that table from another cursor while the read
is still in progress, it seems you will deadlock.

You could use the read uncommitted isolation level, but then you will
have the same problem you have now when using fetchmany() - updates
might be seen in the other cursor's objects. The SQLite transaction
isolation implementation is not MVCC. It is likely it simply does not
support multiple row versions, and this means you can't update a table
and expect to still get the old version from the server in another
cursor.

 - Anssi




Re: queryset caching note in docs

2011-11-03 Thread Marco Paolini

Now, calling the .iterator() directly is not safe on SQLite3. If you
do updates to objects not seen by the iterator yet, you will see those
changes. On MySQL, all the results are fetched into Python in one go,
and the only saving is from not populating the _results_cache. I guess
Oracle will just work with the .iterator() approach. Without
the .iterator() call, it will consume the memory, as ._results_cache
will be populated.

it *is* safe to call .iterator with sqlite because our ORM fetches all
rows into memory (see below and django/django/db/models/sql/compiler.py)



I would like to introduce a new method (wrapping .iterator() and doing
sanity checking) for this. Allowing this to work on PostgreSQL would
be a big plus in my book, and I think Oracle would benefit from this,
too. SQLite3 should fetch all the data in one go (to get rid of update
effects on the results) and MySQL could use this, too, although it
would mean there is no large benefit for this. MySQL and SQLite3 would
benefit from not needing to create the objects, which could in some
cases be a bigger win than avoiding the cursor's row cache.

There would need to be sanity checks for at least prefetch_related (as
that will first need to fetch all the results, and then do the related
fetching). I don't know of other sanity checks needed.

This would of course need documentation detailing the caveats. In
addition, it would be good to document that if at all possible, try to
avoid the pattern:
for obj in Model.objects.all():
    obj.field = changed_value
    obj.save()

That is going to be really expensive if the result set is large,
with .iterator() or not. Basically you are going to do N selects and N
updates. Even if you use force_update, you will still be doing N
updates, which is expensive. Another matter is that Django could be a
lot wiser here, doing automatically the update if the object is
fetched from the DB, and not doing an update at all if there are no
changes to field values.

I think this is the developer's job, not the framework's



If a new method for this is not introduced, the other choice is to
document the .iterator() trick directly, and let the user beware of
the possible problems for SQLite3 / prefetch_related. Then there is
still the possibility of keeping this as an undocumented trick.

Sorry but I really think you are on the wrong track, maybe because the thread is
bending further and further away from its starting point...

.iterator() disables queryset *model instance* caching; it's documented and
works all right, even in sqlite.

the problem lies one layer below, in the sqlite3, psycopg2 and MySQLdb adapters

when our ORM calls fetchmany(100), both psycopg2 and MySQLdb fetch *all* rows
from the db server in one go at the very first fetchmany(100) call.
All subsequent fetchmany(100) calls will *not* hit the db server, only the local
backend cache.

sqlite3, instead, only fetches 100 rows (actually it fetches 101, like the
dalmatians, and keeps one for the next call), but our ORM knows that two
cursors are not isolated, so it keeps calling fetchmany(100) and stores all
rows in a single list

net effect is: the very first time a queryset hits the database,
rows are all loaded into memory by our DB backends

oracle works as expected: fetchmany(100) only fetches 100 rows from the db 
server (well
actually may be some more depending on driver settings)

even if we use .iterator()

Marco




Re: queryset caching note in docs

2011-11-03 Thread Anssi Kääriäinen


On Nov 3, 1:06 am, I wrote:
> I did a little testing. It seems you can get the behavior you want if you 
> just do this in PostgreSQL:
> for obj in Model.objects.all().iterator(): # Note the extra .iterator()
>     # handle object here.

> I would sure like a verification to this test, I am tired and this seems like 
> too easy of a fix. Or am I missing the problem?
Quoting myself:
"""
I did a little testing. It seems you can get the behavior you want if
you just do this in PostgreSQL:
for obj in Model.objects.all().iterator():  # Note the extra .iterator()
    # handle object here.



I would sure like a verification to this test, I am tired and this
seems like too easy of a fix. Or am I missing the problem?
"""

I did some more testing, and I think that PostgreSQL will do the right
thing if you use the .iterator() trick, even without named cursors. I
tested combinations of changing User.objects.all().iterator() <->
User.objects.all(), and in compiler.py changing fetchmany(FETCH_SIZE) <->
fetchall().

The testcase is 2 objects with a largish (1000 char) wide text
field fetched in a loop:
for obj in User.objects.all():  # or .all().iterator()
    print memory()

The results for .all() without iterator() and standard .fetchmany():
68600.0kB -> 147300.0kB

The results for .all() with iterator() and standard .fetchmany():
64960.0kB -> 64960.0kB
Here the first print memory() is called after first fetch, so that is
why there is no memory difference at all.

The results for .all() without iterator() and .fetchall():
143176.0kB -> 149232.0kB
Here the strings take almost all of the space, and filling
the ._results_cache with objects will take the last 6000kB.

Last, the results for .all() with iterator() and .fetchall():
142912.0kB -> 142912.0kB

Now, calling the .iterator() directly is not safe on SQLite3. If you
do updates to objects not seen by the iterator yet, you will see those
changes. On MySQL, all the results are fetched into Python in one go,
and the only saving is from not populating the _results_cache. I guess
Oracle will just work with the .iterator() approach. Without
the .iterator() call, it will consume the memory, as ._results_cache
will be populated.

I would like to introduce a new method (wrapping .iterator() and doing
sanity checking) for this. Allowing this to work on PostgreSQL would
be a big plus in my book, and I think Oracle would benefit from this,
too. SQLite3 should fetch all the data in one go (to get rid of update
effects on the results) and MySQL could use this, too, although it
would mean there is no large benefit for this. MySQL and SQLite3 would
benefit from not needing to create the objects, which could in some
cases be a bigger win than avoiding the cursor's row cache.

There would need to be sanity checks for at least prefetch_related (as
that will first need to fetch all the results, and then do the related
fetching). I don't know of other sanity checks needed.

This would of course need documentation detailing the caveats. In
addition, it would be good to document that if at all possible, try to
avoid the pattern:
for obj in Model.objects.all():
    obj.field = changed_value
    obj.save()

That is going to be really expensive if the result set is large,
with .iterator() or not. Basically you are going to do N selects and N
updates. Even if you use force_update, you will still be doing N
updates, which is expensive. Another matter is that Django could be a
lot wiser here, doing automatically the update if the object is
fetched from the DB, and not doing an update at all if there are no
changes to field values.
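For the common case where the same new value applies to every row, the loop
above can be replaced with a single bulk UPDATE, which avoids both the N selects
and the N updates (model and field names are placeholders):

# Instead of looping and saving each instance:
#     for obj in Model.objects.all():
#         obj.field = changed_value
#         obj.save()
# issue one UPDATE for the whole set:
Model.objects.all().update(field=changed_value)

This of course only helps when the new value does not depend on per-row Python
logic.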

If a new method for this is not introduced, the other choice is to
document the .iterator() trick directly, and let the user beware of
the possible problems for SQLite3 / prefetch_related. Then there is
still the possibility of keeping this as an undocumented trick.

 - Anssi




Re: queryset caching note in docs

2011-11-03 Thread Tom Evans
On Thu, Nov 3, 2011 at 2:14 AM, Javier Guerra Giraldez
 wrote:
> this seems to be the case with MyISAM tables; on the InnoDB engine
> docs, it says that SELECT statements don't set any lock, since it
> reads from a snapshot of the table.
>
> on MyISAM, there are (clumsy) workarounds by forcing the use of
> scratch tables, explicitly copying to temporary tables, or buffering
> the output.
>
> """
> SELECT ... FROM is a consistent read, reading a snapshot of the
> database and setting no locks unless the transaction isolation level
> is set to SERIALIZABLE. For SERIALIZABLE level, the search sets shared
> next-key locks on the index records it encounters.
> """
> (http://dev.mysql.com/doc/refman/5.0/en/innodb-locks-set.html)
>
>

Still, in your use case you wanted to update the table whilst reading
a query row by row from the server. Updating the table would mean
committing the current transaction, which would not be possible as the
mysql server still hasn't finished sending you the results. You would
get a "Commands out of sync" error.

Cheers

Tom




Re: queryset caching note in docs

2011-11-02 Thread Javier Guerra Giraldez
On Wed, Nov 2, 2011 at 11:33 AM, Tom Evans  wrote:
>> other connections in other transactions are locked too?
>
> Yes. The exact wording from the C API:
>
> """
> On the other hand, you shouldn't use mysql_use_result() if you are
> doing a lot of processing for each row on the client side, or if the
> output is sent to a screen on which the user may type a ^S (stop
> scroll). This ties up the server and prevents other threads from
> updating any tables from which the data is being fetched.
> """

this seems to be the case with MyISAM tables; on the InnoDB engine
docs, it says that SELECT statements don't set any lock, since it
reads from a snapshot of the table.

on MyISAM, there are (clumsy) workarounds by forcing the use of
scratch tables, explicitly copying to temporary tables, or buffering
the output.

"""
SELECT ... FROM is a consistent read, reading a snapshot of the
database and setting no locks unless the transaction isolation level
is set to SERIALIZABLE. For SERIALIZABLE level, the search sets shared
next-key locks on the index records it encounters.
"""
(http://dev.mysql.com/doc/refman/5.0/en/innodb-locks-set.html)


-- 
Javier




RE: queryset caching note in docs

2011-11-02 Thread Kääriäinen Anssi
"""
so, summarizing again:
  - mysql supports chunked fetch but will lock the table while fetching is in 
progress (likely causing deadlocks)
  - postgresql does not seem to suffer this issue and chunked fetch seems 
doable (not trivial) using named cursor
  - oracle does chunked fetch already (someone confirm this, please)
  - sqlite3 COULD do chunked fetch by using one connection per cursor 
(otherwise cursors will not be isolated)
"""

I did a little testing. It seems you can get the behavior you want if you just 
do this in PostgreSQL:
for obj in Model.objects.all().iterator():  # Note the extra .iterator()
    # handle object here.

What is happening? Django correctly uses cursor.fetchmany(chunk_size) in 
models/sql/compiler.py. The chunk_size is hardcoded to 100. The problem is in 
db/models/query.py, and its __iter__ method. __iter__ will keep 
self._results_cache, and that is where the memory is consumed. Changing that is 
not wise, as in many cases you do want to keep the results around. The 
.iterator() call will skip the __iter__ and directly access the underlying 
generator.
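The difference between the two paths is only which layer holds on to the
results; side by side (User model as in the tests below):

qs = User.objects.all()

for obj in qs:              # QuerySet.__iter__: every instance ends up in the
    pass                    # queryset's result cache and stays in memory

for obj in qs.iterator():   # bypasses the cache: instances are built,
    pass                    # yielded and then become garbage-collectable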

You can also do objects.all()[0:10].iterator() and objects are correctly 
fetched without caching.

Here is a printout from my tests. The memory report is the total process memory 
use:

Code:
i = 0
for obj in User.objects.all()[0:10]:
    i += 1
    if i % 1000 == 0:
        print memory()

25780.0kB
26304.0kB
26836.0kB
27380.0kB
27932.0kB
28468.0kB
29036.0kB
29580.0kB
29836.0kB
30388.0kB

And then:
i = 0
for obj in User.objects.all()[0:10].iterator():
    i += 1
    if i % 1000 == 0:
        print memory()

25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB


This would be worth documenting, with maybe a better named method wrapping the 
.iterator(). I have no ideas for a better name, though.

I would sure like a verification to this test, I am tired and this seems like 
too easy of a fix. Or am I missing the problem?

 - Anssi




Re: queryset caching note in docs

2011-11-02 Thread Ryan McIntosh
I think the discussion actually went a bit sideways.  Is there value in a model 
method to return an iterator which pulls results from a temporary table that 
gets filled from a model query?  This puts the onus on the django-user to use 
the correct method.

Model.foo().bar().buffered() or .from_tmp()

peace,

Ryan

- Original Message -
From: "Marco Paolini" <markopaol...@gmail.com>
To: django-developers@googlegroups.com
Sent: Wednesday, November 2, 2011 12:11:41 PM GMT -06:00 US/Canada Central
Subject: Re: queryset caching note in docs

On 02/11/2011 17:33, Tom Evans wrote:
> On Wed, Nov 2, 2011 at 4:22 PM, Marco Paolini<markopaol...@gmail.com>  wrote:
>> On 02/11/2011 17:12, Tom Evans wrote:
>>> If you do a database query that quickly returns a lot of rows from the
>>> database, and each row returned from the database requires long
>>> processing in django, and you use mysql_use_result, then other mysql
> threads are unable to update any table being used, whereas if you do
>>> the same thing with mysql_store_result, the tables are unlocked as
>>> soon as the client has retrieved all the data from the server.
>>>
>> other connections in other transactions are locked too?
>
> Yes. The exact wording from the C API:
>
> """
> On the other hand, you shouldn't use mysql_use_result() if you are
> doing a lot of processing for each row on the client side, or if the
> output is sent to a screen on which the user may type a ^S (stop
> scroll). This ties up the server and prevents other threads from
> updating any tables from which the data is being fetched.
> """
>
> mysql treats the table as in use until the result is freed.
>
> If this behaviour was in place, then you wouldn't have even raised the
> original query - the approach you were using was to iterate through a
> result set and modify the table you are fetching from. With
> mysql_use_result, you would have deadlocked that table in the mysql
> server as soon as you tried to update it without first completing the
> first query.
Yeah, the discussion has drifted a bit from its starting point

so, summarizing again:
  - mysql supports chunked fetch but will lock the table while fetching is in 
progress (likely causing deadlocks)
  - postgresql does not seem to suffer this issue and chunked fetch seems 
doable (not trivial) using named cursor
  - oracle does chunked fetch already (someone confirm this, please)
  - sqlite3 COULD do chunked fetch by using one connection per cursor 
(otherwise cursors will not be isolated)

Marco




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 17:33, Tom Evans wrote:

On Wed, Nov 2, 2011 at 4:22 PM, Marco Paolini  wrote:

On 02/11/2011 17:12, Tom Evans wrote:

If you do a database query that quickly returns a lot of rows from the
database, and each row returned from the database requires long
processing in django, and you use mysql_use_result, then other mysql
threads are unable to update any table being used, whereas if you do
the same thing with mysql_store_result, the tables are unlocked as
soon as the client has retrieved all the data from the server.


other connections in other transactions are locked too?


Yes. The exact wording from the C API:

"""
On the other hand, you shouldn't use mysql_use_result() if you are
doing a lot of processing for each row on the client side, or if the
output is sent to a screen on which the user may type a ^S (stop
scroll). This ties up the server and prevents other threads from
updating any tables from which the data is being fetched.
"""

mysql treats the table as in use until the result is freed.

If this behaviour was in place, then you wouldn't have even raised the
original query - the approach you were using was to iterate through a
result set and modify the table you are fetching from. With
mysql_use_result, you would have deadlocked that table in the mysql
server as soon as you tried to update it without first completing the
first query.

Yeah, the discussion has drifted a bit from its starting point

so, summarizing again:
 - mysql supports chunked fetch but will lock the table while fetching is in 
progress (likely causing deadlocks)
 - postgresql does not seem to suffer this issue and chunked fetch seems doable 
(not trivial) using named cursor
 - oracle does chunked fetch already (someone confirm this, please)
 - sqlite3 COULD do chunked fetch by using one connection per cursor (otherwise 
cursors will not be isolated)

Marco




Re: queryset caching note in docs

2011-11-02 Thread Tom Evans
On Wed, Nov 2, 2011 at 4:22 PM, Marco Paolini  wrote:
> On 02/11/2011 17:12, Tom Evans wrote:
>> If you do a database query that quickly returns a lot of rows from the
>> database, and each row returned from the database requires long
>> processing in django, and you use mysql_use_result, then other mysql
>> threads are unable to update any table being used, whereas if you do
>> the same thing with mysql_store_result, the tables are unlocked as
>> soon as the client has retrieved all the data from the server.
>>
> other connections in other transactions are locked too?

Yes. The exact wording from the C API:

"""
On the other hand, you shouldn't use mysql_use_result() if you are
doing a lot of processing for each row on the client side, or if the
output is sent to a screen on which the user may type a ^S (stop
scroll). This ties up the server and prevents other threads from
updating any tables from which the data is being fetched.
"""

mysql treats the table as in use until the result is freed.

If this behaviour was in place, then you wouldn't have even raised the
original query - the approach you were using was to iterate through a
result set and modify the table you are fetching from. With
mysql_use_result, you would have deadlocked that table in the mysql
server as soon as you tried to update it without first completing the
first query.

Cheers

Tom




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 17:12, Tom Evans wrote:

On Wed, Nov 2, 2011 at 11:28 AM, Marco Paolini  wrote:

mysql can do chunked row fetching from server, but only one row at a time

curs = connection.cursor(CursorUseResultMixIn)
curs.fetchmany(100) # fetches 100 rows, one by one

Marco



The downsides to mysql_use_result over mysql_store_result are that the
mysql thread is locked and unavailable to do anything until the query
is completed and mysql_free_result has been called.

If you do a database query that quickly returns a lot of rows from the
database, and each row returned from the database requires long
processing in django, and you use mysql_use_result, then other mysql
threads are unable to update any table being used, whereas if you do
the same thing with mysql_store_result, the tables are unlocked as
soon as the client has retrieved all the data from the server.


other connections in other transactions are locked too?

In other words, you trade off memory usage against scalability. If
some part of the ORM was reworked to use mysql_use_result, then we
would need to add appropriate docs to explain the dangers of this
approach.

yes indeed.

Scalability is also affected by Python thread memory consumption;
it all depends on how big the queryset being fetched is and how often
that queryset is iterated

if you fetch a huge queryset in one chunk and python eats up say 1G
of heap, that's not going to scale well either.

caveats should be clearly documented for both approaches, I think

Marco




Re: queryset caching note in docs

2011-11-02 Thread Tom Evans
On Wed, Nov 2, 2011 at 11:28 AM, Marco Paolini  wrote:
> mysql can do chunked row fetching from server, but only one row at a time
>
> curs = connection.cursor(CursorUseResultMixIn)
> curs.fetchmany(100) # fetches 100 rows, one by one
>
> Marco
>

The downsides to mysql_use_result over mysql_store_result are that the
mysql thread is locked and unavailable to do anything until the query
is completed and mysql_free_result has been called.

If you do a database query that quickly returns a lot of rows from the
database, and each row returned from the database requires long
processing in django, and you use mysql_use_result, then other mysql
threads are unable to update any table being used, whereas if you do
the same thing with mysql_store_result, the tables are unlocked as
soon as the client has retrieved all the data from the server.

In other words, you trade off memory usage against scalability. If
some part of the ORM was reworked to use mysql_use_result, then we
would need to add appropriate docs to explain the dangers of this
approach.

Cheers

Tom




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 15:18, Anssi Kääriäinen wrote:

On 11/02/2011 01:36 PM, Marco Paolini wrote:

maybe we could implement something like:

for obj in qs.all().chunked(100):
    pass

.chunked() will automatically issue LIMITed SELECTs

that should work with all backends

I don't think that will be a performance improvement - this will get rid of the 
memory overhead in Django, but would lead to a lot of overhead in the DB. 
Assuming you are fetching 10000 objects from the DB, you would issue these 
commands to the DB (I hope I got the idea correctly):

SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 0

SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 100
...
SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 9900


For each query the DB will need to do:
- query parse & plan
- If the order is not indexed, a top N sort.
- Fetch the items (even in the indexed case you will need to traverse the index for
the OFFSET, which is not free at all).

So, for the last fetch the DB would need to traverse the first 9900 items in the
index (or worse, do a top 10000 sort) and then return the 100 items wanted.
This is going to be expensive in the DB. The trade-off of saving some memory
on the Django side at the expense of doing a lot more work on the DB side is not a
good one. DB resources are in general much harder to scale than Django
resources.

You are going to do in total 10000 + 9900 + 9800 + ... + 100 index traversals in
the DB, which comes to somewhere around 0.5 million items traversed in the
index. In addition, you will do 100 parse + plan stages. You really don't want
to do that. And if there are concurrent updates to the items, it might
be that you will miss some objects and see some objects as duplicates.


Yes, that's right,

qs.chunked() can be easily implemented this way:

i = 0
while True:
    chunk = list(qs[i:i + CHUNK_SIZE])
    if not chunk:
        break
    for obj in chunk:
        pass  # process obj here
    i += CHUNK_SIZE

we should find another way to avoid this memory-hungriness issue for huge
querysets

or at least we should document the issue ;)

currently oracle is the only backend that DOES chunked row fetching; all the
others, for different reasons, load all rows into memory

Marco




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 14:36, Ian Kelly wrote:

On Wed, Nov 2, 2011 at 5:05 AM, Anssi Kääriäinen
  wrote:

For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
do currently?


If I'm following the thread correctly, the oracle backend already does
chunked reads.  The default chunk size is 100 rows, IIRC.


yes, in Oracle it looks like rows are ALWAYS [1] fetched from the server in 
chunks when using fetchmany()

[1] http://cx-oracle.sourceforge.net/html/cursor.html#Cursor.arraysize
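The driver setting referred to in [1] is just the cursor's arraysize attribute;
a small sketch with cx_Oracle (credentials and table are hypothetical):

import cx_Oracle

conn = cx_Oracle.connect("user/password@dsn")     # hypothetical credentials
cur = conn.cursor()
cur.arraysize = 100                               # rows fetched from the server per round trip
cur.execute("SELECT id, name FROM some_table")    # hypothetical table
rows = cur.fetchmany(100)                         # at most one round trip with the arraysize above
cur.close()
conn.close()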

summarizing (real) chunked fetch capability by backend:

backend  | currently | supported | how to implement
sqlite   | N         | Y (1)     | using one connection per cursor + shared cache mode
postgres | N         | Y (2)     | using named cursors
mysql    | N         | Y (3)     | using custom cursor class
oracle   | Y         | Y         | default behavior

(1) in sqlite cursors are not isolated, so we can't effectively use chunked 
fetchmany
(2) postgres supports chunks when using named cursors
(3) mysql supports only single-row chunks




Re: queryset caching note in docs

2011-11-02 Thread Anssi Kääriäinen

On 11/02/2011 01:36 PM, Marco Paolini wrote:

maybe we could implement something like:

for obj in qs.all().chunked(100):
   pass

.chunked() will automatically issue LIMITed SELECTs

that should work with all backends
I don't think that will be a performance improvement - this will get rid 
of the memory overhead in Django, but would lead to a lot of overhead in 
the DB. Assuming you are fetching 10000 objects from the DB, you would 
issue these commands to the DB (I hope I got the idea correctly):


SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 0

SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 100
...
SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 9900


For each query the DB will need to do:
  - query parse & plan
  - If the order is not indexed, a top N sort.
  - Fetch the items (even in the indexed case you will need to traverse the
index for the OFFSET, which is not free at all).


So, for the last fetch the DB would need to traverse the first 9900 items
in the index (or worse, do a top 10000 sort) and then return the 100
items wanted. This is going to be expensive in the DB. The trade-off of
saving some memory on the Django side at the expense of doing a lot more work
on the DB side is not a good one. DB resources are in general much harder
to scale than Django resources.


You are going to do in total 10000 + 9900 + 9800 + ... + 100 index
traversals in the DB, which comes to somewhere around 0.5 million items
traversed in the index. In addition, you will do 100 parse + plan stages.
You really don't want to do that. And if there are concurrent
updates to the items, it might be that you will miss some objects and see
some objects as duplicates.
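The ~0.5 million figure is just the sum of the per-query scans; a quick check:

# Query i scans roughly OFFSET + LIMIT = 100 * i rows, for i = 1..100.
total = sum(100 * i for i in range(1, 101))
print total   # 505000, i.e. about 0.5 million index entries traversed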


 - Anssi




Re: queryset caching note in docs

2011-11-02 Thread Ian Kelly
On Wed, Nov 2, 2011 at 5:05 AM, Anssi Kääriäinen
 wrote:
> For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
> do currently?

If I'm following the thread correctly, the oracle backend already does
chunked reads.  The default chunk size is 100 rows, IIRC.




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 12:05, Anssi Kääriäinen wrote:

On 11/02/2011 12:47 PM, Marco Paolini wrote:

if that option is true, sqlite should open one connection per cursor
and psycopg2 should use named cursors


The sqlite behavior leads to some problems with transaction management -
different connections, different transactions (or is there some sort of
"shared transaction" in sqlite?). I would just fetch all the data in one go
when using sqlite.

Yes, there is a "shared cache mode" that makes all connections share the
same transaction, but this API is not exposed to Python by sqlite3.




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 12:05, Anssi Kääriäinen wrote:

On 11/02/2011 12:47 PM, Marco Paolini wrote:

if that option is true, sqlite should open one connection per cursor
and psycopg2 should use named cursors


The sqlite behavior leads to some problems with transaction management -
different connections, different transactions (or is there some sort of "shared
transaction" in sqlite?). I would just fetch all the data in one go when using
sqlite. I wouldn't worry about performance problems when using sqlite, it is
meant mostly for testing when using Django.


This will cause some overhead for small querysets but will save some memory
for huge ones

For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
do currently?


maybe we could implement something like:

for obj in qs.all().chunked(100):
 pass

.chunked() will automatically issue LIMITed SELECTs

that should work with all backends

this could be a no-op for sqlite, where
cursors are not isolated and you get really bad performance for LIMITed SELECTs

Marco




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 12:05, Anssi Kääriäinen wrote:

On 11/02/2011 12:47 PM, Marco Paolini wrote:

if that option is true, sqlite should open one connection per cursor
and psycopg2 should use named cursors


The sqlite behavior leads to some problems with transaction management -
different connections, different transactions (or is there some sort of "shared
transaction" in sqlite?). I would just fetch all the data in one go when using
sqlite. I wouldn't worry about performance problems when using sqlite, it is
meant mostly for testing when using Django.


This will cause some overhead for small querysets but will save some memory
for huge ones

For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
do currently?


mysql can do chunked row fetching from server, but only one row at a time

curs = connection.cursor(CursorUseResultMixIn)
curs.fetchmany(100) # fetches 100 rows, one by one
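At the Python level this corresponds to MySQLdb's server-side cursor class; a
minimal sketch (connection parameters and table are hypothetical), with the
locking caveats discussed elsewhere in the thread:

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(db='test', user='user', passwd='secret',   # hypothetical credentials
                       cursorclass=MySQLdb.cursors.SSCursor)      # mysql_use_result under the hood
cur = conn.cursor()
cur.execute("SELECT id, name FROM entry")   # hypothetical table
while True:
    rows = cur.fetchmany(100)   # rows are pulled from the server one at a time
    if not rows:
        break
    # process rows here
cur.close()
conn.close()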

Marco




Re: queryset caching note in docs

2011-11-02 Thread Anssi Kääriäinen

On 11/02/2011 12:47 PM, Marco Paolini wrote:

if that option is true, sqlite should open one connection per cursor
and psycopg2 should use named cursors


The sqlite behavior leads to some problems with transaction management -
different connections, different transactions (or is there some sort of
"shared transaction" in sqlite?). I would just fetch all the data in one go
when using sqlite. I wouldn't worry about performance problems when using
sqlite, it is meant mostly for testing when using Django.


This will cause some overhead for small querysets but will save some memory
for huge ones

For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
do currently?

 - Anssi




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 10:10, Luke Plant wrote:

On 02/11/11 08:48, Marco Paolini wrote:


thanks for pointing that out to me; do you see this as an issue to be fixed?

If there is some interest, I might give it a try.

Maybe it's not fixable, at least I can investigate a bit


Apparently, the protocol between the Postgres client and server only
does partial sends when using named cursors, which Django doesn't use.
Using named cursors with psycopg2 is certainly possible, but probably
not trivial. That's as much as I know.

Source:

http://thebuild.com/blog/2011/07/26/unbreaking-your-django-application/

The author of that page did say that he was working on a patch.

ok, I'll try to contact the author

talking about sqlite3, it looks like the only way to isolate two cursors
is to use two different connections.

Let's imagine there is a way to implement this (I'm not sure at this point)

We could have an option somewhere that tells django whether or not to use
chunked cursor reads from the db.

if that option is true, sqlite should open one connection per cursor
and psycopg2 should use named cursors

This will cause some overhead for small querysets but will save some memory
for huge ones

cheers,

Marco




Re: queryset caching note in docs

2011-11-02 Thread Luke Plant
On 02/11/11 08:48, Marco Paolini wrote:

> thanks for pointing that out to me; do you see this as an issue to be fixed?
> 
> If there is some interest, I might give it a try.
> 
> Maybe it's not fixable, at least I can investigate a bit

Apparently, the protocol between the Postgres client and server only
does partial sends when using named cursors, which Django doesn't use.
Using named cursors with psycopg2 is certainly possible, but probably
not trivial. That's as much as I know.

Source:

http://thebuild.com/blog/2011/07/26/unbreaking-your-django-application/

The author of that page did say that he was working on a patch.

Luke

-- 
"The number you have dialled is imaginary.  Please rotate your
telephone by 90 degrees and try again."

Luke Plant || http://lukeplant.me.uk/




Re: queryset caching note in docs

2011-11-02 Thread Marco Paolini

On 02/11/2011 09:43, Luke Plant wrote:

On 02/11/11 00:41, Marco Paolini wrote:


so if you do this:

for obj in Entry.objects.all():
  pass

django does this:
  - creates a cursor
  - then calls fetchmany(100) until ALL rows are fetched
  - creates a list containing ALL fetched rows
  - passes this list to queryset instance for lazy model instance creation

I didn't know that. (maybe we should document it somewhere...)

Now that I do, I also know it's time to move to postgresql...


And you will then find that the behaviour of the psycopg2 adapter means
that you get very similar behaviour - all rows are fetched as soon as
you start iterating - even if you do .iterator().

thanks for pointing that out to me; do you see this as an issue to be fixed?

If there is some interest, I might give it a try.

Maybe it's not fixable, at least I can investigate a bit

Cheers,

Marco




Re: queryset caching note in docs

2011-11-02 Thread Luke Plant
On 02/11/11 00:41, Marco Paolini wrote:

> so if you do this:
> 
> for obj in Entry.objects.all():
>  pass
> 
> django does this:
>  - creates a cursor
>  - then calls fetchmany(100) until ALL rows are fetched
>  - creates a list containing ALL fetched rows
>  - passes this list to queryset instance for lazy model instance creation
> 
> I didn't know that. (maybe we should document it somewhere...)
> 
> Now that I do, I also know it's time to move to postgresql...

And you will then find that the behaviour of the psycopg2 adapter means
that you get very similar behaviour - all rows are fetched as soon as
you start iterating - even if you do .iterator().

Luke

-- 
"The number you have dialled is imaginary.  Please rotate your
telephone by 90 degrees and try again."

Luke Plant || http://lukeplant.me.uk/




Re: queryset caching note in docs

2011-11-01 Thread Marco Paolini

On 28/10/2011 15:55, Tom Evans wrote:

On Fri, Oct 28, 2011 at 1:05 PM, Marco Paolini  wrote:

it's a bit more complex: there are basically two phases:
  1) row fetching from db using cursor.fetchmany()
  2) model instance creation in queryset

both are delayed as much as possible (queryset lazyness)

phase two (model instance creation) depends on how you evaluate the
queryset:
  - testing for truth will only create enough instances to fill the cache
  - calling .exists() will try to fetch a single object and leave the cache
alone
  - iterating will create instances in batches
  - .iterator() will create the instances one at a time (but on the same db
cursor, using cursor.fetchmany())
  - many other options...



You are probably more informed about the internals than me, but isn't
it irrelevant about when the instances are created; they are created
from the same query, even if the instances are not instantiated until
iterated to. Any changes to the database that happened after that
query is executed will not be reflected in the instances that are
created from that query.

IE the critical point is when the query is executed, not when the data
is fetched from the db cursor.

Yep, you're right

while digging into django and sqlite3 backend internals I discovered:

sqlite cursors from the same connection are NOT isolated:


curs1 = conn.cursor()
curs2 = conn.cursor()
curs1.execute('select * from entry')
curs1.fetchmany(100)

[(1, 'hey'), (2, 'hey',)...

curs2.execute('update entry set name =\'\'')
curs1.fetchmany(100)

[(100, ''), (101, '',)...
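A self-contained version of that experiment with the stdlib sqlite3 module
(in-memory database, made-up data) should show the same effect:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table entry (id integer primary key, name text)')
conn.executemany('insert into entry (name) values (?)', [('hey',)] * 300)
conn.commit()

curs1 = conn.cursor()
curs2 = conn.cursor()
curs1.execute('select id, name from entry')
print curs1.fetchmany(3)    # [(1, u'hey'), (2, u'hey'), (3, u'hey')]

curs2.execute("update entry set name = ''")
print curs1.fetchmany(3)    # rows not yet fetched typically show the updated
                            # value: cursors on one connection are not isolated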

to overcome this issue django fetches all cursor rows into memory in a single
list while iterating
a resultset (see django/db/models/sql/compiler.py, SQLCompiler.execute_sql)

so if you do this:

for obj in Entry.objects.all():
 pass

django does this:
 - creates a cursor
 - then calls fetchmany(100) until ALL rows are fetched
 - creates a list containing ALL fetched rows
 - passes this list to queryset instance for lazy model instance creation

I didn't know that. (maybe we should document it somewhere...)

Now that I do, I also know it's time to move to postgresql...

cheers,

Marco




Re: queryset caching note in docs

2011-10-28 Thread Tom Evans
On Fri, Oct 28, 2011 at 1:05 PM, Marco Paolini  wrote:
> it's a bit more complex: there are basically two phases:
>  1) row fetching from db using cursor.fetchmany()
>  2) model instance creation in queryset
>
> both are delayed as much as possible (queryset lazyness)
>
> phase two (model instance creation) depends on how you evaluate the
> queryset:
>  - testing for truth will only create enough instances to fill the cache
>  - calling .exists() will try to fetch a single object and leave the cache
> alone
>  - iterating will create instances in batches
>  - .iterator() will create the instances one at a time (but on the same db
> cursor, using cursor.fetchmany())
>  - many other options...
>

You are probably more informed about the internals than me, but isn't
it irrelevant about when the instances are created; they are created
from the same query, even if the instances are not instantiated until
iterated to. Any changes to the database that happened after that
query is executed will not be reflected in the instances that are
created from that query.

IE the critical point is when the query is executed, not when the data
is fetched from the db cursor.

Cheers

Tom
