This probably merits the Zope Dev list too, since this is part of "core zope"...

cheers,

Chris

FAIRS, Dan, GBM wrote:
Hi,

Under load, we're seeing KeyErrors from the _cached_result method of the DA
class in the code used to clear out cached results. We're using Python
2.3.5, Zope 2.8.5, mxODBCZopeDA-1.0.8 running on RHEL3 connecting to a
remote Microsoft SQL Server 2000 database.

Note this is a different issue to Collector #1745.

Here's the salient part of the traceback:

  File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line
118, in _getDBValue
    values = self._getDBValues(attributes=[attribute])
  File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line
96, in _getDBValues
    , userid=login_elements[1]
  File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line
453, in __call__
    result=self._cached_result(DB__, (query, self.max_rows_))
  File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line
375, in _cached_result
    if int(cache[q][0]) == key:
KeyError: ("execute dbo.EmployeeByNT 'DOMAIN', 'user'\n", 1000, '\nDBConnId:
None')

The ZSQL Method EmployeeByNT looks like this (it's actually an FSZSQL method
in a skin):

<dtml-comment>
connection_id: DB_Connection
arguments:domain userid
cache_time: 60
max_cache: 50000
</dtml-comment>
execute dbo.EmployeeByNT <dtml-sqlvar domain type="string">, <dtml-sqlvar
userid type="string">

We do have other ZSQL methods, but this one is by far the most heavily used.


Here's (most of) the offending method, _cached_result (I've snipped some
code which just appends the db connection id to the query. Note also that
Bucket is also defined as lambda: {} at the top of DA.py):

        # Try to fetch from cache
        if hasattr(self,'_v_cache'): cache=self._v_cache
        else: cache=self._v_cache={}, Bucket()
        cache, tcache = cache
        max_cache=self.max_cache_
        now=time()
        t=now-self.cache_time_
        if len(cache) > max_cache / 2:
            keys=tcache.keys()
            keys.reverse()
            while keys and (len(keys) > max_cache or keys[-1] < t):
                key=keys[-1]
                q=tcache[key]
                del tcache[key]
                if int(cache[q][0]) == key:
                    del cache[q]
                del keys[-1]

        if cache.has_key(query):
            k, r = cache[query]
            if k > t: return r

        # call the pure query
        result=apply(DB__.query, pure_query)
        if self.cache_time_ > 0:
            tcache[int(now)]=query
            cache[query]= now, result

        return result

In this code, 'tcache' is a dict which maps integers representing the time
to queries (the result of calling int(t) where t is the floating-point
result from time.time()) , and 'cache' maps queries to a 2-tuple consisting
of the floating-point timestamp of when the method was called, and the
result of the query.

I've been trying to figure out a circumstance which could cause the KeyError
we're seeing in production, without success.
Here are some observations which may give some clues:

The result of time.time() is converted to an integer and used as a key into
tcache. AFAICS, this means that only 1 entry per second will be stored in
tcache, even though every query will be stored in cache. It also looks like
cache could be the source of a memory leak.

For example:
- 'SELECT a FROM b' and 'SELECT c FROM b' run in the same 1-second window - 'SELECT c FROM b' will end up stored in tcache. - Both queries end up in cache. - 'SELECT a FROM b' will only ever be removed from cache if 'SELECT a FROM
b' is executed once more later that second - and then 'SELECT c FROM b' will
remain.

We only see the KeyError after 'some time' in the application. Presumably
this is the time it takes for the caches to fill and start actively removing
entries as per the while loop. (We restarted the cluster last night, and
have yet to see the errors return).

The SQL query we are running maps an NT credential (domain and userid) to an
internal user id from our corporate database. To give an idea of the usage
pattern of this ZSQL method, it gets run at least once per request, with
application serving approximately 150,000 - 180,000 requests per day, with
12,000 - 14,000 distinct users (and hence different arguments to the ZSQL
method).

I'm going to put some logging code around the line that causes the KeyError
next time we do a release. However - has anyone seen this before? Can anyone
see what might be going wrong here?

Cheers,
Dan

--
Simplistix - Content Management, Zope & Python Consulting
           - http://www.simplistix.co.uk

_______________________________________________
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope )

Reply via email to