Seems one of my roles in life is to find all the oddities in Python.
Hints from truly experienced Pythonistas welcome.
I am finding that after I do

    fh.seek(pos)
    buf = fh.read(pagesz)
    match = regexobj.search(buf)

the next fh.seek() always lands _at least_ at the end of the match.
I can no longer fh.seek(pos) or fh.seek(pos+1): the call succeeds,
but the next read() _always_ starts at the end of the last match, even
though we never seek to that particular point.
Is this expected? Known? Normal when reading the docs under the
influence of powerful drugs?
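
To be concrete about what I would expect, here is a minimal sketch of that
sequence. It is written in Python 3 syntax and opens the file in binary
mode; check_reseek, read_at and the sample data are made up for
illustration and are not part of the attached script. As far as I
understand it, searching the buffer only looks at the in-memory string and
should not move the file position, so tell() after the re-seek should
report pos + 1.

```python
import os
import re
import tempfile

def read_at(fh, pos, size):
    """Seek to an absolute offset and read up to `size` bytes from there."""
    fh.seek(pos)
    return fh.read(size)

def check_reseek(path, pos, pagesz, pattern):
    """Seek, read, search, then re-seek to pos + 1 and read again.

    Returns (first_buf, second_buf, offset_reported_after_reseek).
    """
    rx = re.compile(pattern)
    with open(path, 'rb') as fh:
        first = read_at(fh, pos, pagesz)
        rx.search(first)          # operates on the buffer, not on fh
        fh.seek(pos + 1)
        where = fh.tell()
        second = fh.read(pagesz)
    return first, second, where

# Throwaway sample file: padding, one key/value pair, more padding.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'x' * 100 + b'"abc":"ABC"' + b'y' * 100)
    sample_path = tmp.name

first, second, where = check_reseek(sample_path, 90, 64, b'"abc":"')
os.unlink(sample_path)
```

If the second buffer does not start one byte after the first, the file
object really is misbehaving; if it does, the offsets being passed to
seek() are the next thing to check.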
Sample script is attached for the truly curious. Try it on a large
CJSON file: pass a file of dict keys to search for (one per line) as the
first argument and the CJSON file as the second.
Once you get past the first 'page', and the code has to re-seek after
it has found the match, you'll see the problem.
I suspect it's related to Python possibly having the file mmap()ed
behind the scenes, without telling me.
(What I am writing is actually a grep over very large files, for cases
where Python's mmap is not available, for instance in our wonky
initrd.)
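
The general idea of such a page-by-page grep can be sketched as below.
This is only an illustration under my own assumptions: find_in_chunks is a
hypothetical name, it uses bytes.find() instead of a compiled regex to
stay short, and it is written for Python 3 binary file objects. The part
that is easy to get wrong is the bookkeeping: once the tail of the
previous page is prepended, offsets inside the buffer are no longer
offsets inside the current page.

```python
import io

def find_in_chunks(fh, needle, page=4096):
    """Return the absolute offset of the first occurrence of `needle`
    (bytes) in the binary file object `fh`, or -1 if it is absent."""
    keep = len(needle) - 1   # longest needle prefix that can straddle a boundary
    tail = b''
    base = 0                 # absolute offset of the chunk about to be read
    fh.seek(0)
    while True:
        chunk = fh.read(page)
        if not chunk:        # EOF
            return -1
        buf = tail + chunk
        hit = buf.find(needle)
        if hit != -1:
            # buf begins len(tail) bytes before the chunk does
            return base - len(tail) + hit
        tail = buf[-keep:] if keep > 0 else b''
        base += len(chunk)

# The needle straddles the 4096-byte page boundary on purpose.
data = b'x' * 4090 + b'"abc":"ABC"' + b'y' * 5000
offset = find_in_chunks(io.BytesIO(data), b'"abc":"')
missing = find_in_chunks(io.BytesIO(data), b'nope')
```

Keeping len(needle) - 1 bytes of tail is enough to catch a needle split
across two reads without ever matching the same occurrence twice.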
cheers,
m
--
martin.langh...@gmail.com
mar...@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
#!/usr/bin/python
import re

def grep_for_lease_mmap(fpath, sn):
    """Search a potentially larger-than-mem cjson file for
    something that looks like a lease or a series of leases.

    Uses mmap.

    returns a string or False
    """
    import mmap
    fh = open(fpath, 'r+')
    m = mmap.mmap(fh.fileno(), 0)
    # find the start of it
    rx = re.compile('"'+sn+'":"')
    objkey = rx.search(m)
    if objkey:
        # find the tail - the first non-escaped
        # doublequotes. This relies on sigs not
        # having escape chars themselves.
        # TODO: Negative look-behind assertion to handle
        # escaped values.
        rx = re.compile('"')
        objend = rx.search(m, objkey.end())
    if objkey and objend:
        found = m[objkey.end():objend.start()-1]
    else:
        found = False
    m.close()
    fh.close()
    return found

def grep_for_lease(fpath, sn):
    """Search a potentially larger-than-mem cjson file for
    something that looks like a lease or a series of leases.

    Uses old read()s

    returns a string or False
    """
    # Use read()s, but keep stuff aligned to 4KB pages
    # so we stand a chance to hit the fast paths.
    page = 4096 #* 1024
    step = 0
    cursor = 0
    needlerx = re.compile('"'+sn+'":"')
    needlelength = len(sn) + 2
    fh = open(fpath, 'r+')
    buf = ''
    buftail = ''
    while True:
        buf = fh.read(page)
        if (buf == ''): # EOF
            break
        buf = buftail + buf
        objkey = needlerx.search(buf)
        if objkey:
            # found the needle - issue a read
            # from here and break
            # -- we rewind 1 char so the rx includes
            # -- the opening quote
            fh.seek( page * step + objkey.start()-1 )
            buf = fh.read(page)
            # re-search for objkey - to get the offsets right
            objkey = needlerx.search(buf)
            break
        # prep for next read - keep tail
        # in case needle is on the boundary
        buftail = buf[-needlelength:]
        step = step+1
        fh.seek( page * step )
        # parenthesised: "%" and "*" bind equally, left to right
        print "[ Seek to %s ]" % (page * step)
    if objkey:
        # find the tail - the first non-escaped
        # doublequotes. This relies on sigs not
        # having escape chars themselves.
        # TODO: Negative look-behind assertion to handle
        # escaped values.
        rx = re.compile('"')
        objend = rx.search(buf, objkey.end())
    if objkey and objend:
        found = buf[objkey.end():objend.start()]
    else:
        found = False
    fh.close()
    return found

import sys
fh = file(sys.argv[1])
bigdata = {}
lines = fh.readlines()
for k in lines:
    k = k.strip()
    print "Looking for %s" % k
    found = grep_for_lease(sys.argv[2], k)
    if found:
        if found == k.swapcase():
            print "... found good match"
        else:
            print "BAD MATCH %s" % found
    else:
        print "NO MATCH"

#found = grep_for_lease('/media/soas/big.json', 'CSN7470319B')
#
#if found:
#    print "Found: " + found
#else:
#    print 'not found'
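
For the TODO about escaped values, a negative look-behind could look
roughly like the sketch below. The names UNESCAPED_QUOTE and value_after
are hypothetical and the sample string is invented; it is written for
Python 3. It only checks the single character before the quote, so a
value ending in an escaped backslash (\\ followed by the closing quote)
would still be mishandled.

```python
import re

# A double quote that is not immediately preceded by a backslash.
UNESCAPED_QUOTE = re.compile(r'(?<!\\)"')

def value_after(buf, start):
    """Return buf[start:end], where end is the first unescaped double
    quote at or after `start`, or None if no such quote exists."""
    m = UNESCAPED_QUOTE.search(buf, start)
    if m is None:
        return None
    return buf[start:m.start()]

# Invented sample with escaped quotes inside the first value.
sample = r'"abc":"he said \"hi\" twice","def":"x"'
value = value_after(sample, sample.index(':"') + 2)
```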