Re: search speed
On Jan 30, 3:49 am, Diez B. Roggisch de...@nospam.web.de wrote:
> alex23 gave you a set of tools that you can use for full-text search.
> However, that's not necessarily the best thing to do if things have a
> record-like structure.

In Nucular (and others, I think) you can do searches for terms anywhere (full text), searches for terms within fields, searches for prefixes in fields, searches based on field inequality, or searches for exact field values. I would argue this subsumes the standard fielded approach.

  -- Aaron Watters

===
Oh, I'm a lumberjack and I'm O.K...
--
http://mail.python.org/mailman/listinfo/python-list
Re: search speed
Thanks everyone that spent time helping me, the help was great.

Best regards
Anders
Re: search speed
2009/1/30 Scott David Daniels scott.dani...@acm.org:
> Be careful with your assertion that a regex is faster, it is certainly
> not always true.

I was careful *not* to assert that a regex would be faster, merely that it was *likely* to be in this case.

--
Tim Rowe
Re: search speed
anders schrieb:
> Hi!
> I have written a Python program that searches for a specific customer
> in files (around 1000 files); the trigger is LF01 + CUSTOMERNO.
> So I read all the files with dircache, then I loop through all the
> files; each file is read with readlines() and after that scanned.
> Today this works fine, it saves me a lot of manual work, but a search
> takes around 5 min, so my question is: is there another way of
> searching in a file (today I step line by line and check)?
> What I'd like to find is just the filenames for the files with the
> customer data in them; there can be, and often is, more than one.
> English is not my first language and I hope someone understands my
> beginner question. What I am looking for is something like
>
>   if file.findInFile("LF01"):
>       ...
>
> Is there any library like this?

No. Because nobody can automagically infer whatever structure your files have.

alex23 gave you a set of tools that you can use for full-text search. However, that's not necessarily the best thing to do if things have a record-like structure. The canonical answer to this is then to use a database to hold the data, instead of flat files. So if you have any chance to do that, you should try to stuff things in there.

Diez
Re: search speed
On Fri, Jan 30, 2009 at 1:51 AM, anders anders.u.pers...@gmail.com wrote:
> Hi!
> I have written a Python program that searches for a specific customer
> in files (around 1000 files); the trigger is LF01 + CUSTOMERNO.
> So I read all the files with dircache, then I loop through all the
> files; each file is read with readlines() and after that scanned.
> Today this works fine, it saves me a lot of manual work, but a search
> takes around 5 min, so my question is: is there another way of
> searching in a file (today I step line by line and check)?

Do you require this information in a Python application? It seems like you did this manually before. If not, then Python is the wrong tool for this job; you can simply use this command in a Unix-like environment (install cygwin if you are on Windows):

  $ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01" {} \; | cut -d ":" -f 1 | sort | uniq

Now if you do require this information inside a Python app, I would just do the above in Python:

  from subprocess import Popen, PIPE

  filenames = []
  searchCmd = ('find path_to_dirs_containing_files -name "*" '
               '-exec grep -nH "LF01" {} \\; '
               '| cut -d ":" -f 1 | sort | uniq')
  searchp = Popen(searchCmd, shell=True, bufsize=4096, stdout=PIPE)
  for line in searchp.stdout:
      filenames.append(line.strip())

That's my advice anyway; I guess you can try some search libraries, I don't know of any myself though. The above will probably be faster than anything else.

Cheers and good luck.
Re: search speed
On Fri, 30 Jan 2009 15:46:33 +0200
Justin Wyer justinw...@gmail.com wrote:
> $ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01"
>   {} \; | cut -d ":" -f 1 | sort | uniq

I know this isn't a Unix group but please allow me to suggest instead:

  $ grep -lR "LF01" path_to_dirs_containing_files

--
D'Arcy J.M. Cain da...@druid.net         |  Democracy is three wolves
http://www.druid.net/darcy/              |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP) |  what's for dinner.
Re: search speed
2009/1/30 Diez B. Roggisch de...@nospam.web.de:
> No. Because nobody can automagically infer whatever structure your
> files have.

Just so. But even without going to a full database solution it might be possible to make use of the flat-file structure. For example, does the "LF01" have to appear at a specific position in the input line? If so, there's no need to search for it in the complete line. *If* there is any such structure then a compiled regexp search is likely to be faster than just 'if "LF01" in line', and (provided it's properly designed) provides a bit of extra insurance against false positives.

--
Tim Rowe
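Tim Rowe's point can be made concrete with a small sketch. The sample lines and the exact pattern below are invented for illustration; the assumption (suggested but not confirmed in the thread) is that a real "LF01" trigger starts the line and is followed by the customer number:

```python
import re

# Hypothetical record lines; "LF01" mid-line is payload, not a trigger.
lines = [
    "LF01 12345 some customer record",
    "XX99 LF01 appearing mid-line is payload, not a trigger",
    "LF01 67890 another record",
]

# Naive substring test: matches "LF01" anywhere, false positives included.
naive_hits = [ln for ln in lines if "LF01" in ln]

# Anchored, compiled pattern: matches only a real leading trigger and
# captures the customer number at the same time.
trigger = re.compile(r"^LF01 (\d+)")
anchored_hits = []
for ln in lines:
    m = trigger.match(ln)
    if m:
        anchored_hits.append(m.group(1))

print(len(naive_hits))    # 3 -- includes the mid-line false positive
print(anchored_hits)      # ['12345', '67890']
```

Whether the anchored regex is actually *faster* than the `in` test still has to be measured on real data; what it certainly buys is the false-positive insurance.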
Re: search speed
Tim Rowe wrote:
> But even without going to a full database solution it might be
> possible to make use of the flat-file structure. For example, does
> the "LF01" have to appear at a specific position in the input line?
> If so, there's no need to search for it in the complete line. *If*
> there is any such structure then a compiled regexp search is likely
> to be faster than just 'if "LF01" in line', and (provided it's
> properly designed) provides a bit of extra insurance against false
> positives.

Clearly this is someone who regularly uses grep or perl. If you know the structure, like the position in a line, something like the following should be fast:

  with open(somename) as source:
      for n, line in enumerate(source):
          if n % 5 == 3 and line[5 : 9] == 'LF01':
              print('Found on line %s: %s' % (1 + n, line.rstrip()))

Be careful with your assertion that a regex is faster; it is certainly not always true. Measure speed, don't take mantras as gospel.

--Scott David Daniels
scott.dani...@acm.org
Re: search speed
D'Arcy J.M. Cain darcy at druid.net writes:
> On Fri, 30 Jan 2009 15:46:33 +0200
> Justin Wyer justinwyer at gmail.com wrote:
>> $ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01"
>>   {} \; | cut -d ":" -f 1 | sort | uniq
>
> I know this isn't a Unix group but please allow me to suggest instead:
>
>   $ grep -lR "LF01" path_to_dirs_containing_files

... and if the OP is on Windows: an alternative to cygwin is the GnuWin32 collection of GNU utilities ported to Windows. See http://gnuwin32.sourceforge.net/ ... you'll want the Grep package, but I'd suggest the CoreUtils package as worth a detailed look, and do scan through the whole list of packages while you're there.

HTH,
John
Re: search speed
D'Arcy J.M. Cain wrote:
> On Fri, 30 Jan 2009 15:46:33 +0200
> Justin Wyer justinw...@gmail.com wrote:
>> $ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01"
>>   {} \; | cut -d ":" -f 1 | sort | uniq
>
> I know this isn't a Unix group but please allow me to suggest instead:
>
>   $ grep -lR "LF01" path_to_dirs_containing_files

That's very good advice. I had to pull some statistics from a couple of log files recently, some of which were gzip-compressed. The obvious Python program just eats your first CPU's cycles parsing data into strings while the disk runs idle, but using the subprocess module to spawn a couple of gzgrep's in parallel that find the relevant lines, and then using Python to extract and aggregate the relevant information from them, does the job in no time.

Stefan
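Stefan's approach can be sketched roughly as follows. The file names and the "LF01" pattern are illustrative, and plain grep stands in for gzgrep so the sketch runs on uncompressed files; it assumes a Unix-like environment with grep on the PATH:

```python
import os
import subprocess
import tempfile

# Create a few throwaway files standing in for the real logs.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.log", "LF01 123\nnoise\n"),
                   ("b.log", "noise only\n"),
                   ("c.log", "LF01 456\n")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

# Start one grep per file; the processes run concurrently, so the
# line filtering happens outside the Python interpreter.
procs = [(p, subprocess.Popen(["grep", "-l", "LF01", p],
                              stdout=subprocess.PIPE)) for p in paths]

# Aggregate in Python: grep exits with status 0 when the file matched.
matches = [p for p, proc in procs if proc.wait() == 0]
print(sorted(os.path.basename(m) for m in matches))   # ['a.log', 'c.log']
```

For gzip-compressed files you would swap gzgrep (or zgrep) in for grep; the parallelism and the Python-side aggregation stay the same.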
Re: search speed
Diez B. Roggisch wrote:
> that's not necessarily the best thing to do if things have a
> record-like structure. The canonical answer to this is then to use a
> database to hold the data, instead of flat files. So if you have any
> chance to do that, you should try to stuff things in there.

It's worth mentioning to the OP that Python has a couple of database libraries in the stdlib: notably simple things like the various dbm-flavoured modules (see the anydbm module) that provide fast string-to-string hash mappings (which might well be enough in this case), but also a pretty powerful SQL database called sqlite3 which allows much more complex (and complicated) ways to find the needle in the haystack.

http://docs.python.org/library/persistence.html

Stefan
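To make the sqlite3 suggestion concrete, here is a minimal sketch. The table layout and the (customer, filename) pairs are invented for illustration; in practice the pairs would come from one initial scan of the 1000 files:

```python
import sqlite3

# One-time indexing step: store (customer, filename) pairs found while
# scanning the files.  Hard-coded here for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (customer TEXT, filename TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [
    ("12345", "file_001.txt"),
    ("12345", "file_042.txt"),
    ("67890", "file_007.txt"),
])

# Every later search is a single query instead of a multi-minute rescan.
rows = conn.execute("SELECT filename FROM hits WHERE customer = ? "
                    "ORDER BY filename", ("12345",)).fetchall()
print([r[0] for r in rows])   # ['file_001.txt', 'file_042.txt']
```

A dbm mapping from customer number to a newline-joined list of filenames would work just as well for this simple lookup, as Tim Chase's script elsewhere in the thread shows.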
Re: search speed
> Today this works fine, it saves me a lot of manual work, but a search
> takes around 5 min, so my question is: is there another way of
> searching in a file (today I step line by line and check)?

If the files you are searching are located somewhere else on a network, you may find that much of the 5 minutes is actually the network delay in fetching each file. (Although you said something about your dir being cached?)

Cheers,
Re: search speed
> I have written a Python program that searches for a specific customer
> in files (around 1000 files); the trigger is LF01 + CUSTOMERNO

While most of the solutions folks have offered involve scanning all the files each time you search, if the content of those files doesn't change much, you can build an index once and then query the resulting index multiple times. Because I was bored, I threw together the code below (after the "---" divider) which does what you detail as best I understand it, allowing you to do

  python tkc.py 31415

to find the files containing CUSTOMERNO=31415. The first time, it's slow because it needs to create the index file. However, subsequent runs should be pretty speedy. You can also specify multiple customers on the command line:

  python tkc.py 31415 1414 7

and it will search for each of them. I presume they're found by the regexp "LF01(\d+)" based on your description, that the file can be sensibly broken into lines, and the code allows for multiple results on the same line. Adjust accordingly if that's not the pattern you want or the conditions you expect. If your source files change, you can reinitialize the database with

  python tkc.py -i

You can also change the glob pattern used for indexing -- by default, I assumed they were "*.txt". But you can either override the default with

  python tkc.py -i -p "*.dat"

or you can change the source to default differently (or even skip the glob-check completely...look for the fnmatch() call). There are a few more options. Just use

  python tkc.py --help

as usual. It's also a simple demo of the optparse module if you've never used it.

Enjoy!

-tkc

PS: as an aside, how do I import just the fnmatch function?
I tried both of the following and neither worked:

  from glob.fnmatch import fnmatch
  from glob import fnmatch.fnmatch

I finally resorted to the contortion coded below in favor of

  import glob
  fnmatch = glob.fnmatch.fnmatch

----------------------------------------------------------------------
#!/usr/bin/env python
import dbm
import os
import re
from glob import fnmatch
fnmatch = fnmatch.fnmatch
from optparse import OptionParser

customer_re = re.compile(r"LF01(\d+)")

def build_parser():
    parser = OptionParser(
        usage="%prog [options] [cust#1 [cust#2 ... ]]",
        )
    parser.add_option("-i", "--index", "--reindex",
        action="store_true",
        dest="reindex",
        default=False,
        help="Reindex files found in the current directory "
            "in the event any files have changed",
        )
    parser.add_option("-p", "--pattern",
        action="store",
        dest="pattern",
        default="*.txt",
        metavar="GLOB_PATTERN",
        help="Index files matching GLOB_PATTERN",
        )
    parser.add_option("-d", "--db", "--database",
        action="store",
        dest="indexfile",
        default=".index",
        metavar="FILE",
        help="Use the index stored at FILE",
        )
    parser.add_option("-v", "--verbose",
        action="count",
        dest="verbose",
        default=0,
        help="Increase verbosity",
        )
    return parser

def reindex(options, db):
    if options.verbose:
        print "Indexing..."
    for path, dirs, files in os.walk('.'):
        for fname in files:
            if fname == options.indexfile:
                # ignore our database file
                continue
            if not fnmatch(fname, options.pattern):
                # ensure that it matches our pattern
                continue
            fullname = os.path.join(path, fname)
            if options.verbose:
                print fullname
            f = file(fullname)
            found_so_far = set()
            for line in f:
                for customer_number in customer_re.findall(line):
                    if customer_number in found_so_far:
                        continue
                    found_so_far.add(customer_number)
                    try:
                        val = '\n'.join([
                            db[customer_number],
                            fullname,
                            ])
                        if options.verbose > 1:
                            print "Appending %s" % customer_number
                    except KeyError:
                        if options.verbose > 1:
                            print "Creating %s" % customer_number
                        val = fullname
                    db[customer_number] = val
            f.close()

if __name__ == "__main__":
    parser = build_parser()
    opt, args = parser.parse_args()
    reindexed = False
    if opt.reindex or not os.path.exists("%s.db" % opt.indexfile):
        db = dbm.open(opt.indexfile, 'n')
        reindex(opt, db)
        reindexed = True
    else:
        db = dbm.open(opt.indexfile, 'r')
    if not (args or reindexed):
        parser.print_help()
    for arg in args:
        print "%s:" % arg,
        try:
            val = db[arg]
            print
            for item in val.splitlines():
                print "  %s" % item
        except KeyError:
            print "Not found"
    db.close()
Re: search speed
Quoth Tim Chase t...@thechases.com:
> PS: as an aside, how do I import just the fnmatch function?
> I tried both of the following and neither worked:
>
>   from glob.fnmatch import fnmatch
>   from glob import fnmatch.fnmatch
>
> I finally resorted to the contortion coded below in favor of
>
>   import glob
>   fnmatch = glob.fnmatch.fnmatch

What you want is:

  from fnmatch import fnmatch

fnmatch is its own module; it just happens to be in the (non-__all__) namespace of the glob module because glob uses it.

--RDM
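A quick way to convince yourself of this at the interpreter; the sample filenames are arbitrary:

```python
from fnmatch import fnmatch

# fnmatch is a top-level stdlib module; glob merely imports it.
print(fnmatch("report_2009.txt", "*.txt"))   # True
print(fnmatch("report_2009.dat", "*.txt"))   # False
```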
Re: search speed
On Jan 29, 5:51 pm, anders anders.u.pers...@gmail.com wrote:
> if file.findInFile("LF01"):
>
> Is there any library like this??
>
> Best Regards
> Anders

Yea, it's called a for loop!

  for line in file:
      if string in line:
          do_this()
Re: search speed
On Jan 30, 2:56 pm, r rt8...@gmail.com wrote:
> On Jan 29, 5:51 pm, anders anders.u.pers...@gmail.com wrote:
>> if file.findInFile("LF01"):
>>
>> Is there any library like this??
>>
>> Best Regards
>> Anders
>
> Yea, it's called a for loop!
>
>   for line in file:
>       if string in line:
>           do_this()

Which is what the OP is already doing:

> (Today i step line for line and check)

anders, you might have more luck with one of the text-search libraries out there:

PyLucene (although this makes Java a dependency):
http://lucene.apache.org/pylucene/

Nucular:
http://nucular.sourceforge.net/

mxTextTools:
http://www.egenix.com/products/python/mxBase/mxTextTools/