Re: search speed

2009-02-01 Thread Aaron Watters
On Jan 30, 3:49 am, Diez B. Roggisch de...@nospam.web.de wrote:
 alex23 gave you a set of tools that you can use for full-text-search.
 However, that's not necessarily the best thing to do if things have a
 record-like structure.

In Nucular (and others, I think) you can do searches for terms anywhere
(full text), searches for terms within fields, searches for prefixes in
fields, searches based on field inequality, or searches for a field's exact
value.  I would argue this subsumes the standard fielded approach.
   -- Aaron Watters
===
Oh, I'm a lumberjack and I'm O.K...

--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-31 Thread anders
Thanks to everyone who spent time helping me, the help was great.
Best regards, Anders
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-31 Thread Tim Rowe
2009/1/30 Scott David Daniels scott.dani...@acm.org:

 Be careful with your assertion that a regex is faster, it is certainly
 not always true.

I was careful *not* to assert that a regex would be faster, merely
that it was *likely* to be in this case.


-- 
Tim Rowe
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Diez B. Roggisch

anders schrieb:

Hi!
I have written a Python program that searches for a specific customer in
files (around 1000 files); the trigger is LF01 + CUSTOMERNO.

So I read all the files with dircache.

Then I loop through all the files; each file is read with readlines() and
after that scanned.

Today this works fine and it saves me a lot of manual work, but a search
takes around 5 min, so my question is: is there another way to search in a
file?  (Today I step line by line and check.)

What I would like to find is just the filenames of the files with the
customer data in them; there can be, and often is, more than one.

English is not my first language and I hope someone understands my
beginner question.  What I am looking for is something like

if file.findInFile(LF01):
...

Is there any library like this ??


No. Because nobody can automagically infer whatever structure your files 
have.


alex23 gave you a set of tools that you can use for full-text-search. 
However, that's not necessarily the best thing to do if things have a 
record-like structure. The canonical answer to this is then to use a 
database to hold the data, instead of flat files. So if you have any 
chance to do that, you should try and stuff things in there.



Diez
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Justin Wyer
On Fri, Jan 30, 2009 at 1:51 AM, anders anders.u.pers...@gmail.com wrote:

 Hi!
 I have written a Python program that searches for a specific customer in
 files (around 1000 files); the trigger is LF01 + CUSTOMERNO.

 So I read all the files with dircache.

 Then I loop through all the files; each file is read with readlines() and
 after that scanned.

 Today this works fine and it saves me a lot of manual work, but a search
 takes around 5 min, so my question is: is there another way to search in a
 file?  (Today I step line by line and check.)


Do you require this information in a Python application?  It seems like you
did this manually before.

If not, then Python is the wrong tool for this job; you can simply use this
command in a Unix-like environment (install Cygwin if you are on Windows):

$ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01" {} \;
  | cut -d ":" -f 1 | sort | uniq

Now if you do require this information inside a Python app, I would just do
the above in Python:

from subprocess import Popen, PIPE

filenames = []
searchCmd = ('find path_to_dirs_containing_files -name "*" -exec grep -nH '
             '"LF01" {} \\; | cut -d ":" -f 1 | sort | uniq')
searchp = Popen(searchCmd, shell=True, bufsize=4096, stdout=PIPE)
for line in searchp.stdout:
  filenames.append(line.strip())

That's my advice anyway.  I guess you can try some search libraries, though I
don't know of any myself; the above will probably be faster than anything else.

Cheers and good luck.
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread D'Arcy J.M. Cain
On Fri, 30 Jan 2009 15:46:33 +0200
Justin Wyer justinw...@gmail.com wrote:
 $ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01" {} \;
 | cut -d ":" -f 1 | sort | uniq

I know this isn't a Unix group but please allow me to suggest instead;

  $ grep -lR LF01 path_to_dirs_containing_files
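
If you do want to stay in Python, a rough equivalent of -l (print each
matching file name once and stop reading it at the first hit) would be
something along these lines -- the search string and top directory are
just placeholders:

import os

def files_containing(text, top):
    # Walk the tree and report each file that contains the text,
    # stopping at the first hit per file (like grep -l).
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            with open(fullname) as f:
                for line in f:
                    if text in line:
                        yield fullname
                        break

for name in files_containing("LF01", "path_to_dirs_containing_files"):
    print(name)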

-- 
D'Arcy J.M. Cain da...@druid.net |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Tim Rowe
2009/1/30 Diez B. Roggisch de...@nospam.web.de:

 No. Because nobody can automagically infer whatever structure your files
 have.

Just so. But even without going to a full database solution it might
be possible to make use of the flat file structure. For example, does
the LF01 have to appear at a specific position in the input line? If
so, there's no need to search for it in the complete line. *If* there
is any such structure then a compiled regexp search is likely to be
faster than just 'if LF01 in line', and (provided it's properly
designed) provides a bit of extra insurance against false positives.
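
Something like this, say -- purely a sketch, and the column offset is
invented, so adjust it to wherever the trigger really sits in your records:

import re

# Hypothetical layout: five leading characters, then "LF01" and the customer number.
trigger = re.compile(r".{5}LF01(\d+)")

def customers_in(lines):
    for line in lines:
        m = trigger.match(line)   # match() only looks at the start of the line
        if m:
            yield m.group(1)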

-- 
Tim Rowe
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Scott David Daniels

Tim Rowe wrote:

 But even without going to a full database solution it might
be possible to make use of the flat file structure. For example, does
the LF01 have to appear at a specific position in the input line? If
so, there's no need to search for it in the complete line. *If* there
is any such structure then a compiled regexp search is likely to be
faster than just 'if LF01 in line', and (provided it's properly
designed) provides a bit of extra insurance against false positives.


Clearly this is someone who regularly uses grep or perl.  If you
know the structure, like the position in a line, something like
the following should be fast:

with open(somename) as source:
    for n, line in enumerate(source):
        if n % 5 == 3 and line[5 : 9] == 'LF01':
            print ('Found on line %s: %s' % (1 + n, line.rstrip()))

Be careful with your assertion that a regex is faster, it is certainly
not always true.  Measure speed, don't take mantras as gospel.
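
For instance, something like this (the sample line is made up; run it on
your own data before believing either number):

import timeit

setup = r"""
import re
line = 'xxxxxLF01123456' + 'y' * 60   # made-up sample record
pat = re.compile(r'LF01\d+')
"""
print(timeit.timeit("'LF01' in line", setup=setup, number=1000000))
print(timeit.timeit("pat.search(line)", setup=setup, number=1000000))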

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread John Machin
D'Arcy J.M. Cain darcy at druid.net writes:

 
 On Fri, 30 Jan 2009 15:46:33 +0200
 Justin Wyer justinwyer at gmail.com wrote:
  $ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01" {} \;
  | cut -d ":" -f 1 | sort | uniq
 
 I know this isn't a Unix group but please allow me to suggest instead;
 
   $ grep -lR LF01 path_to_dirs_containing_files


and if the OP is on Windows: an alternative to cygwin is the GnuWin32 collection
of Gnu utilities ported to Windows. See http://gnuwin32.sourceforge.net/ ...
you'll want the Grep package but I'd suggest the CoreUtils package as worth a
detailed look, and do scan through the whole list of packages while you're 
there.

HTH,
John




--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Stefan Behnel
D'Arcy J.M. Cain wrote:
 On Fri, 30 Jan 2009 15:46:33 +0200
 Justin Wyer justinw...@gmail.com wrote:
 $ find path_to_dirs_containing_files -name "*" -exec grep -nH "LF01" {} \;
 | cut -d ":" -f 1 | sort | uniq
 
 I know this isn't a Unix group but please allow me to suggest instead;
 
   $ grep -lR LF01 path_to_dirs_containing_files

That's very good advice.  I had to pull some statistics from a couple of
log files recently, some of which were gzip compressed.  The obvious Python
program just eats your first CPU's cycles parsing data into strings while
the disk runs idle, but using the subprocess module to spawn a couple of
gzgreps in parallel that find the relevant lines, and then using Python to
extract and aggregate the relevant information from them, does the job in
no time.
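
In outline it looked roughly like this (a sketch only -- file names, the
pattern and the aggregation are invented, and plain grep stands in for
gzgrep):

from subprocess import Popen, PIPE

logfiles = ["app1.log", "app2.log"]            # made-up names
procs = [Popen(["grep", "LF01", name], stdout=PIPE) for name in logfiles]

counts = {}
for name, proc in zip(logfiles, procs):        # the greps run in parallel
    for line in proc.stdout:                   # only matching lines reach Python
        counts[name] = counts.get(name, 0) + 1
    proc.wait()
print(counts)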

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Stefan Behnel
Diez B. Roggisch wrote:
 that's not necessarily the best thing to do if things have a
 record-like structure. The canonical answer to this is then to use a
 database to hold the data, instead of flat files. So if you have any
 chance to do that, you should try and stuff things in there.

It's worth mentioning to the OP that Python has a couple of database
libraries in the stdlib, notably simple things like the various dbm
flavoured modules (see the anydbm module) that provide fast
string-to-string hash mappings (which might well be enough in this case),
but also a pretty powerful SQL database called sqlite3 which allows much
more complex (and complicated) ways to find the needle in the haystack.
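
A tiny sqlite3 sketch of the idea (the table layout and values here are
made up):

import sqlite3

conn = sqlite3.connect("customer_index.db")
conn.execute("CREATE TABLE IF NOT EXISTS hits (customer TEXT, filename TEXT)")
conn.execute("INSERT INTO hits VALUES (?, ?)", ("31415", "orders_january.txt"))
conn.commit()

# Look up every file recorded for a given customer number.
for (filename,) in conn.execute(
        "SELECT DISTINCT filename FROM hits WHERE customer = ?", ("31415",)):
    print(filename)
conn.close()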

http://docs.python.org/library/persistence.html

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Jervis Whitley


 Today this works fine and it saves me a lot of manual work, but a search
 takes around 5 min, so my question is: is there another way to search in a
 file?
 (Today I step line by line and check)


If the files you are searching are located at some other location on a
network, you may find that much of the 5 minutes is actually the network
delay in fetching each file. (Although you said something about your dir
being cached?)

Cheers,
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread Tim Chase

I have written a Python program that searches for a specific customer in
files (around 1000 files)
the trigger is LF01 + CUSTOMERNO


While most of the solutions folks have offered involve scanning 
all the files each time you search, if the content of those files 
doesn't change much, you can build an index once and then query 
the resulting index multiple times.  Because I was bored, I threw 
together the code below (after the --- divider) which does 
what you detail as best I understand, allowing you to do


  python tkc.py 31415

to find the files containing CUSTOMERNO=31415.  The first time, 
it's slow because it needs to create the index file.  However, 
subsequent runs should be pretty speedy.  You can also specify 
multiple customers on the command-line:


  python tkc.py 31415 1414 7

and it will search for each of them.  I presume they're found by 
the regexp "LF01(\d+)" based on your description, that the file 
can be sensibly broken into lines, and the code allows for 
multiple results on the same line.  Adjust accordingly if that's 
not the pattern you want or the conditions you expect.


If your source files change, you can reinitialize the database with

  python tkc.py -i

You can also change the glob pattern used for indexing -- by 
default, I assumed they were "*.txt".  But you can either 
override the default with


  python tkc.py -i -p "*.dat"

or you can change the source to default differently (or even skip 
the glob-check completely...look for the fnmatch() call).  There 
are a few more options.  Just use


  python tkc.py --help

as usual.  It's also a simple demo of the optparse module if 
you've never used it.


Enjoy!

-tkc

PS:  as an aside, how do I import just the fnmatch function?  I 
tried both of the following and neither worked:


  from glob.fnmatch import fnmatch
  from glob import fnmatch.fnmatch

I finally resorted to the contortion coded below in favor of
  import glob
  fnmatch = glob.fnmatch.fnmatch

-


#!/usr/bin/env python
import dbm
import os
import re
from glob import fnmatch
fnmatch = fnmatch.fnmatch
from optparse import OptionParser

customer_re = re.compile(r"LF01(\d+)")

def build_parser():
  parser = OptionParser(
    usage="%prog [options] [cust#1 [cust#2 ... ]]",
    )
  parser.add_option("-i", "--index", "--reindex",
    action="store_true",
    dest="reindex",
    default=False,
    help="Reindex files found in the current directory "
      "in the event any files have changed",
    )
  parser.add_option("-p", "--pattern",
    action="store",
    dest="pattern",
    default="*.txt",
    metavar="GLOB_PATTERN",
    help="Index files matching GLOB_PATTERN",
    )
  parser.add_option("-d", "--db", "--database",
    action="store",
    dest="indexfile",
    default=".index",
    metavar="FILE",
    help="Use the index stored at FILE",
    )
  parser.add_option("-v", "--verbose",
    action="count",
    dest="verbose",
    default=0,
    help="Increase verbosity"
    )
  return parser

def reindex(options, db):
  if options.verbose: print "Indexing..."
  for path, dirs, files in os.walk('.'):
    for fname in files:
      if fname == options.indexfile:
        # ignore our database file
        continue
      if not fnmatch(fname, options.pattern):
        # ensure that it matches our pattern
        continue
      fullname = os.path.join(path, fname)
      if options.verbose: print fullname
      f = file(fullname)
      found_so_far = set()
      for line in f:
        for customer_number in customer_re.findall(line):
          if customer_number in found_so_far: continue
          found_so_far.add(customer_number)
          try:
            # append this file to the customer's existing list
            val = '\n'.join([
              db[customer_number],
              fullname,
              ])
            if options.verbose > 1:
              print "Appending %s" % customer_number
          except KeyError:
            # first file seen for this customer
            if options.verbose > 1:
              print "Creating %s" % customer_number
            val = fullname
          db[customer_number] = val
      f.close()

if __name__ == "__main__":
  parser = build_parser()
  opt, args = parser.parse_args()
  reindexed = False
  if opt.reindex or not os.path.exists("%s.db" % opt.indexfile):
    # (re)build the index from scratch
    db = dbm.open(opt.indexfile, 'n')
    reindex(opt, db)
    reindexed = True
  else:
    db = dbm.open(opt.indexfile, 'r')
  if not (args or reindexed):
    parser.print_help()
  for arg in args:
    print "%s:" % arg,
    try:
      val = db[arg]
      print
      for item in val.splitlines():
        print "  %s" % item
    except KeyError:
      print "Not found"
  db.close()


--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-30 Thread rdmurray
Quoth Tim Chase t...@thechases.com:
 PS:  as an aside, how do I import just the fnmatch function?  I 
 tried both of the following and neither worked:
 
from glob.fnmatch import fnmatch
from glob import fnmatch.fnmatch
 
 I finally resorted to the contortion coded below in favor of
import glob
fnmatch = glob.fnmatch.fnmatch

What you want is:

from fnmatch import fnmatch

fnmatch is its own module, it just happens to be in the (non __all__)
namespace of the glob module because glob uses it.
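
For example (file names invented):

from fnmatch import fnmatch

print(fnmatch("customers_2009.txt", "*.txt"))   # True
print(fnmatch("customers_2009.dat", "*.txt"))   # False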

--RDM

--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-29 Thread r
On Jan 29, 5:51 pm, anders anders.u.pers...@gmail.com wrote:
 if file.findInFile(LF01):
 Is there any library like this ??
 Best Regards
 Anders

Yea, it's called a for loop!

for line in file:
    if string in line:
        do_this()
--
http://mail.python.org/mailman/listinfo/python-list


Re: search speed

2009-01-29 Thread alex23
On Jan 30, 2:56 pm, r rt8...@gmail.com wrote:
 On Jan 29, 5:51 pm, anders anders.u.pers...@gmail.com wrote:

  if file.findInFile(LF01):
  Is there any library like this ??
  Best Regards
  Anders

 Yea, it's called a for loop!

 for line in file:
     if string in line:
         do_this()

Which is what the OP is already doing:

 (Today i step line for line and check)

anders, you might have more luck with one of the text search libraries
out there:

PyLucene (although this makes Java a dependency): 
http://lucene.apache.org/pylucene/
Nucular: http://nucular.sourceforge.net/
mxTextTools: http://www.egenix.com/products/python/mxBase/mxTextTools/
--
http://mail.python.org/mailman/listinfo/python-list