percent faster than format()? (was: Re: optomizations)

2013-04-23 Thread Ulrich Eckhardt

Am 23.04.2013 06:00, schrieb Steven D'Aprano:

If it comes down to micro-optimizations to shave a few microseconds off,
consider using string % formatting rather than the format method.


Why? I don't see any obvious difference between the two...


Greetings!

Uli

--
http://mail.python.org/mailman/listinfo/python-list


Re: percent faster than format()? (was: Re: optomizations)

2013-04-23 Thread Chris “Kwpolska” Warrick
On Tue, Apr 23, 2013 at 9:46 AM, Ulrich Eckhardt
ulrich.eckha...@dominolaser.com wrote:
 Am 23.04.2013 06:00, schrieb Steven D'Aprano:

 If it comes down to micro-optimizations to shave a few microseconds off,
 consider using string % formatting rather than the format method.


 Why? I don't see any obvious difference between the two...


 Greetings!

 Uli


$ python -m timeit "a = '{0} {1} {2}'.format(1, 2, 42)"
100 loops, best of 3: 0.824 usec per loop
$ python -m timeit "a = '%s %s %s' % (1, 2, 42)"
1000 loops, best of 3: 0.0286 usec per loop

--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail                | always bottom-post
http://asciiribbon.org        | http://caliburn.nl/topposting.html


Re: optomizations

2013-04-23 Thread Chris Angelico
On Tue, Apr 23, 2013 at 11:53 AM, Roy Smith r...@panix.com wrote:
 In article mailman.944.1366680414.3114.python-l...@python.org,
  Rodrick Brown rodrick.br...@gmail.com wrote:

 I would like some feedback on possible solutions to make this script run
 faster.

 If I had to guess, I would think this stuff:

 line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
 line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
 line = line.replace('cdn.xxx', 'www.xxx')
 line = line.replace('cdn.xxx', 'www.xxx')
 line = line.replace('cdn.xx', 'www.xx')
 siteurl = line.split()[6].split('/')[2]
 line = re.sub(r'\bhttps?://%s\b' % siteurl, '', line, 1)

 You make 6 copies of every line.  That's slow.

One of those is a regular expression substitution, which is also
likely to be a hot-spot. But definitely profile.

ChrisA


Re: percent faster than format()? (was: Re: optomizations)

2013-04-23 Thread Steven D'Aprano
On Tue, 23 Apr 2013 09:46:53 +0200, Ulrich Eckhardt wrote:

 Am 23.04.2013 06:00, schrieb Steven D'Aprano:
 If it comes down to micro-optimizations to shave a few microseconds
 off, consider using string % formatting rather than the format method.
 
 Why? I don't see any obvious difference between the two...


Possibly the state of the art has changed since then, but some years ago 
% formatting was slightly faster than the format method. Let's try it and 
see:

# Using Python 3.3.

py> from timeit import Timer
py> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
py> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
py> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
py> print(min(t1.repeat()))
0.8319804421626031
py> print(min(t2.repeat()))
1.2395259491167963


Looks like the format method is about 50% slower.



-- 
Steven


Re: percent faster than format()? (was: Re: optomizations)

2013-04-23 Thread Chris Angelico
On Wed, Apr 24, 2013 at 12:36 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 # Using Python 3.3.

 py> from timeit import Timer
 py> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
 py> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
 py> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
 py> print(min(t1.repeat()))
 0.8319804421626031
 py> print(min(t2.repeat()))
 1.2395259491167963


 Looks like the format method is about 50% slower.

Figures on my hardware are (naturally) different, with a similar (but
slightly more pronounced) difference:

>>> sys.version
'3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)]'
>>> print(min(t1.repeat()))
1.4841416995735415
>>> print(min(t2.repeat()))
2.5459869899666074
>>> t3 = Timer("a+', '+b+' and '+c+' for breakfast'", setup)
>>> print(min(t3.repeat()))
1.5707538248576327
>>> t4 = Timer("''.join([a, ', ', b, ' and ', c, ' for breakfast'])", setup)
>>> print(min(t4.repeat()))
1.5026834416105999

So on the face of it, format() is slower than everything else by a
good margin... until you note that repeat() is doing one million
iterations, so those figures are effectively in microseconds. Yeah, I
think I can handle a couple of microseconds.

ChrisA


optomizations

2013-04-22 Thread Rodrick Brown
I would like some feedback on possible solutions to make this script run
faster.
The system is pegged at 100% CPU and it takes a long time to complete.


#!/usr/bin/env python

import gzip
import re
import os
import sys
from datetime import datetime
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', dest='inputfile', type=str,
                        help='data file to parse')
    parser.add_argument('-o', dest='outputdir', type=str,
                        default=os.getcwd(), help='Output directory')
    args = parser.parse_args()

    if len(sys.argv[1:]) < 1:
        parser.print_usage()
        sys.exit(-1)

    print(args)
    if args.inputfile and os.path.exists(args.inputfile):
        try:
            with gzip.open(args.inputfile) as datafile:
                for line in datafile:
                    line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
                    line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xx', 'www.xx')
                    siteurl = line.split()[6].split('/')[2]
                    line = re.sub(r'\bhttps?://%s\b' % siteurl, '', line, 1)

                    (day, month, year, hour, minute, second) = \
                        (line.split()[3]).replace('[', '').replace(':', '/').split('/')
                    datelog = '{} {} {}'.format(month, day, year)
                    dateobj = datetime.strptime(datelog, '%b %d %Y')

                    outfile = '{}{}{}_combined.log'.format(
                        dateobj.year, dateobj.month, dateobj.day)
                    outdir = (args.outputdir + os.sep + siteurl)

                    if not os.path.exists(outdir):
                        os.makedirs(outdir)

                    with open(outdir + os.sep + outfile, 'w+') as outf:
                        outf.write(line)

        except IOError, err:
            sys.stderr.write("Error unable to read or extract inputfile: {} "
                             "{}\n".format(args.inputfile, err))
            sys.exit(-1)


Re: optomizations

2013-04-22 Thread Chris Angelico
On Tue, Apr 23, 2013 at 11:19 AM, Rodrick Brown rodrick.br...@gmail.com wrote:
 with gzip.open(args.inputfile) as datafile:
 for line in datafile:
 outfile = '{}{}{}_combined.log'.format(dateobj.year, dateobj.month, dateobj.day)
 outdir = (args.outputdir + os.sep + siteurl)

 with open(outdir + os.sep + outfile, 'w+') as outf:
 outf.write(line)

You're opening files and closing them again for every line. This
wouldn't cause you to spin the CPU (more likely it'd thrash the hard
disk - unless you have an SSD), but it is certainly an optimization
target.

Can you know in advance what files you need? If not, I'd try something
like this:

outf = {} # Might want a better name though

...
    outfile = ...
    if outfile not in outf:
        os.makedirs(...)
        outf[outfile] = open(...)
    outf[outfile].write(line)

for f in outf.values():
    f.close()

Open files only as needed, close 'em all at the end.
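Fleshing that sketch out into something runnable (the function name and the (siteurl, outfile, line) triples are invented here to stand in for the parsing the original loop does):

```python
import os

def route_lines(records, outputdir):
    """Write each line to its per-site output file, opening every file once.

    `records` is a hypothetical iterable of (siteurl, outfile, line)
    triples standing in for the parsing in the original loop.
    """
    handles = {}  # full path -> open file object
    try:
        for siteurl, outfile, line in records:
            outdir = os.path.join(outputdir, siteurl)
            path = os.path.join(outdir, outfile)
            if path not in handles:
                if not os.path.exists(outdir):
                    os.makedirs(outdir)
                # Append mode: many lines accumulate in the same file.
                handles[path] = open(path, 'a')
            handles[path].write(line)
    finally:
        # Close 'em all at the end, even if a write fails mid-stream.
        for f in handles.values():
            f.close()
```

Each distinct output file is opened exactly once, instead of once per line.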

ChrisA


Re: optomizations

2013-04-22 Thread Roy Smith
In article mailman.944.1366680414.3114.python-l...@python.org,
 Rodrick Brown rodrick.br...@gmail.com wrote:

 I would like some feedback on possible solutions to make this script run
 faster.

If I had to guess, I would think this stuff:

 line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
 line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
 line = line.replace('cdn.xxx', 'www.xxx')
 line = line.replace('cdn.xxx', 'www.xxx')
 line = line.replace('cdn.xx', 'www.xx')
 siteurl = line.split()[6].split('/')[2]
 line = re.sub(r'\bhttps?://%s\b' % siteurl, '', line, 1)

You make 6 copies of every line.  That's slow.  But I'm also going to 
quote something I wrote here a couple of months back:

 I've been doing some log analysis.  It's been taking a grovelingly long 
 time, so I decided to fire up the profiler and see what's taking so 
 long.  I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots 
 might be (looking up IP addresses in the geolocation database, or 
 producing some pretty pictures using matplotlib).  It was just a matter 
 of figuring out which it was. 
 
 As with most attempts to out-guess the profiler, I was totally, 
 absolutely, and embarrassingly wrong. 

So, my real advice to you is to fire up the profiler and see what it 
says.
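For what it's worth, a minimal profiling run is only a few lines of stdlib; the `workload` function here is just a stand-in for the real log-munging loop:

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the real per-line processing.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time so the expensive call chains float to the top.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())  # top 5 entries by cumulative time
```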


Re: optomizations

2013-04-22 Thread MRAB

On 23/04/2013 02:19, Rodrick Brown wrote:

I would like some feedback on possible solutions to make this script run
faster.
The system is pegged at 100% CPU and it takes a long time to complete.


#!/usr/bin/env python

import gzip
import re
import os
import sys
from datetime import datetime
import argparse

if __name__ == '__main__':
 parser = argparse.ArgumentParser()
 parser.add_argument('-f', dest='inputfile', type=str, help='data file to parse')
 parser.add_argument('-o', dest='outputdir', type=str, default=os.getcwd(), help='Output directory')
 args = parser.parse_args()

 if len(sys.argv[1:]) < 1:
 parser.print_usage()
 sys.exit(-1)

 print(args)
 if args.inputfile and os.path.exists(args.inputfile):
 try:
 with gzip.open(args.inputfile) as datafile:
 for line in datafile:
 line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
 line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')


These next 2 lines are duplicates; the second will have no effect (I
think!).


 line = line.replace('cdn.xxx', 'www.xxx')
 line = line.replace('cdn.xxx', 'www.xxx')


Won't the next line also do the work of the preceding 2 lines?


 line = line.replace('cdn.xx', 'www.xx')
 siteurl = line.split()[6].split('/')[2]
 line = re.sub(r'\bhttps?://%s\b' % siteurl, '', line, 1)

 (day, month, year, hour, minute, second) = (line.split()[3]).replace('[','').replace(':','/').split('/')
 datelog = '{} {} {}'.format(month, day, year)
 dateobj = datetime.strptime(datelog, '%b %d %Y')

 outfile = '{}{}{}_combined.log'.format(dateobj.year, dateobj.month, dateobj.day)
 outdir = (args.outputdir + os.sep + siteurl)

 if not os.path.exists(outdir):
 os.makedirs(outdir)

 with open(outdir + os.sep + outfile, 'w+') as outf:
 outf.write(line)

 except IOError, err:
 sys.stderr.write("Error unable to read or extract inputfile: {} {}\n".format(args.inputfile, err))
 sys.exit(-1)


I wonder whether it'll make a difference if you read a chunk at a time
(datafile.read(chunk_size) + datafile.readline() to ensure you have
complete lines), perform the replacements on it (so that you're working
on several lines in one go), and then split it into lines for further
processing.

Another thing you could try is caching the result of parsing the date,
using (month, day, year) as the key and outfile as the value in a dict.


A third thing you could try is not writing a file for every line
(doesn't the 'w+' mode truncate the file?), but save the output for
each chunk (see first suggestion) and then write the files afterwards,
at the end of the chunk.



Re: optomizations

2013-04-22 Thread Dan Stromberg
On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith r...@panix.com wrote:


 So, my real advice to you is to fire up the profiler and see what it
 says.


I agree.

Fire up a line-oriented profiler and only then start trying to improve the
hot spots.


Re: optomizations

2013-04-22 Thread Steven D'Aprano
On Mon, 22 Apr 2013 21:19:23 -0400, Rodrick Brown wrote:

 I would like some feedback on possible solutions to make this script run
 faster.
 The system is pegged at 100% CPU and it takes a long time to complete.

Have you profiled the app to see where it is spending all its time?

What does "a long time" mean? For instance:

It takes two hours to process a 15KB file -- you have a problem.

It takes 20 minutes to process a 15GB file -- and why are you 
complaining?


Or somewhere in the middle... 


But before profiling, I suggest you clean up the program. For example:

if args.inputfile and os.path.exists(args.inputfile):

Don't do that. There really isn't any point in checking whether the input 
file exists, since:

1) Just because it exists doesn't mean you can read it;

2) Just because you can read it doesn't mean it is a valid gzip file;

3) Just because it is a valid gzip file that you can read *now*, doesn't 
mean that it still will be in 10 milliseconds when you actually try to 
open the file.


A lot can happen in 10ms, or 1ms. The file might be deleted, or 
overwritten, or permissions changed. Change that to:

try:
    with gzip.open(args.inputfile) as datafile:
        for line in datafile:

and catch the exception if the file doesn't exist, or cannot be read. 
Which you already do, which just demonstrates that the call to 
os.path.exists is a waste of effort. 
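Spelled out as a self-contained sketch (the line-counting body is just a placeholder for the real work):

```python
import gzip

def count_lines(inputfile):
    """Attempt the open directly; every failure mode surfaces as one exception."""
    try:
        with gzip.open(inputfile, 'rt') as datafile:
            return sum(1 for _ in datafile)  # placeholder per-line work
    except OSError:
        # Missing file, bad permissions and corrupt gzip data all land
        # here (IOError is an alias of OSError on Python 3).
        return None
```

No race window: the open either succeeds or tells you exactly why it didn't.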


Then look for wasted effort like this:

line = line.replace('cdn.xxx', 'www.xxx')
line = line.replace('cdn.xx', 'www.xx')


Surely the first line is redundant, since it would be correctly caught 
and replaced by the second?

Also, you're searching the file system *for every line* in the input 
file. Pull this outside of the loop and have it run once:

if not os.path.exists(outdir):
    os.makedirs(outdir)

Likewise for opening and closing the output file, which you currently 
open and close for every line. It only needs to be opened and closed 
once.

If it comes down to micro-optimizations to shave a few microseconds off, 
consider using string % formatting rather than the format method. But 
really, if you find yourself shaving microseconds off something that runs 
for ten minutes, you have to ask why you're bothering.



-- 
Steven


Re: optomizations

2013-04-22 Thread Chris Angelico
On Tue, Apr 23, 2013 at 2:00 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Also, you're searching the file system *for every line* in the input
 file. Pull this outside of the loop and have it run once:

 if not os.path.exists(outdir):
     os.makedirs(outdir)

 Likewise for opening and closing the output file, which you currently
 open and close it for every line. It only needs to be opened and closed
 once.

The outdir depends on the line, though. Hence my suggestion to retain
the open files in a dictionary.

ChrisA


Re: optomizations

2013-04-22 Thread Rodrick Brown
On Apr 22, 2013, at 11:18 PM, Dan Stromberg drsali...@gmail.com wrote:


On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith r...@panix.com wrote:


 So, my real advice to you is to fire up the profiler and see what it
 says.


I agree.

Fire up a line-oriented profiler and only then start trying to improve the
hot spots.


Got a doc or URL? I have no experience working with Python profilers.




Re: optomizations

2013-04-22 Thread Steven D'Aprano
On Tue, 23 Apr 2013 00:20:59 -0400, Rodrick Brown wrote:

 Got a doc or URL? I have no experience working with Python profilers.


https://duckduckgo.com/html/?q=python%20profiler



This is also good:

http://pymotw.com/2/profile/
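For a zero-setup first pass there's also cProfile's command-line mode; the script name and arguments below are hypothetical, substitute your own:

```shell
# Profile the whole run; sort by cumulative time so the expensive
# call chains float to the top of the report.
python -m cProfile -s cumulative parse_logs.py -f access.log.gz -o /tmp/out
```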



-- 
Steven