Relative performance of comparable regular expressions

2009-01-13 Thread Barak, Ron
Hi,

I have a question about relative performance of comparable regular expressions.

I have large log files whose lines start with three-letter month names (non-unicode).

Which would give better performance: matching with ^[a-zA-Z]{3}, or with
^\S{3} ?
Also, which is better (if there is any difference at all): \d\d or \d{2} ?
Also, would matching . perform differently from matching the
actual character, e.g. matching : ?
And lastly, at the end of a line, is there any performance difference between
(.+)$ and (.+) ?

Thanks,
Ron.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Relative performance of comparable regular expressions

2009-01-13 Thread John Machin
On Jan 13, 7:24 pm, Barak, Ron ron.ba...@lsi.com wrote:
> Hi,
>
> I have a question about relative performance of comparable regular
> expressions.
>
> I have large log files that start with three letters month names
> (non-unicode).
>
> Which would give better performance, matching with  ^[a-zA-Z]{3}, or with
> ^\S{3} ?

(1) If you want to match at the start of a line, use re.match()
*without* the pointless ^. Don't use re.search with a pattern
starting with ^ -- it won't be any faster, and it could be a lot
worse; re.search doesn't know to stop if the first match fails:

command-prompt\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*100" "rx.match(text)"
100 loops, best of 3: 1.15 usec per loop

command-prompt\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*100" "rx.search(text)"
10 loops, best of 3: 4.47 usec per loop

command-prompt\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*1000" "rx.search(text)"
1 loops, best of 3: 34.1 usec per loop

(2) I think you mean ^\s{3} not ^\S{3}

(3) Now that you've seen how to do timings, over to you :-)
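
For readers who would rather run the comparison from a script than from the command line, here is a minimal sketch using the standard timeit module. The helper name usec_per_call and the choice of 20,000 calls per run are assumptions made for illustration, not part of the original timings. The absolute numbers depend on the Python version and the machine, and newer interpreters may treat a leading ^ differently, so measure rather than assume.

```python
import re
import timeit

text = "Z" * 1000                 # no "AB" anywhere, so every attempt fails
rx_anchored = re.compile(r"^AB")
rx_plain = re.compile(r"AB")


def usec_per_call(func, number=20_000, repeat=3):
    """Best-of-`repeat` time for one call of func, in microseconds.

    Hypothetical helper: it only wraps timeit.repeat for convenience.
    """
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e6


timings = {
    "match without ^": usec_per_call(lambda: rx_plain.match(text)),
    "match with ^": usec_per_call(lambda: rx_anchored.match(text)),
    "search with ^": usec_per_call(lambda: rx_anchored.search(text)),
}

for label, usec in timings.items():
    print(f"{label:16s} {usec:8.3f} usec per call")
```

The same helper can be pointed at the other pairs from the question (for example \d\d against \d{2}) by compiling both patterns and timing them on a representative log line.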

> Also, which is better (if different at all): \d\d or \d{2} ?
> Also, would matching . be different (performance-wise) than matching the
> actual character, e.g. matching : ?
> And lastly, at the end of a line, is there any performance difference between
> (.+)$ and (.+)

Cheers,
John


Re: Relative performance of comparable regular expressions

2009-01-13 Thread Steve Holden
Of course, if the log lines all begin with a string like "Dec 12 2009",
then you don't need regular expressions at all: just pull the
characters out by position using slicing. The month would be
string[0:3], and so on.
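
A minimal sketch of that slicing approach, assuming every line starts with a fixed-width "Mon DD YYYY" prefix as in the example above. The function name split_date_prefix, the MONTHS set, and the sample line are made up for illustration; a log that does not pad single-digit days to two characters would need different offsets.

```python
# English three-letter month abbreviations (an assumption about the log format).
MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}


def split_date_prefix(line):
    """Return (month, day, year) taken by position, or None if the
    first three characters are not a known month abbreviation."""
    month = line[0:3]
    if month not in MONTHS:
        return None
    # Offsets assume the layout "Mon DD YYYY": day at 4..5, year at 7..10.
    return month, line[4:6], line[7:11]


sample = "Dec 12 2009 disk check finished"   # made-up log line
print(split_date_prefix(sample))
```

Checking the slice against a set of month names also gives a cheap validity test without compiling any pattern.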

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



RE: Relative performance of comparable regular expressions

2009-01-13 Thread Barak, Ron
Hi John,

Thanks for the below - teaching me how to fish (instead of just giving me
a fish :-)
Now I can definitely get the answers for myself, and also be a bit more
enlightened.

As for your remark (2) below (on my question: Which would give better
performance, matching with ^[a-zA-Z]{3}, or with ^\S{3} ?):
"(2) I think you mean ^\s{3} not ^\S{3}",
I actually did mean to use \S, namely a character that is not whitespace.

Bye,
Ron.

-----Original Message-----
From: John Machin
Sent: Tuesday, January 13, 2009 11:15
To: python-list@python.org
Subject: Re: Relative performance of comparable regular expressions

On Jan 13, 7:24 pm, Barak, Ron ron.ba...@lsi.com wrote:
> Hi,
>
> I have a question about relative performance of comparable regular
> expressions.
>
> I have large log files that start with three letters month names
> (non-unicode).
>
> Which would give better performance, matching with  ^[a-zA-Z]{3}, or with
> ^\S{3} ?

(1) If you want to match at the start of a line, use re.match()
*without* the pointless ^. Don't use re.search with a pattern starting with
^ -- it won't be any faster, and it could be a lot worse; re.search
doesn't know to stop if the first match fails:

command-prompt\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*100" "rx.match(text)"
100 loops, best of 3: 1.15 usec per loop

command-prompt\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*100" "rx.search(text)"
10 loops, best of 3: 4.47 usec per loop

command-prompt\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*1000" "rx.search(text)"
1 loops, best of 3: 34.1 usec per loop

(2) I think you mean ^\s{3} not ^\S{3}

(3) Now that you've seen how to do timings, over to you :-)

> Also, which is better (if different at all): \d\d or \d{2} ?
> Also, would matching . be different (performance-wise) than matching the
> actual character, e.g. matching : ?
> And lastly, at the end of a line, is there any performance difference between
> (.+)$ and (.+)

Cheers,
John



Re: Relative performance of comparable regular expressions

2009-01-13 Thread Chris Rebert
On Tue, Jan 13, 2009 at 6:16 AM, Barak, Ron ron.ba...@lsi.com wrote:
> Hi John,
>
> Thanks for the below - teaching me how to fish (instead of just giving
> me a fish :-)
> Now I could definitely get the answers for myself, and also be a bit more
> enlightened.
>
> As for your (2) remark below (on my question: Which would give better
> performance, matching with ^[a-zA-Z]{3}, or with ^\S{3} ?):
> (2) I think you mean ^\s{3} not ^\S{3},
> I actually did mean to use \S, namely - a character that is not a
> white-space.

(A) Please don't top-post, it makes replying to you more awkward and
makes it harder for readers to follow the conversation.

(B) But ^[a-zA-Z]{3} and ^\S{3} aren't even equivalent! \S matches any
non-whitespace character, so it allows *digits* and *punctuation* too,
whereas the former matches *only* ASCII letters.
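
A small sketch of the difference, using made-up sample strings; the helper name starts_with is hypothetical. Both patterns are applied with match(), so no leading ^ is needed.

```python
import re

letters = re.compile(r"[a-zA-Z]{3}")   # exactly three ASCII letters
non_space = re.compile(r"\S{3}")       # any three non-whitespace characters


def starts_with(pattern, s):
    """True if the compiled pattern matches at the start of s."""
    return pattern.match(s) is not None


samples = ["Jan 13 09:24:01", "123 abc", "--- separator", "  indented"]
for s in samples:
    print(f"{s!r:20} letters={starts_with(letters, s)} "
          f"non_space={starts_with(non_space, s)}")
```

Only the first sample satisfies both patterns; the second and third satisfy \S{3} alone, and the last satisfies neither. So the choice between the two is first a question of which lines should be accepted, and only then one of speed.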

Cheers,
Chris
-- 
Follow the path of the Iguana...
http://rebertia.com

