Relative performance of comparable regular expressions
Hi,

I have a question about the relative performance of comparable regular expressions.

I have large log files whose lines start with three-letter month names (non-Unicode).

Which would give better performance: matching with ^[a-zA-Z]{3}, or with ^\S{3} ?
Also, which is better (if different at all): \d\d or \d{2} ?
Also, would matching . be different (performance-wise) from matching the actual character, e.g. matching : ?
And lastly, at the end of a line, is there any performance difference between (.+)$ and (.+) ?

Thanks,
Ron.
--
http://mail.python.org/mailman/listinfo/python-list
Re: Relative performance of comparable regular expressions
On Jan 13, 7:24 pm, Barak, Ron ron.ba...@lsi.com wrote:
> Hi, I have a question about relative performance of comparable regular expressions. I have large log files that start with three-letter month names (non-unicode). Which would give better performance, matching with ^[a-zA-Z]{3}, or with ^\S{3} ?

(1) If you want to match at the start of a line, use re.match() *without* the pointless ^. Don't use re.search with a pattern starting with ^ -- it won't be any faster and it could be a lot worse; re.search doesn't know to stop if the first match fails:

command-prompt\python26\python -m timeit -s "import re;rx=re.compile('^AB');text='Z'*100" "rx.match(text)"
1000000 loops, best of 3: 1.15 usec per loop

command-prompt\python26\python -m timeit -s "import re;rx=re.compile('^AB');text='Z'*100" "rx.search(text)"
100000 loops, best of 3: 4.47 usec per loop

command-prompt\python26\python -m timeit -s "import re;rx=re.compile('^AB');text='Z'*1000" "rx.search(text)"
10000 loops, best of 3: 34.1 usec per loop

(2) I think you mean ^\s{3} not ^\S{3}

(3) Now that you've seen how to do timings, over to you :-)

> Also, which is better (if different at all): \d\d or \d{2} ? Also, would matching . be different (performance-wise) than matching the actual character, e.g. matching : ? And lastly, at the end of a line, is there any performance difference between (.+)$ and (.+)

Cheers,
John
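For the remaining questions, the same timing approach can be scripted rather than typed at the command prompt. The following is only a rough sketch, not anything posted in the thread: the sample log line, the pattern list built around the questions, and the helper names time_match and compare are all assumptions made for illustration, and the numbers it prints will vary by machine and Python version. It follows the advice in (1) by using match() with no leading ^.

```python
import re
import timeit

# Hypothetical sample line; the real log format is not shown in the thread.
LINE = "Dec 12 2009 10:15:42 host daemon[123]: something happened"

# Pattern pairs built around the questions asked in the thread.
# No leading ^ is needed because match() only tries position 0.
PATTERNS = [
    r"[a-zA-Z]{3}",
    r"\S{3}",
    r"[a-zA-Z]{3} \d\d",
    r"[a-zA-Z]{3} \d{2}",
    r"[a-zA-Z]{3} \d\d \d{4} \d\d.\d\d",
    r"[a-zA-Z]{3} \d\d \d{4} \d\d:\d\d",
    r"[a-zA-Z]{3}(.+)$",
    r"[a-zA-Z]{3}(.+)",
]


def time_match(pattern, text, number=100000):
    """Return the seconds taken by `number` calls of rx.match(text)."""
    rx = re.compile(pattern)
    return timeit.timeit(lambda: rx.match(text), number=number)


def compare(patterns=PATTERNS, text=LINE, number=100000):
    """Time every pattern against the same text; returns {pattern: seconds}."""
    return {p: time_match(p, text, number) for p in patterns}


if __name__ == "__main__":
    for pattern, seconds in compare().items():
        print("%-40s %.4f s" % (pattern, seconds))
```

Timing against a representative sample of real log lines, rather than one made-up line, would give a more trustworthy comparison.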
Re: Relative performance of comparable regular expressions
John Machin wrote:
> On Jan 13, 7:24 pm, Barak, Ron ron.ba...@lsi.com wrote:
>> I have large log files that start with three-letter month names (non-unicode).

Of course, if the log strings all begin with a string like Dec 12 2009, then you don't need regular expressions at all - just pull the characters out using their positions and slicing. The month would be string[0:3] and so on.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
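Steve's slicing suggestion might look roughly like the sketch below. It assumes every line starts with a fixed-width prefix of exactly the form "Dec 12 2009" (three-letter month, two-digit day, four-digit year, single spaces); that layout and the function name split_date_prefix are assumptions for illustration, since the thread does not show a real log line.

```python
def split_date_prefix(line):
    """Slice a fixed-width 'Mon DD YYYY' prefix off a log line.

    Assumes the prefix layout is always the same width; a line with a
    one-digit day such as 'Dec 2 2009' would be sliced incorrectly.
    """
    month = line[0:3]          # e.g. 'Dec'
    day = line[4:6]            # e.g. '12'
    year = line[7:11]          # e.g. '2009'
    rest = line[11:].lstrip()  # remainder of the line after the date
    return month, day, year, rest


if __name__ == "__main__":
    print(split_date_prefix("Dec 12 2009 something happened"))
```

Slicing does no validation at all, so it only makes sense when the format is known to be fixed.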
RE: Relative performance of comparable regular expressions
Hi John,

Thanks for the below - teaching me how to fish (instead of just giving me a fish :-)
Now I can definitely get the answers for myself, and also be a bit more enlightened.

As for your remark (2) below, on my question "Which would give better performance, matching with ^[a-zA-Z]{3}, or with ^\S{3} ?":

> (2) I think you mean ^\s{3} not ^\S{3}

I actually did mean to use \S, namely a character that is not whitespace.

Bye,
Ron.

-----Original Message-----
From: John Machin
Sent: Tuesday, January 13, 2009 11:15
To: python-list@python.org
Subject: Re: Relative performance of comparable regular expressions

> (2) I think you mean ^\s{3} not ^\S{3}
> (3) Now that you've seen how to do timings, over to you :-)
Re: Relative performance of comparable regular expressions
On Tue, Jan 13, 2009 at 6:16 AM, Barak, Ron ron.ba...@lsi.com wrote:
> Hi John, Thanks for the below - teaching me how to fish (instead of just giving me a fish :-) Now I could definitely get the answers for myself, and also be a bit more enlightened.
> As for your (2) remark below (on my question: Which would give better performance, matching with ^[a-zA-Z]{3}, or with ^\S{3} ?): (2) I think you mean ^\s{3} not ^\S{3}, I actually did mean to use \S, namely - a character that is not a white-space.

(A) Please don't top-post; it makes replying to you more awkward and makes it harder for readers to follow the conversation.

(B) But ^[a-zA-Z]{3} and ^\S{3} aren't even equivalent! \S allows *digits* and *punctuation* too, whereas the former matches *only* letters.

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
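Chris's point (B), that the two patterns accept different inputs, can be checked directly. This is a small sketch with made-up sample strings and a hypothetical helper named classify; it uses match(), so no leading ^ is needed.

```python
import re

LETTERS = re.compile(r"[a-zA-Z]{3}")
NON_SPACE = re.compile(r"\S{3}")


def classify(text):
    """Return (letters_matched, non_space_matched) for the start of text."""
    return bool(LETTERS.match(text)), bool(NON_SPACE.match(text))


if __name__ == "__main__":
    # Made-up samples: a month name, digits, punctuation, leading spaces.
    for sample in ["Dec 12 2009", "123 abc", "-->arrow", "  indented"]:
        print(repr(sample), classify(sample))
```

Only the month-name sample satisfies both patterns; the digit and punctuation samples satisfy \S{3} alone, so a speed comparison between the two is only meaningful if either is acceptable for the data.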