subject:"Regex Question"

Re: a regex question

2019-10-25 Thread dieter

Maggie Q Roth  writes:
> There are two primary types of lines in the log:
>
> 60.191.38.xx/
> 42.120.161.xx   /archives/1005
>
> I know how to write regex to match each line, but don't get the good result
> with one regex to match both lines.
>
> Can you help?

When I look at these lines, I see two fields separated by whitespace
(note that two example lines are very few from which to guess the
proper pattern). I would not use a regular expression
in this case, but the `split` string method.

A regular expression for this pattern could be `(\S+)\s+(.*)`, which reads:
a non-empty sequence of non-whitespace characters (assigned to group 1),
whitespace, then any sequence (assigned to group 2).
Note that the regular expression above is given on the
regex level; the string in your Python code may look slightly different.
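
Both suggestions can be sketched roughly as follows. The sample lines and the helper names `parse_with_split` and `parse_with_regex` are hypothetical, and the sketch assumes the two fields really are separated by whitespace (the first sample line in the question may have lost its whitespace in transit).

```python
import re

# Hypothetical sample lines; assumes whitespace separates the two fields.
SAMPLE = [
    "60.191.38.xx   /",
    "42.120.161.xx   /archives/1005",
]

FIELDS = re.compile(r"(\S+)\s+(.*)")


def parse_with_split(line):
    """Split on the first run of whitespace into (address, path)."""
    address, path = line.split(None, 1)
    return address, path


def parse_with_regex(line):
    """Same result via the regex; returns None if the line does not fit."""
    m = FIELDS.match(line)
    return m.groups() if m else None


for line in SAMPLE:
    print(parse_with_split(line), parse_with_regex(line))
```

The regex variant returns None for a line without whitespace, whereas the unpacking in the split variant would raise ValueError, so which one fits better depends on how malformed lines should be handled.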

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: a regex question

2019-10-25 Thread Antoon Pardon

On 25/10/19 12:22, Maggie Q Roth wrote:
> Hello
>
> There are two primary types of lines in the log:
>
> 60.191.38.xx/
> 42.120.161.xx   /archives/1005
>
> I know how to write regex to match each line, but don't get the good result
> with one regex to match both lines.

Could you provide the regexes that you have for each line?

-- 
Antoon.

Re: a regex question

2019-10-25 Thread Brian Oney via Python-list




On October 25, 2019 12:22:44 PM GMT+02:00, Maggie Q Roth  
wrote:
>Hello
>
>There are two primary types of lines in the log:
>
>60.191.38.xx/
>42.120.161.xx   /archives/1005
>
>I know how to write regex to match each line, but don't get the good
>result
>with one regex to match both lines.

What is a good result?

There is an re.MULTILINE flag. Did you try that? What does that do?


a regex question

2019-10-25 Thread Maggie Q Roth

Hello

There are two primary types of lines in the log:

60.191.38.xx/
42.120.161.xx   /archives/1005

I know how to write regex to match each line, but don't get the good result
with one regex to match both lines.

Can you help?

Thanks,
Maggie

Re: Regex Question

2012-08-18 Thread Mark Lawrence


On 18/08/2012 06:42, Chris Angelico wrote:

On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti frank.kos...@gmail.com wrote:

Hi,

I'm new to regular expressions. I want to be able to match for tokens
with all their properties in the following examples. I would
appreciate some direction on how to proceed.


<h1>@foo1</h1>
<p>@foo2()</p>
<p>@foo3(anything could go here)</p>


You can find regular expression primers all over the internet - fire
up your favorite search engine and type those three words in. But it
may be that what you want here is a more flexible parser; have you
looked at BeautifulSoup (so rich and green)?

ChrisA



Totally agree with the sentiment.  There's a comparison of python 
parsers here http://nedbatchelder.com/text/python-parsers.html


--
Cheers.

Mark Lawrence.


Re: Regex Question

2012-08-18 Thread Roy Smith

In article 
385e732e-1c02-4dd0-ab12-b92890bbe...@o3g2000yqp.googlegroups.com,
 Frank Koshti frank.kos...@gmail.com wrote:

 I'm new to regular expressions. I want to be able to match for tokens
 with all their properties in the following examples. I would
 appreciate some direction on how to proceed.
 
 
 <h1>@foo1</h1>
 <p>@foo2()</p>
 <p>@foo3(anything could go here)</p>

Don't try to parse HTML with regexes.  Use a real HTML parser, such as 
lxml (http://lxml.de/).

Re: Regex Question

2012-08-18 Thread Steven D'Aprano

On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:

 Hi,
 
 I'm new to regular expressions. I want to be able to match for tokens
 with all their properties in the following examples. I would appreciate
 some direction on how to proceed.

Others have already given you excellent advice to NOT use regular 
expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes, 
I'm going to ignore that advice and show you how to abuse regexes to 
search for text, and pretend that they aren't HTML tags.

Here's your string you want to search for:

    <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by
any alphanumeric characters, followed by "</h1>".


We start by compiling a regex:

import re
pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)


First we import the re module. Then we define a pattern string. Note that 
I use a raw string instead of a regular string -- this is not 
compulsory, but it is very common.

The difference between a raw string and a regular string is how they
handle backslashes. In Python, some (but not all!) backslashes are
special. For example, the regular string "\n" is not two characters,
backslash-n, but a single character, Newline. The Python string parser
converts backslash combinations to special characters, e.g.:

\n = newline
\t = tab
\0 = ASCII Null character
\\ = a single backslash
etc.

We often call these backslash escapes.

Regular expressions use a lot of backslashes, and so it is useful to
disable the interpretation of backslash escapes when writing regex
patterns. We do that with a "raw string" -- if you prefix the string with
the letter r, the string is raw and backslash-escapes are ignored:

# ordinary "cooked" string:
"abc\n" = a b c newline

# raw string
r"abc\n" = a b c backslash n
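
The cooked/raw distinction can be checked directly in the interpreter. This is a small sketch; the variable names `cooked` and `raw` are only illustrative.

```python
cooked = "abc\n"   # backslash-n is interpreted: one newline character
raw = r"abc\n"     # raw string: a backslash followed by the letter n

print(len(cooked), repr(cooked))
print(len(raw), repr(raw))

# A raw string is only a notation; doubling the backslash gives the same value.
print(raw == "abc\\n")
```

The raw prefix changes how the source text is read, not the type of the resulting object: both values are ordinary strings.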


Here is our pattern again:

pattern = r"<h1>@\w+</h1>"

which is thirteen characters:

less-than h 1 greater-than at-sign backslash w plus-sign less-than slash 
h 1 greater-than

Most of the characters shown just match themselves. For example, the @ 
sign will only match another @ sign. But some have special meaning to the 
regex:

\w doesn't match "backslash w", but any alphanumeric character (or an underscore);

+ doesn't match a plus sign, but tells the regex to match the previous 
symbol one or more times. Since it immediately follows \w, this means 
match at least one alphanumeric character.

Now we feed that string into the re.compile, to create a pre-compiled 
regex. (This step is optional: any function which takes a compiled regex 
will also accept a string pattern. But pre-compiling regexes which you 
are going to use repeatedly is a good idea.)

regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I which is a special 
value that tells the regular expression to ignore case, so h will match 
both h and H.

Now on to use the regex. Here's a bunch of text to search:

text = """Now is the time for all good men blah blah blah <h1>spam</h1>
and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah
"""

And we search it this way:

mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular
expression finds something that matches your pattern. If nothing matches,
then None is returned instead.

if mo is not None:
    print(mo.group(0))

=> prints <h1>@victory</h1>

So far so good. But we can do better. In this case, we don't really care
about the tags <h1>, we only care about the "victory" part. Here's how to
use grouping to extract substrings from the regex:

pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
regex = re.compile(pattern, re.I)
mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))

This prints:

<h1>@victory</h1>
victory


Hope this helps.


-- 
Steven

Re: Regex Question

2012-08-18 Thread Frank Koshti

I think the point was missed. I don't want to use an XML parser. The
point is to pick up those tokens, and yes I've done my share of RTFM.
This is what I've come up with:

'\$\w*\(?.*?\)'

Which doesn't work well on the above example, which is partly why I
reached out to the group. Can anyone help me with the regex?

Thanks,
Frank

Re: Regex Question

2012-08-18 Thread Frank Koshti

Hey Steven,

Thank you for the detailed (and well-written) tutorial on this very
issue. I actually learned a few things! Though, I still have
unresolved questions.

The reason I don't want to use an XML parser is because the tokens are
not always placed in HTML, and even in HTML, they may appear in
strange places, such as <h1> $foo(x=3)Hello</h1>. My specific issue is
I need to match, process and replace $foo(x=3), knowing that (x=3) is
optional, and the token might appear simply as $foo.

To do this, I decided to use:

re.compile('\$\w*\(?.*?\)').findall(mystring)

the issue with this is it doesn't match $foo by itself, and requires
there to be () at the end.

Thanks,
Frank

Re: Regex Question

2012-08-18 Thread Peter Otten

Frank Koshti wrote:

 I need to match, process and replace $foo(x=3), knowing that (x=3) is
 optional, and the token might appear simply as $foo.
 
 To do this, I decided to use:
 
 re.compile('\$\w*\(?.*?\)').findall(mystring)
 
 the issue with this is it doesn't match $foo by itself, and requires
 there to be () at the end.

>>> s = """
... <h1>$foo1</h1>
... <p>$foo2()</p>
... <p>$foo3(anything could go here)</p>
... """
>>> re.compile(r"(\$\w+(?:\(.*?\))?)").findall(s)
['$foo1', '$foo2()', '$foo3(anything could go here)']



Re: Regex Question

2012-08-18 Thread Vlastimil Brom

2012/8/18 Frank Koshti frank.kos...@gmail.com:
 Hey Steven,

 Thank you for the detailed (and well-written) tutorial on this very
 issue. I actually learned a few things! Though, I still have
 unresolved questions.

 The reason I don't want to use an XML parser is because the tokens are
 not always placed in HTML, and even in HTML, they may appear in
 strange places, such as h1 $foo(x=3)Hello/h1. My specific issue is
 I need to match, process and replace $foo(x=3), knowing that (x=3) is
 optional, and the token might appear simply as $foo.

 To do this, I decided to use:

 re.compile('\$\w*\(?.*?\)').findall(mystring)

 the issue with this is it doesn't match $foo by itself, and requires
 there to be () at the end.

 Thanks,
 Frank
 --
 http://mail.python.org/mailman/listinfo/python-list

Hi,
Although I don't quite get the pattern you are using (with respect to
the specified task), you most likely need raw string syntax for the
pattern, e.g. r"..." instead of "...", or you have to double all
backslashes (which should be escaped), i.e. \\w etc.

I am likely misunderstanding the specification, as the following:
>>> re.sub(r"\$foo\(x=3\)", "bar", "<h1> $foo(x=3)Hello</h1>")
'<h1> barHello</h1>'

is probably not the desired output.

For some kind of processing of the matched text, you can also pass a
replacement function instead of a replacement pattern to re.sub.
See
http://docs.python.org/library/re.html#re.sub
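
A replacement function for re.sub might look like the sketch below. The names `TOKEN` and `expand` and the bracketed output format are hypothetical, chosen only to show the "match, process and replace" flow; the pattern is a variation on the ones posted in this thread, with the name and the optional argument text captured separately.

```python
import re

# Token: "$name" optionally followed by "(...)"; the two parts are captured
# separately so the replacement function can inspect them.
TOKEN = re.compile(r"\$(\w+)(?:\((.*?)\))?")


def expand(match):
    """Hypothetical processing step: render the token as a bracketed note."""
    name = match.group(1)
    args = match.group(2)  # None when the token has no parentheses
    if args is None:
        return "[%s]" % name
    return "[%s: %s]" % (name, args)


text = "<h1> $foo(x=3)Hello</h1> and plain $bar here"
print(TOKEN.sub(expand, text))
```

re.sub calls the function once per match and substitutes whatever string it returns, so the processing can be as elaborate as needed without complicating the pattern.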

hth,
  vbr

Re: Regex Question

2012-08-18 Thread Frank Koshti

On Aug 18, 11:48 am, Peter Otten __pete...@web.de wrote:
 Frank Koshti wrote:
  I need to match, process and replace $foo(x=3), knowing that (x=3) is
  optional, and the token might appear simply as $foo.

  To do this, I decided to use:

  re.compile('\$\w*\(?.*?\)').findall(mystring)

  the issue with this is it doesn't match $foo by itself, and requires
  there to be () at the end.
  s = 

 ... h1$foo1/h1
 ... p$foo2()/p
 ... p$foo3(anything could go here)/p
 ...  re.compile((\$\w+(?:\(.*?\))?)).findall(s)

 ['$foo1', '$foo2()', '$foo3(anything could go here)']

PERFECT-

Re: Regex Question

2012-08-18 Thread Jussi Piitulainen

Frank Koshti writes:

 not always placed in HTML, and even in HTML, they may appear in
 strange places, such as h1 $foo(x=3)Hello/h1. My specific issue
 is I need to match, process and replace $foo(x=3), knowing that
 (x=3) is optional, and the token might appear simply as $foo.
 
 To do this, I decided to use:
 
 re.compile('\$\w*\(?.*?\)').findall(mystring)
 
 the issue with this is it doesn't match $foo by itself, and requires
 there to be () at the end.

Adding a ? after the meant-to-be-optional expression would let the
regex engine know what you want. You can also separate the mandatory
and the optional part in the regex to receive pairs as matches. The
test program below prints this:

$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc/htm
('$foo', '')
('$foo', '(bar=3)')
('$foo', '($)')
('$foo', '')
('$bar', '(v=0)')

Here is the program:

import re

def grab(text):
    p = re.compile(r'([$]\w+)([(][^()]+[)])?')
    return re.findall(p, text)

def test(html):
    print(html)
    for hit in grab(html):
        print(hit)

if __name__ == '__main__':
    test('$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc/htm')

Re: Regex Question

2012-08-18 Thread python

Steven,

Well done!!!

Regards,
Malcolm

Re: Regex Question

2012-08-18 Thread Frank Koshti

On Aug 18, 12:22 pm, Jussi Piitulainen jpiit...@ling.helsinki.fi
wrote:
 Frank Koshti writes:
  not always placed in HTML, and even in HTML, they may appear in
  strange places, such as h1 $foo(x=3)Hello/h1. My specific issue
  is I need to match, process and replace $foo(x=3), knowing that
  (x=3) is optional, and the token might appear simply as $foo.

  To do this, I decided to use:

  re.compile('\$\w*\(?.*?\)').findall(mystring)

  the issue with this is it doesn't match $foo by itself, and requires
  there to be () at the end.

 Adding a ? after the meant-to-be-optional expression would let the
 regex engine know what you want. You can also separate the mandatory
 and the optional part in the regex to receive pairs as matches. The
 test program below prints this:

 $foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc/htm

 ('$foo', '')
 ('$foo', '(bar=3)')
 ('$foo', '($)')
 ('$foo', '')
 ('$bar', '(v=0)')

 Here is the program:

 import re

 def grab(text):
     p = re.compile(r'([$]\w+)([(][^()]+[)])?')
     return re.findall(p, text)

 def test(html):
     print(html)
     for hit in grab(html):
         print(hit)

 if __name__ == '__main__':
     test('$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc/htm')

You read my mind. I didn't even know that's possible. Thank you-

Regex Question

2012-08-17 Thread Frank Koshti

Hi,

I'm new to regular expressions. I want to be able to match for tokens
with all their properties in the following examples. I would
appreciate some direction on how to proceed.


<h1>@foo1</h1>
<p>@foo2()</p>
<p>@foo3(anything could go here)</p>


Thanks-
Frank

Re: Regex Question

2012-08-17 Thread Chris Angelico

On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti frank.kos...@gmail.com wrote:
 Hi,

 I'm new to regular expressions. I want to be able to match for tokens
 with all their properties in the following examples. I would
 appreciate some direction on how to proceed.


 <h1>@foo1</h1>
 <p>@foo2()</p>
 <p>@foo3(anything could go here)</p>

You can find regular expression primers all over the internet - fire
up your favorite search engine and type those three words in. But it
may be that what you want here is a more flexible parser; have you
looked at BeautifulSoup (so rich and green)?

ChrisA

regex question

2011-07-29 Thread rusi

Can someone throw some light on this anomalous behavior?

>>> import re
>>> r = re.search('a(b+)', 'ababbaaab')

>>> r.group(1)
'b'
>>> r.group(0)
'ab'
>>> r.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

>>> re.findall('a(b+)', 'ababbaaab')
['b', 'bb', 'b']

So evidently group counts by number of '()'s and not by number of
matches (and this is the case whether one uses match or search). So
then what's the point of search-ing vs match-ing?

Or equivalently, how does one move to the groups of the next match?

[Side note: The docstrings for this really suck:

>>> help(r.group)
Help on built-in function group:

group(...)



Re: regex question

2011-07-29 Thread Thomas Jollans

On 29/07/11 16:53, rusi wrote:
 Can someone throw some light on this anomalous behavior?

 import re
 r = re.search('a(b+)', 'ababbaaab')
 r.group(1)
 'b'
 r.group(0)
 'ab'
 r.group(2)
 Traceback (most recent call last):
   File stdin, line 1, in module
 IndexError: no such group

 re.findall('a(b+)', 'ababbaaab')
 ['b', 'bb', 'b']

 So evidently group counts by number of '()'s and not by number of
 matches (and this is the case whether one uses match or search). So
 then whats the point of search-ing vs match-ing?

 Or equivalently how to move to the groups of the next match in?

 [Side note: The docstrings for this really suck:

 help(r.group)
 Help on built-in function group:

 group(...)


Pretty standard regex behaviour: Group 1 is the first pair of brackets.
Group 2 is the second, etc. pp. Group 0 is the whole match.
The difference between matching and searching is that match assumes that
the start of the regex coincides with the start of the string (and this
is documented in the library docs IIRC). re.match(exp, s) is equivalent
to re.search('^'+exp, s). (if not exp.startswith('^'))

Apparently, findall() returns the content of the first group if there is
one. I didn't check this, but I assume it is documented.
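
The match/search difference described above can be tried out with a short sketch, reusing the string from the question (the variable name `s` is only illustrative):

```python
import re

s = "ababbaaab"

# match() only succeeds if the pattern matches at position 0.
print(re.match("b+", s))           # the string starts with "a", so no match
# search() scans forward and reports the first match anywhere.
print(re.search("b+", s).span())

# Group numbering is the same for both: 0 is the whole match, 1 the first ().
m = re.match("a(b+)", s)
print(m.group(0), m.group(1))
```

One caveat to the '^' equivalence: with re.MULTILINE, '^' also matches after every newline, while re.match still only tries the very start of the string, so anchoring with \A is the closer counterpart.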

 - Thomas

Re: regex question

2011-07-29 Thread MRAB


On 29/07/2011 16:45, Thomas Jollans wrote:

On 29/07/11 16:53, rusi wrote:

Can someone throw some light on this anomalous behavior?


import re
r = re.search('a(b+)', 'ababbaaab')
r.group(1)

'b'

r.group(0)

'ab'

r.group(2)

Traceback (most recent call last):
   File stdin, line 1, inmodule
IndexError: no such group


re.findall('a(b+)', 'ababbaaab')

['b', 'bb', 'b']

So evidently group counts by number of '()'s and not by number of
matches (and this is the case whether one uses match or search). So
then whats the point of search-ing vs match-ing?

Or equivalently how to move to the groups of the next match in?

[Side note: The docstrings for this really suck:


help(r.group)

Help on built-in function group:

group(...)



Pretty standard regex behaviour: Group 1 is the first pair of brackets.
Group 2 is the second, etc. pp. Group 0 is the whole match.
The difference between matching and searching is that match assumes that
the start of the regex coincides with the start of the string (and this
is documented in the library docs IIRC). re.match(exp, s) is equivalent
to re.search('^'+exp, s). (if not exp.startswith('^'))

Apparently, findall() returns the content of the first group if there is
one. I didn't check this, but I assume it is documented.


findall returns a list of tuples (what the groups captured) if there is
more than 1 group, or a list of strings (what the group captured) if
there is 1 group, or a list of strings (what the regex matched) if
there are no groups.
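
The three findall cases can be seen side by side in the sketch below, together with re.finditer, which is one way to reach the groups of each successive match. The string is the one from the original question.

```python
import re

s = "ababbaaab"

print(re.findall(r"ab+", s))        # no groups: whole matches
print(re.findall(r"a(b+)", s))      # one group: that group's text
print(re.findall(r"(a)(b+)", s))    # two groups: tuples of group texts

# finditer() yields one match object per match, so group() and span()
# are available for every hit, not only the first.
for m in re.finditer(r"a(b+)", s):
    print(m.span(), m.group(0), m.group(1))
```

finditer is usually the more flexible choice when both the whole match and its groups are needed, since findall drops the whole match as soon as the pattern contains a group.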

Re: regex question

2011-07-29 Thread Rustom Mody

MRAB wrote:
 findall returns a list of tuples (what the groups captured) if there is
more than 1 group,
 or a list of strings (what the group captured) if there is 1 group, or a
list of
 strings (what the regex matched) if there are no groups.

Thanks.
It would be good to put this in the manual, don't you think?

Also, the manual says in the 'match' section

Note If you want to locate a match anywhere in *string*, use search() instead.

to guard against users using match when they should be using search.

Likewise it would be helpful if the manual also said (in the match,search
sections)
If more than one match/search is required use findall

Re: regex question

2011-07-29 Thread Thomas Jollans

On 29/07/11 19:52, Rustom Mody wrote:
 MRAB wrote:
  findall returns a list of tuples (what the groups captured) if there
 is more than 1 group,
  or a list of strings (what the group captured) if there is 1 group,
 or a list of
  strings (what the regex matched) if there are no groups.

 Thanks.
 It would be good to put this in the manual dont you think?
It is in the manual.

 Also, the manual says in the 'match' section

 Note If you want to locate a match anywhere in /string/, use search()
 instead.

 to guard against users using match when they should be using search.

 Likewise it would be helpful if the manual also said (in the
 match,search sections)
 If more than one match/search is required use findall




Re: regex question on .findall and \b

2009-07-06 Thread Ethan Furman

Many thanks to all who replied!  And, yes, I will *definitely* use raw 
strings from now on.  :)


~Ethan~

regex question on .findall and \b

2009-07-02 Thread Ethan Furman


Greetings!

My closest to successful attempt:

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
Type copyright, credits or license for more information.

IPython 0.9.1 -- An enhanced Interactive Python.

  In [161]: re.findall('\d+','this is test a3 attempt 79')
  Out[161]: ['3', '79']

What I really want is just the 79, as a3 is not a decimal number, but 
when I add the \b word boundaries I get:


  In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
  Out[162]: []

What am I missing?

~Ethan~

Re: regex question on .findall and \b

2009-07-02 Thread Tim Chase


Ethan Furman wrote:

Greetings!

My closest to successfull attempt:

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
Type copyright, credits or license for more information.

IPython 0.9.1 -- An enhanced Interactive Python.

   In [161]: re.findall('\d+','this is test a3 attempt 79')
   Out[161]: ['3', '79']

What I really want in just the 79, as a3 is not a decimal number, but 
when I add the \b word boundaries I get:


   In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
   Out[162]: []

What am I missing?


The sneaky detail that the regexp should be in a raw string 
(always a good practice), not a cooked string:


  r'\b\d+\b'

The \d isn't a valid character-expansion, so python leaves it 
alone.  However, I believe the \b is a control character, so 
your actual string ends up something like:


  >>> print repr('\b\d+\b')
  '\x08\\d+\x08'
  >>> print repr(r'\b\d+\b')
  '\\b\\d+\\b'

the first of which doesn't match your target string, as you might 
imagine.
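
The same effect can be reproduced in Python 3 with the sketch below. The variable names are illustrative, and the cooked pattern spells the digit class as "\\d" only to avoid the invalid-escape warning that newer Python versions emit for "\d"; its value is the same string as in the session above.

```python
import re

text = "this is test a3 attempt 79"

cooked = "\b\\d+\b"   # "\b" in an ordinary string is a backspace (0x08)
raw = r"\b\d+\b"      # raw string: the regex engine sees \b, a word boundary

print(repr(cooked), re.findall(cooked, text))
print(repr(raw), re.findall(raw, text))
```

The cooked pattern asks for digits between two literal backspace characters, which the text does not contain, while the raw pattern asks for digits between word boundaries.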


-tkc




Re: regex question on .findall and \b

2009-07-02 Thread Sjoerd Mullender


On 2009-07-02 18:38, Ethan Furman wrote:

Greetings!

My closest to successfull attempt:

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)]
Type copyright, credits or license for more information.

IPython 0.9.1 -- An enhanced Interactive Python.

In [161]: re.findall('\d+','this is test a3 attempt 79')
Out[161]: ['3', '79']

What I really want in just the 79, as a3 is not a decimal number, but
when I add the \b word boundaries I get:

In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
Out[162]: []

What am I missing?

~Ethan~


Try this:
>>> re.findall(r'\b\d+\b','this is test a3 attempt 79')
['79']

In an ordinary string, \b is a backspace; by using a raw string you get
an actual backslash and b.


--
Sjoerd Mullender

Re: regex question on .findall and \b

2009-07-02 Thread Nobody

On Thu, 02 Jul 2009 09:38:56 -0700, Ethan Furman wrote:

 Greetings!
 
 My closest to successfull attempt:
 
 Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
 Type copyright, credits or license for more information.
 
 IPython 0.9.1 -- An enhanced Interactive Python.
 
In [161]: re.findall('\d+','this is test a3 attempt 79')
Out[161]: ['3', '79']
 
 What I really want in just the 79, as a3 is not a decimal number, but 
 when I add the \b word boundaries I get:
 
In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
Out[162]: []
 
 What am I missing?

You need to use a raw string (r'...') to prevent \b from being interpreted
as a backspace:

re.findall(r'\b\d+\b','this is test a3 attempt 79')

\d isn't a recognised escape sequence, so it doesn't get interpreted:

>>> print '\b'
   ^H
>>> print '\d'
\d
>>> print r'\b'
\b

Try to get into the habit of using raw strings for regexps.


Re: regex question on .findall and \b

2009-07-02 Thread Ethan Furman


Ethan Furman wrote:

Greetings!

My closest to successfull attempt:

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit 
(Intel)]

Type copyright, credits or license for more information.

IPython 0.9.1 -- An enhanced Interactive Python.

  In [161]: re.findall('\d+','this is test a3 attempt 79')
  Out[161]: ['3', '79']

What I really want in just the 79, as a3 is not a decimal number, but 
when I add the \b word boundaries I get:


  In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
  Out[162]: []

What am I missing?

~Ethan~



ARGH!!

Okay, I need two \\ so I'm not trying to match a backspace.  I knew 
(okay, hoped ;) I would figure it out once I posted the question and 
moved on.


*sheepish grin*


Python Regex Question

2008-10-29 Thread MalteseUnderdog


Hi there I just started python (but this question isn't that trivial
since I couldn't find it in google :) )

I have the following text file entries (simplified)

start  #frag 1 start
x=Dog # frag 1 end
stop
start# frag 2 start
x=Cat # frag 2 end
stop
start #frag 3 start
x=Dog #frag 3 end
stop


I need a regex expression which returns the start to the x=ANIMAL for
only the x=Dog fragments so all my entries should be start ...
(something here) ... x=Dog .  So I am really interested in fragments 1
and 3 only.

My idea (primitive) ^start.*?x=Dog doesn't work because clearly it
would return results

start
x=Dog  # (good)

and

start
x=Cat
stop
start
x=Dog # bad since I only want start ... x=Dog portion

Can you help me ?

Thanks
JP, Malta.

Re: Python Regex Question

2008-10-29 Thread Tim Chase


I need a regex expression which returns the start to the x=ANIMAL for
only the x=Dog fragments so all my entries should be start ...
(something here) ... x=Dog .  So I am really interested in fragments 1
and 3 only.

My idea (primitive) ^start.*?x=Dog doesn't work because clearly it
would return results

start
x=Dog  # (good)

and

start
x=Cat
stop
start
x=Dog # bad since I only want start ... x=Dog portion


Looks like the following does the trick:

>>> s = """start  #frag 1 start
... x=Dog # frag 1 end
... stop
... start# frag 2 start
... x=Cat # frag 2 end
... stop
... start #frag 3 start
... x=Dog #frag 3 end
... stop"""
>>> import re
>>> r = re.compile(r'^start.*\nx=Dog.*\nstop.*', re.MULTILINE)
>>> for i, result in enumerate(r.findall(s)):
...     print i, repr(result)
...
0 'start  #frag 1 start\nx=Dog # frag 1 end\nstop'
1 'start #frag 3 start\nx=Dog #frag 3 end\nstop'

-tkc








Re: Python Regex Question

2008-10-29 Thread Arnaud Delobelle

On Oct 29, 7:01 pm, Tim Chase [EMAIL PROTECTED] wrote:
  I need a regex expression which returns the start to the x=ANIMAL for
  only the x=Dog fragments so all my entries should be start ...
  (something here) ... x=Dog .  So I am really interested in fragments 1
  and 3 only.

  My idea (primitive) ^start.*?x=Dog doesn't work because clearly it
  would return results

  start
  x=Dog  # (good)

  and

  start
  x=Cat
  stop
  start
  x=Dog # bad since I only want start ... x=Dog portion

 Looks like the following does the trick:

   s = start      #frag 1 start
 ... x=Dog # frag 1 end
 ... stop
 ... start    # frag 2 start
 ... x=Cat # frag 2 end
 ... stop
 ... start     #frag 3 start
 ... x=Dog #frag 3 end
 ... stop
   import re
   r = re.compile(r'^start.*\nx=Dog.*\nstop.*', re.MULTILINE)
   for i, result in enumerate(r.findall(s)):
 ...     print i, repr(result)
 ...
 0 'start      #frag 1 start\nx=Dog # frag 1 end\nstop'
 1 'start     #frag 3 start\nx=Dog #frag 3 end\nstop'

 -tkc

This will only work if 'x=Dog' directly follows 'start' (which happens
in the given example).  If that's not necessarily the case, I would do
it in two steps (in fact I wouldn't use regexps probably but...):

>>> for chunk in re.split(r'\nstop', data):
...     m = re.search('^start.*^x=Dog', chunk, re.DOTALL | re.MULTILINE)
...     if m: print repr(m.group())
...
'start  #frag 1 start \nx=Dog'
'start #frag 3 start \nx=Dog'
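
For readers on Python 3, the two-step idea can be written as a small function. This is a sketch only: the name `dog_fragments` is hypothetical, and the sample data is copied from the question on the assumption that every fragment ends with a "stop" line.

```python
import re

# Sample data copied from the question; "stop" ends each fragment.
data = """start  #frag 1 start
x=Dog # frag 1 end
stop
start# frag 2 start
x=Cat # frag 2 end
stop
start #frag 3 start
x=Dog #frag 3 end
stop
"""


def dog_fragments(text):
    """Return the 'start ... x=Dog' part of each fragment that has one."""
    found = []
    # Step 1: cut the text into fragments at each "stop" line.
    for chunk in re.split(r"\nstop", text):
        # Step 2: search inside a single fragment only.
        m = re.search(r"^start.*^x=Dog", chunk, re.DOTALL | re.MULTILINE)
        if m:
            found.append(m.group())
    return found


for fragment in dog_fragments(data):
    print(repr(fragment))
```

Because each search sees only one fragment, the greedy `.*` cannot run across a "stop" line into the next fragment, which was the problem with the single-pattern attempt.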

--
Arnaud


Re: Python Regex Question

2008-10-29 Thread Terry Reedy


MalteseUnderdog wrote:

Hi there I just started python (but this question isn't that trivial
since I couldn't find it in google :) )

I have the following text file entries (simplified)

start  #frag 1 start
x=Dog # frag 1 end
stop
start# frag 2 start
x=Cat # frag 2 end
stop
start #frag 3 start
x=Dog #frag 3 end
stop


I need a regex expression which returns the start to the x=ANIMAL for
only the x=Dog fragments so all my entries should be start ...
(something here) ... x=Dog .  So I am really interested in fragments 1
and 3 only.


As I understand the above
I would first write a generator that separates the file into fragments 
and yields them one at a time.  Perhaps something like


def fragments(ifile):
    frag = []
    for line in ifile:
        frag.append(line)
        if line.startswith('stop'):  # assumed end-of-fragment test
            yield ''.join(frag)
            frag = []

Then I would iterate through fragments, testing for the ones I want:

for frag in fragments(somefile):
    if 'x=Dog' in frag:
        ...  # do whatever
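
A regex-free variant of this generator approach is sketched below. It assumes that a line starting with "stop" ends a fragment; the helper `start_to_dog`, the `end_marker` parameter and the in-memory sample file are hypothetical additions for illustration.

```python
import io


def fragments(lines, end_marker="stop"):
    """Yield each fragment as a list of lines, ending at the marker line."""
    frag = []
    for line in lines:
        frag.append(line)
        if line.startswith(end_marker):
            yield frag
            frag = []


def start_to_dog(frag):
    """Return the lines from 'start' up to the first 'x=Dog' line, else None."""
    for i, line in enumerate(frag):
        if line.startswith("x=Dog"):
            return frag[: i + 1]
    return None


sample = io.StringIO(
    "start  #frag 1 start\nx=Dog # frag 1 end\nstop\n"
    "start# frag 2 start\nx=Cat # frag 2 end\nstop\n"
    "start #frag 3 start\nx=Dog #frag 3 end\nstop\n"
)

for frag in fragments(sample):
    wanted = start_to_dog(frag)
    if wanted:
        print("".join(wanted), end="")
```

Keeping each fragment as a list of lines makes the "start up to x=Dog" cut a simple slice, and the generator reads the file lazily, one fragment at a time.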

Terry Jan Reedy


Re: Python regex question

2008-08-15 Thread Tim N. van der Leeuw


Hey Gerhard,


Gerhard Häring wrote:
 
 Tim van der Leeuw wrote:
 Hi,
 
 I'm trying to create a regular expression for matching some particular 
 XML strings. I want to extract the contents of a particular XML tag, 
 only if it follows one tag, but not follows another tag. Complicating 
 this, is that there can be any number of other tags in between. [...]
 
 Sounds like this would be easier to implement using Python's SAX API.
 
 Here's a short example that does something similar to what you want to 
 achieve:
 
 [...]
 

I so far forgot to say a thank you for the suggestion :-)

The sample code as you sent it doesn't do what I need to do, but I did look
at it for creating SAX handler code that does what I want.

It took me a while to implement, as it didn't fit in the parser-engine I had
and I was close to making a release.

But still: thanks!

--Tim


Re: regex question

2008-08-06 Thread Tobiah

On Tue, 05 Aug 2008 15:55:46 +0100, Fred Mangusta wrote:

 Chris wrote:
 
 Doesn't work for his use case as he wants to keep periods marking the
 end of a sentence.

Doesn't it?  The period has to be surrounded by digits in the
example solution, so wouldn't periods followed by a space
(end of sentence) always make it through?




regex question

2008-08-05 Thread Fred Mangusta


Hi,

I would like to delete all the instances of a '.' within a number.

In other words I'd like to replace all the instances of a '.' character 
with something (say nothing at all) when the '.' is representing a 
decimal separator. E.g.


500.675     500675

but also

1.000.456.344  1000456344

I don't care about the fact that the resulting number is difficult to 
read: as long as it remains a series of digits it's ok: the important 
thing is to get rid of the period, because I want to keep it only where 
it marks the end of a sentence.


I was trying to do like this

s=re.sub([(\d+)(\.)(\d+)],... ,s)

but I don't know much about regular expressions, and don't know how to 
get the two groups of numbers and join them in the sub. Moreover doing 
like this I only match things like 345.000 and not 1.000.000.


What's the correct approach?

Thanks
F.
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Marc 'BlackJack' Rintsch

On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:

 In other words I'd like to replace all the instances of a '.' character
 with something (say nothing at all) when the '.' is representing a
 decimal separator. E.g.
 
 500.675     500675
 
 but also
 
 1.000.456.344  1000456344
 
 I don't care about the fact the the resulting number is difficult to
 read: as long as it remains a series of digits it's ok: the important
 thing is to get rid of the period, because I want to keep it only where
 it marks the end of a sentence.
 
 I was trying to do like this
 
 s=re.sub([(\d+)(\.)(\d+)],... ,s)
 
 but I don't know much about regular expressions, and don't know how to
 get the two groups of numbers and join them in the sub. Moreover doing
 like this I only match things like 345.000 and not 1.000.000.
 
 What's the correct approach?

In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
Out[13]: '1000456344'

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Alexei Zankevich

No, that approach is flawed: the example doesn't handle an arbitrary
number of number.number... blocks (single-digit groups such as 1.2.3
slip through).
But the Python regexp engine supports lookahead (?=pattern) and
lookbehind (?<=pattern).
In those cases the patterns are not included in the replaced sequence of
characters:
>>> re.sub(r'(?<=\d)\.(?=\d)', '', '1234.324 abc.100.abc abc.abc')
'1234324 abc.100.abc abc.abc'

Alexey

On Tue, Aug 5, 2008 at 2:10 PM, Marc 'BlackJack' Rintsch [EMAIL 
PROTECTED]wrote:

 On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:

  In other words I'd like to replace all the instances of a '.' character
  with something (say nothing at all) when the '.' is representing a
  decimal separator. E.g.
 
  500.675     500675
 
  but also
 
  1.000.456.344  1000456344
 
  I don't care about the fact the the resulting number is difficult to
  read: as long as it remains a series of digits it's ok: the important
  thing is to get rid of the period, because I want to keep it only where
  it marks the end of a sentence.
 
  I was trying to do like this
 
  s=re.sub([(\d+)(\.)(\d+)],... ,s)
 
  but I don't know much about regular expressions, and don't know how to
  get the two groups of numbers and join them in the sub. Moreover doing
  like this I only match things like 345.000 and not 1.000.000.
 
  What's the correct approach?

 In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
 Out[13]: '1000456344'


 Ciao,
 Marc 'BlackJack' Rintsch
 --
 http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Jeff

On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
 On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
  In other words I'd like to replace all the instances of a '.' character
  with something (say nothing at all) when the '.' is representing a
  decimal separator. E.g.

  500.675         500675

  but also

  1.000.456.344  1000456344

  I don't care about the fact the the resulting number is difficult to
  read: as long as it remains a series of digits it's ok: the important
  thing is to get rid of the period, because I want to keep it only where
  it marks the end of a sentence.

  I was trying to do like this

  s=re.sub([(\d+)(\.)(\d+)],... ,s)

  but I don't know much about regular expressions, and don't know how to
  get the two groups of numbers and join them in the sub. Moreover doing
  like this I only match things like 345.000 and not 1.000.000.

  What's the correct approach?

 In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
 Out[13]: '1000456344'

 Ciao,
         Marc 'BlackJack' Rintsch

Even faster:

'1.000.456.344'.replace('.', '') = '1000456344'
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Alexei Zankevich

=)
Indeed. But it will replace all dots including ordinary strings instead of
numbers only.

On Tue, Aug 5, 2008 at 3:23 PM, Jeff [EMAIL PROTECTED] wrote:

 On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
  On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
   In other words I'd like to replace all the instances of a '.' character
   with something (say nothing at all) when the '.' is representing a
   decimal separator. E.g.
 
   500.675     500675
 
   but also
 
   1.000.456.344  1000456344
 
   I don't care about the fact the the resulting number is difficult to
   read: as long as it remains a series of digits it's ok: the important
   thing is to get rid of the period, because I want to keep it only where
   it marks the end of a sentence.
 
   I was trying to do like this
 
   s=re.sub([(\d+)(\.)(\d+)],... ,s)
 
   but I don't know much about regular expressions, and don't know how to
   get the two groups of numbers and join them in the sub. Moreover doing
   like this I only match things like 345.000 and not 1.000.000.
 
   What's the correct approach?
 
  In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
  Out[13]: '1000456344'
 
  Ciao,
  Marc 'BlackJack' Rintsch

 Even faster:

 '1.000.456.344'.replace('.', '') = '1000456344'
 --
 http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Chris

On Aug 5, 2:23 pm, Jeff [EMAIL PROTECTED] wrote:
 On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:



  On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
   In other words I'd like to replace all the instances of a '.' character
   with something (say nothing at all) when the '.' is representing a
   decimal separator. E.g.

   500.675         500675

   but also

   1.000.456.344  1000456344

   I don't care about the fact the the resulting number is difficult to
   read: as long as it remains a series of digits it's ok: the important
   thing is to get rid of the period, because I want to keep it only where
   it marks the end of a sentence.

   I was trying to do like this

   s=re.sub([(\d+)(\.)(\d+)],... ,s)

   but I don't know much about regular expressions, and don't know how to
   get the two groups of numbers and join them in the sub. Moreover doing
   like this I only match things like 345.000 and not 1.000.000.

   What's the correct approach?

  In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
  Out[13]: '1000456344'

  Ciao,
          Marc 'BlackJack' Rintsch

 Even faster:

 '1.000.456.344'.replace('.', '') = '1000456344'

Doesn't work for his use case as he wants to keep periods marking the
end of a sentence.
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Fred Mangusta


Chris wrote:


Doesn't work for his use case as he wants to keep periods marking the
end of a sentence.


Exactly. Thanks to all of you anyway, now I have a better understanding 
on how to go on :)


F.
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread MRAB

On Aug 5, 11:39 am, Fred Mangusta [EMAIL PROTECTED] wrote:
 Hi,

 I would like to delete all the instances of a '.' into a number.

 In other words I'd like to replace all the instances of a '.' character
 with something (say nothing at all) when the '.' is representing a
 decimal separator. E.g.

 500.675         500675

 but also

 1.000.456.344  1000456344

 I don't care about the fact the the resulting number is difficult to
 read: as long as it remains a series of digits it's ok: the important
 thing is to get rid of the period, because I want to keep it only where
 it marks the end of a sentence.

 I was trying to do like this

 s=re.sub([(\d+)(\.)(\d+)],... ,s)

 but I don't know much about regular expressions, and don't know how to
 get the two groups of numbers and join them in the sub. Moreover doing
 like this I only match things like 345.000 and not 1.000.000.

 What's the correct approach?

I would use look-behind (is it preceded by a digit?) and look-ahead
(is it followed by a digit?):

s = re.sub(r'(?<=\d)\.(?=\d)', '', s)
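
A minimal sketch of the lookaround approach, set against the group-based
substitution from earlier in the thread. The function name
strip_number_dots and the sample strings are made up for illustration;
only re.sub and the two patterns come from the discussion.

```python
import re

def strip_number_dots(text):
    """Remove '.' only where it sits directly between two digits."""
    # (?<=\d) and (?=\d) look at the neighbouring digits without
    # consuming them, so overlapping cases like 1.2.3 are handled.
    return re.sub(r'(?<=\d)\.(?=\d)', '', text)

def strip_with_groups(text):
    """Group-based variant: the digits are consumed by each match."""
    return re.sub(r'(\d)\.(\d)', r'\1\2', text)

print(strip_number_dots('It cost 1.000.456.344 in total. Really.'))
print(strip_number_dots('1.2.3'))
print(strip_with_groups('1.2.3'))
```

The group-based version misses a dot when the digit groups are a single
character long, because the digit that closes one match cannot also
open the next one. Periods followed by a space are left alone by both.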
--
http://mail.python.org/mailman/listinfo/python-list

Python regex question

2008-06-11 Thread Tim van der Leeuw

Hi,

I'm trying to create a regular expression for matching some particular XML
strings. I want to extract the contents of a particular XML tag, but only if
it follows one tag and does not follow another. Complicating this, there can
be any number of other tags in between.

So basically, my regular expression should have 3 parts:
- first match
- any random text, that should not contain string 'Xds'
- second match

I have a problem figuring out how to do the second part: a random bit of
text, that should _not_ contain the substring 'Xds' ('Xds' being the start
of any tags which should not be in between my first and second match).
Because of the variable length of the overall match, I cannot do this with a
negative look-behind assertion, and a negative look-ahead assertion doesn't
seem to work either.

The regular expression that I have now is:

r'(?s)Xds\w*Policy.*?<ref>(?P<pol_ref>\d+)</ref>'

(hopefully without typos)

Here 'Xds\w*Policy' is my first match, and '<ref>(?P<pol_ref>\d+)</ref>'
is my second match.

In this expression, I want to change the generic '.*?', which matches
everything, with something that matches every string that does not include
the substring 'Xds'.

I know that I could capture the text matched by '.*?' and manually check if
it contains that string 'Xds', but that would be very hard to fit into the
rest of the code, for a number of reasons.

Does anyone have an idea how to do this within one regular expression?
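
One idiom that is often suggested for this kind of requirement (it is
not taken from any reply in this thread) is to guard every character of
the gap with a negative look-ahead, so that the gap can never run across
'Xds'. A minimal sketch, assuming the angle brackets and the sample tag
names below, all of which are invented for illustration:

```python
import re

# (?:(?!Xds).)*? consumes one character at a time, and only if the
# text at that position does not start with 'Xds'.
pattern = re.compile(
    r'(?s)<Xds\w*Policy>(?:(?!Xds).)*?<ref>(?P<pol_ref>\d+)</ref>'
)

# Hypothetical inputs: one clean gap, one gap interrupted by an Xds tag.
good = '<XdsFooPolicy><other/><ref>42</ref>'
bad = '<XdsFooPolicy><XdsOther/><ref>42</ref>'

good_match = pattern.search(good)
bad_match = pattern.search(bad)
print(good_match.group('pol_ref') if good_match else None)
print(bad_match)
```

Whether this fits the real documents depends on details not shown here,
so treat it as a starting point rather than a drop-in replacement.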

Regards,

--Tim
--
http://mail.python.org/mailman/listinfo/python-list

Re: Python regex question

2008-06-11 Thread Gerhard Häring


Tim van der Leeuw wrote:

Hi,

I'm trying to create a regular expression for matching some particular 
XML strings. I want to extract the contents of a particular XML tag, 
only if it follows one tag, but not follows another tag. Complicating 
this, is that there can be any number of other tags in between. [...]


Sounds like this would be easier to implement using Python's SAX API.

Here's a short example that does something similar to what you want to 
achieve:


import xml.sax

test_str = """
<xml>
<ignore/>
<foo x="1" y="2"/>
<noignore/>
<foo x="3" y="4"/>
</xml>
"""


class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.ignore_next = False

    def startElement(self, name, attrs):
        if name == "ignore":
            # remember that the next element must be skipped
            self.ignore_next = True
            return
        elif name == "foo":
            if not self.ignore_next:
                # handle the element you're interested in here
                print("MY ELEMENT %s with %s" % (name, dict(attrs)))

        # any element other than "ignore" resets the flag
        self.ignore_next = False

xml.sax.parseString(test_str, MyHandler())

In this case, this looks much clearer and easier to understand to me 
than regular expressions.


-- Gerhard

--
http://mail.python.org/mailman/listinfo/python-list

regex question

2008-02-13 Thread mathieu

I do not understand what is wrong with the following regex expression.
I clearly mark that the separator in between group 3 and group 4
should contain at least 2 white space, but group 3 is actually reading
3 +4

Thanks
-Mathieu

import re

line =   (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
Auto Window Width  SL   1 
patt = re.compile(^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
-]+)\s\s+([A-Za-z0-9 ()._,/#-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
$)
m = patt.match(line)
if m:
  print m.group(3)
  print m.group(4)
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread Wanja Chresta

Hey Mathieu

Due to word wrap I'm not sure what you want to do. What result do you
expect? I get:
>>> print m.groups()
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window
Width  ', ' ', 'SL', '1')
But only when I insert a space in the 3rd char group (I'm not sure if
your original pattern has a space there or not). So the third group is:
([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern does not
match the line.

I also can't see what the format of your line is. If it is like this:
line = ...Siemens: Thorax/Multix FD Lab Settings  Auto Window Width...
where Auto Window Width should be the 4th group, you have to mark the
+ in the 3rd group as non-greedy (it's done with a ?):
http://docs.python.org/lib/re-syntax.html
([A-Za-z0-9./:_ -]+?)
With that I get:
>>> patt.match(line).groups()
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window
Width ', 'SL', '1')
Which is probably what you want. You can also add the non-greedy marker
in the fourth group, to get rid of the trailing spaces.

HTH
Wanja


mathieu wrote:
 I clearly mark that the separator in between group 3 and group 4
 should contain at least 2 white space, but group 3 is actually reading
 3 +4

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread grflanagan

On Feb 13, 1:53 pm, mathieu [EMAIL PROTECTED] wrote:
 I do not understand what is wrong with the following regex expression.
 I clearly mark that the separator in between group 3 and group 4
 should contain at least 2 white space, but group 3 is actually reading
 3 +4

 Thanks
 -Mathieu

 import re

 line =   (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
 Auto Window Width  SL   1 
 patt = re.compile(^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
 -]+)\s\s+([A-Za-z0-9 ()._,/#-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
 $)
 m = patt.match(line)
 if m:
   print m.group(3)
   print m.group(4)


I don't know if it solves your problem, but if you want to match a
literal dash (-) inside a character class, it must be either escaped
or placed first or last in the class.
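
A quick sketch of how Python's re module treats a dash in a character
class. The sample string is made up; the point is only where the dash
sits inside the brackets.

```python
import re

sample = "a-bc"

as_range = re.findall(r"[a-c]", sample)    # dash between two chars: a range
escaped = re.findall(r"[a\-c]", sample)    # escaped dash: literal
dash_first = re.findall(r"[-ac]", sample)  # dash first: literal
dash_last = re.findall(r"[ac-]", sample)   # dash last: literal

print(as_range, escaped, dash_first, dash_last)
```

In the pattern under discussion every dash is already the last
character of its class, so it is read as a literal dash there.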

Gerard
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread bearophileHUGS

mathieu, stop writing complex REs like obfuscated toys, use the
re.VERBOSE flag and split that RE into several commented and
*indented* lines (indented just like Python code), the indentation
level has to be used to denote nesting. With that you may be able to
solve the problem by yourself. If not, you can offer us a much more
readable thing to fix.
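
For illustration, here is roughly what that could look like with
re.VERBOSE. This is only a sketch: the comments guess at what each field
means, the exact whitespace of the sample line is an assumption (the
original was word-wrapped), and the third and fourth groups are written
non-greedy so that they stop at the first run of two or more spaces.

```python
import re

patt = re.compile(r"""
    ^\s*
    \( ([0-9A-Z]+) , ([0-9A-Zx]+) \)   # two hex-like fields in parens
    \s+
    ([A-Za-z0-9./:_ -]+?)              # first text field, non-greedy
    \s\s+                              # at least two whitespace chars
    ([A-Za-z0-9 ()._,/\#-]+?)          # second text field, non-greedy
    \s+
    ([A-Z][A-Z]_?O?W?)                 # two-letter code
    \s+
    ([0-9n-]+)                         # trailing number
    \s*$
    """, re.VERBOSE)

# Assumed spacing; only the double spaces between fields matter.
line = ("  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings   "
        "Auto Window Width  SL   1 ")

m = patt.match(line)
if m:
    print(m.group(3))
    print(m.group(4))
```

Whitespace inside a character class is kept in verbose mode, which is
why the spaces in the two text-field classes still match. The '#' in
the second class is escaped only as a precaution.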

Bye,
bearophile
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread Paul McGuire

On Feb 13, 6:53 am, mathieu [EMAIL PROTECTED] wrote:
 I do not understand what is wrong with the following regex expression.
 I clearly mark that the separator in between group 3 and group 4
 should contain at least 2 white space, but group 3 is actually reading
 3 +4

 Thanks
 -Mathieu

 import re

 line =       (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
 Auto Window Width          SL   1 
 patt = re.compile(^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
 -]+)\s\s+([A-Za-z0-9 ()._,/#-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
 $)
snip

I love the smell of regex'es in the morning!

For more legible posting (and general maintainability), try breaking
up your quoted strings like this:

line = \
"  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings   " \
"Auto Window Width  SL   1 "

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")


Of course, the problem is that you have a greedy match in the part of
the regex that is supposed to stop between Settings and Auto.
Change patt to:

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

or if you prefer:

patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as

  (xx42,xx0A)   Honeywell: Inverse Flitznoid (Kelvin)
80  SL   1


Just out of curiosity, I wondered what a pyparsing version of this
would look like.  See below:

from pyparsing import Word,hexnums,delimitedList,printables,\
    White,Regex,nums

line = \
"  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings   " \
"Auto Window Width  SL   1 "

# define fields
hexint = Word(hexnums+"x")
text = delimitedList(Word(printables),
    delim=White(" ",exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")

# define line structure - give each field a name
line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
    text("desc") + text("window") + type_label("type") + \
    int_label("int")

line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)

Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab
Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings

I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.

-- Paul
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: a newbie regex question

2008-01-25 Thread Max Erickson

Dotan Cohen [EMAIL PROTECTED] wrote:
 Maybe you mean:
 for match in re.finditer(r'\([A-Z].+[a-z])\', contents):
 Note the last backslash was in the wrong place.

The location of the backslash in the original reply is correct; it is 
there to escape the closing paren, which is a special character:

>>> import re
>>> s='Abcd\nabc (Ab), (ab)'
>>> re.findall(r'\([A-Z].+[a-z]\)', s)
['(Ab), (ab)']

Putting the backslash at the end of the string like you indicated 
results in a syntax error, as it escapes the closing single quote of 
the raw string literal: 

>>> re.findall(r'\([A-Z].+[a-z])\', s)
   
SyntaxError: EOL while scanning single-quoted string
 


max


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: a newbie regex question

2008-01-25 Thread Dotan Cohen

On 24/01/2008, Jonathan Gardner [EMAIL PROTECTED] wrote:
 On Jan 24, 12:14 pm, Shoryuken [EMAIL PROTECTED] wrote:
  Given a regular expression pattern, for example, \([A-Z].+[a-z]\),
 
  print out all strings that match the pattern in a file
 
  Anyone tell me a way to do it? I know it's easy, but i'm completely
  new to python
 
  thanks alot

 You may want to read the pages on regular expressions in the online
 documentation: http://www.python.org/doc/2.5/lib/module-re.html

 The simple approach works:

   import re

   # Open the file
   f = file('/your/filename.txt')

   # Read the file into a single string.
   contents = f.read()

   # Find all matches in the string of the regular expression and
 iterate through them.
   for match in re.finditer(r'\([A-Z].+[a-z]\)', contents):
 # Print what was matched
 print match.group()

Maybe you mean:
for match in re.finditer(r'\([A-Z].+[a-z])\', contents):

Note the last backslash was in the wrong place.

Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-- 
http://mail.python.org/mailman/listinfo/python-list

a newbie regex question

2008-01-24 Thread Shoryuken

Given a regular expression pattern, for example, \([A-Z].+[a-z]\),

print out all strings that match the pattern in a file

Anyone tell me a way to do it? I know it's easy, but i'm completely
new to python

thanks alot
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: a newbie regex question

2008-01-24 Thread Jonathan Gardner

On Jan 24, 12:14 pm, Shoryuken [EMAIL PROTECTED] wrote:
 Given a regular expression pattern, for example, \([A-Z].+[a-z]\),

 print out all strings that match the pattern in a file

 Anyone tell me a way to do it? I know it's easy, but i'm completely
 new to python

 thanks alot

You may want to read the pages on regular expressions in the online
documentation: http://www.python.org/doc/2.5/lib/module-re.html

The simple approach works:

  import re

  # Open the file and read it into a single string.
  with open('/your/filename.txt') as f:
      contents = f.read()

  # Find all matches of the regular expression in the string and
  # iterate through them.
  for match in re.finditer(r'\([A-Z].+[a-z]\)', contents):
      # Print what was matched
      print(match.group())
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread John Machin

On Dec 9, 6:13 pm, charonzen [EMAIL PROTECTED] wrote:
 I have a list of strings.  These strings are previously selected
 bigrams with underscores between them ('and_the', 'nothing_given', and
 so on).  I need to write a regex that will read another text string
 that this list was derived from and replace selections in this text
 string with those from my list.  So in my text string, '... and the...
 ' becomes ' ... and_the...'.   I can't figure out how to manipulate

 re.sub(r'([a-z]*) ([a-z]*)', r'()', textstring)

 Any suggestions?

The usual suggestion is: Don't bother with regexes when simple string
methods will do the job.

>>> def ch_replace(alist, text):
...     for bigram in alist:
...         original = bigram.replace('_', ' ')
...         text = text.replace(original, bigram)
...     return text
...
>>> print ch_replace(
...     ['quick_brown', 'lazy_dogs', 'brown_fox'],
...     'The quick brown fox jumped over the lazy dogs.'
...     )
The quick_brown_fox jumped over the lazy_dogs.
>>> print ch_replace(['red_herring'], 'He prepared herring fillets.')
He prepared_herring fillets.


Another suggestion is to ensure that the job specification is not
overly simplified. How did you parse the text into words in the
prior exercise that produced the list of bigrams? Won't you need to
use the same parsing method in the current exercise of tagging the
bigrams with an underscore?

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread John Machin

On Dec 9, 6:13 pm, charonzen [EMAIL PROTECTED] wrote:

The following *may* come close to doing what your revised spec
requires:

import re
def ch_replace2(alist, text):
    for bigram in alist:
        pattern = r'\b' + bigram.replace('_', ' ') + r'\b'
        text = re.sub(pattern, bigram, text)
    return text
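
A small sketch of how the word-boundary version behaves on the same
kind of input as the earlier examples. The re.escape call is an
addition made here, on the assumption that a bigram might contain regex
metacharacters; the rest follows the function above.

```python
import re

def ch_replace2(alist, text):
    for bigram in alist:
        # \b on both sides keeps 'red herring' from matching inside
        # 'prepared herring'; re.escape is an added precaution.
        pattern = r'\b' + re.escape(bigram.replace('_', ' ')) + r'\b'
        text = re.sub(pattern, bigram, text)
    return text

fox = ch_replace2(['quick_brown', 'lazy_dogs', 'brown_fox'],
                  'The quick brown fox jumped over the lazy dogs.')
herring = ch_replace2(['red_herring'], 'He prepared herring fillets.')
print(fox)
print(herring)
```

Note that once 'quick brown' has become 'quick_brown', the pattern for
'brown fox' no longer matches, because the underscore counts as a word
character and there is no boundary in front of 'brown'.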

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread charonzen


 Another suggestion is to ensure that the job specification is not
 overly simplified. How did you parse the text into words in the
 prior exercise that produced the list of bigrams? Won't you need to
 use the same parsing method in the current exercise of tagging the
 bigrams with an underscore?

 Cheers,
 John

Thank you John, that definitely puts things in perspective!  I'm very
new to both Python and text parsing, and I often feel that I can't see
the forest for the trees.  If you're asking, I'm working on a project
that utilizes Church's mutual information score.  I tokenize my text,
split it into a list, derive some unigram and bigram dictionaries, and
then calculate a pmi dictionary based on x,y from the bigrams and
unigrams.  The bigrams that pass my threshold then get put into my
list of x_y strings, and you know the rest.  By modifying the original
text file, I can view 'x_y', z pairs as x,y and iterate it until I
have some collocations that are worth playing with.  So I think that
covers the question the same parsing method.  I'm sure there are more
pythonic ways to do it, but I'm on deadline :)

Thanks again!

Brandon
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread Gabriel Genellina

En Sun, 09 Dec 2007 16:45:53 -0300, charonzen [EMAIL PROTECTED]  
escribió:

 [John Machin] Another suggestion is to ensure that the job  
 specification is not
 overly simplified. How did you parse the text into words in the
 prior exercise that produced the list of bigrams? Won't you need to
 use the same parsing method in the current exercise of tagging the
 bigrams with an underscore?

 Thank you John, that definitely puts things in perspective!  I'm very
 new to both Python and text parsing, and I often feel that I can't see
 the forest for the trees.  If you're asking, I'm working on a project
 that utilizes Church's mutual information score.  I tokenize my text,
 split it into a list, derive some unigram and bigram dictionaries, and
 then calculate a pmi dictionary based on x,y from the bigrams and
 unigrams.  The bigrams that pass my threshold then get put into my
 list of x_y strings, and you know the rest.  By modifying the original
 text file, I can view 'x_y', z pairs as x,y and iterate it until I
 have some collocations that are worth playing with.  So I think that
 covers the question the same parsing method.  I'm sure there are more
 pythonic ways to do it, but I'm on deadline :)

Looks like you should work with the list of tokens, collapsing consecutive  
elements, not with the original text. Should be easier, and faster because  
you don't regenerate the text and tokenize it again and again.
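
A rough sketch of that idea, assuming the tokens are already in a list
and the selected bigrams are kept as (x, y) tuples in a set. The
function name merge_bigrams and the sample data are hypothetical.

```python
def merge_bigrams(tokens, bigrams):
    """Collapse adjacent tokens whose pair is in `bigrams` into 'x_y'."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigrams:
            # join the two tokens and skip past both of them
            merged.append(pair[0] + '_' + pair[1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['the', 'quick', 'brown', 'fox', 'and', 'the', 'dog']
selected = {('quick', 'brown'), ('brown', 'fox'), ('and', 'the')}
print(merge_bigrams(tokens, selected))
```

Running the function again on its own output, with pairs such as
('quick_brown', 'fox'), would give the iteration described above
without going back to the original text.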

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list

python/regex question... hope someone can help

2007-12-08 Thread charonzen

I have a list of strings.  These strings are previously selected
bigrams with underscores between them ('and_the', 'nothing_given', and
so on).  I need to write a regex that will read another text string
that this list was derived from and replace selections in this text
string with those from my list.  So in my text string, '... and the...
' becomes ' ... and_the...'.   I can't figure out how to manipulate

re.sub(r'([a-z]*) ([a-z]*)', r'()', textstring)

Any suggestions?

Thank you if you can help!
-- 
http://mail.python.org/mailman/listinfo/python-list

RegEx question

2007-10-04 Thread Robert Dailey

Hi,

The following regex (Not including the end quotes):

"@param\[in|out\] \w+ "

Should match any of the following:

@param[in] variable
@param[out] state
@param[in] foo
@param[out] bar


Correct? (Note the trailing whitespace in the regex as well as in the
examples)
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

It should also match:

@param[out] state Some description of this variable


On 10/4/07, Robert Dailey [EMAIL PROTECTED] wrote:

 Hi,

 The following regex (Not including the end quotes):

 @param\[in|out\] \w+ 

 Should match any of the following:

 @param[in] variable
 @param[out] state
 @param[in] foo
 @param[out] bar


 Correct? (Note the trailing whitespace in the regex as well as in the
 examples)

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Adam Lanier

On Thu, 2007-10-04 at 10:58 -0500, Robert Dailey wrote:
 It should also match:
 
 @param[out] state Some description of this variable
 
 
 On 10/4/07, Robert Dailey [EMAIL PROTECTED] wrote:
 Hi,
 
 The following regex (Not including the end quotes):
 
 @param\[in|out\] \w+ 
 
 Should match any of the following:
 
 @param[in] variable 
 @param[out] state 
 @param[in] foo 
 @param[out] bar 
 
 
 Correct? (Note the trailing whitespace in the regex as well as
 in the examples)
 
 -- 
 http://mail.python.org/mailman/listinfo/python-list

try @param\[(in|out)\] \w+ 
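
For what it's worth, in Python's re module the difference between the
two patterns comes down to precedence: without the parentheses the '|'
splits the whole pattern in two. A short sketch, using sample lines
taken from the question (other regex dialects may behave differently):

```python
import re

# Ungrouped: means  @param\[in   OR   out\] \w+<space>
ungrouped = re.compile(r"@param\[in|out\] \w+ ")
# Grouped: the alternation is limited to in/out
grouped = re.compile(r"@param\[(in|out)\] \w+ ")

samples = ["@param[in] variable ", "@param[out] state "]

for s in samples:
    print(repr(ungrouped.search(s).group(0)),
          repr(grouped.search(s).group(0)))
```

The ungrouped pattern still finds something in each line, but only a
fragment of it, which is easy to mistake for a working match.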

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread J. Clifford Dyer

You *are* talking about python regular expressions, right?  There are a number 
of different dialects.  Also, there could be issues with the quoting method 
(are you using raw strings?)  

The more specific you can get, the more we can help you.

Cheers,
Cliff
On Thu, Oct 04, 2007 at 11:54:32AM -0500, Robert Dailey wrote regarding Re: 
RegEx question:
 
On 10/4/07, Adam Lanier [EMAIL PROTECTED] wrote:
 
  try @param\[(in|out)\] \w+
 
This didn't work either :(
The tool using this regular expression (Comment Reflower for VS2005)
May be broken...
 

 -- 
 http://mail.python.org/mailman/listinfo/python-list
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

On 10/4/07, Adam Lanier [EMAIL PROTECTED] wrote:


 try @param\[(in|out)\] \w+


This didn't work either :(

The tool using this regular expression (Comment Reflower for VS2005) May be
broken...
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

On 10/4/07, J. Clifford Dyer [EMAIL PROTECTED] wrote:

 You *are* talking about python regular expressions, right?  There are a
 number of different dialects.  Also, there could be issues with the quoting
 method (are you using raw strings?)

 The more specific you can get, the more we can help you.


As far as the dialect, I can't be sure. I am unable to find documentation
for Comment Reflower and thus cannot figure out what type of regex it is
using. What exactly do you mean by your question, "are you using raw
strings?" Thanks for your response and I apologize for the lack of detail.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Jerry Hill

 As far as the dialect, I can't be sure. I am unable to find documentation
 for Comment Reflower and thus cannot figure out what type of regex it is
 using. What exactly do you mean by your question, are you using raw
 strings?. Thanks for your response and I apologize for the lack of detail.

Comment Reflower appears to be a plugin for Visual Studio written in
C#.  As far as I can tell, it has nothing to do with Python at all.

A quick look at their sourceforge page
(http://sourceforge.net/projects/commentreflower/) doesn't show any
mailing lists or discussion groups.  Maybe try emailing the author
directly, or asking a C# language group about whatever the standard C#
regular expression library is.

-- 
Jerry
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Manu Hack

On 10/4/07, Robert Dailey [EMAIL PROTECTED] wrote:
 On 10/4/07, Adam Lanier [EMAIL PROTECTED] wrote:
 
  try @param\[(in|out)\] \w+
 

 This didn't work either :(

 The tool using this regular expression (Comment Reflower for VS2005) May be
 broken...

 --
 http://mail.python.org/mailman/listinfo/python-list


How about @param\[[i|o][n|u]t*\]\w+ ?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Tim Chase

[sigh...replying to my own post]
 However, things to try:
 
 - sometimes the grouping parens need to be escaped with \
 
 - sometimes \w isn't a valid character class, so use the 
 long-hand variant of something like [a-zA-Z0-9_]
 
 - sometimes the + is escaped with a \
 
 - if you don't use raw strings, you'll need to escape your \ 
 characters, making each instance \\

just to be clear...these are some variants you may find in 
non-python regexps (or in python regexps if you're not using raw 
strings)

-tkc




-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

I am not a regex expert, I simply assumed regex was standardized to follow
specific guidelines. I also made the assumption that this was a good place
to pose the question since regular expressions are a feature of Python. The
question concerned regular expressions in general, not really the
application. However, now that I know that regex can be different, I'll try
to contact the author directly to find out the dialect and then find the
appropriate location for my question from there. I do appreciate everyone's
help. I've tried the various suggestions offered here, however none of them
work. I can only assume at this point that this regex is drastically
different or the application reading the regex is just broken.

Thanks again for everyone's help!
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Tim Chase

 try @param\[(in|out)\] \w+

 This didn't work either :(

 The tool using this regular expression (Comment Reflower for VS2005) May be
 broken...
 
 How about @param\[[i|o][n|u]t*\]\w+ ?

...if you want to accept patterns like

   @param[iutt]xxx

...

The regexp at the top (Adam's original reply) would be the valid 
regexp in python and matches all the tests thrown at it, assuming 
it's placed in a raw string:

   r = re.compile(r"@param\[(in|out)\] \w+")

If it's not a python regexp, this isn't really the list for the 
question, is it? ;)

However, things to try:

- sometimes the grouping parens need to be escaped with \

- sometimes \w isn't a valid character class, so use the 
long-hand variant of something like [a-zA-Z0-9_]

- sometimes the + is escaped with a \

- if you don't use raw strings, you'll need to escape your \ 
characters, making each instance \\
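
A minimal sketch of the raw-string point in Python, assuming the tool used Python's dialect (which the thread suggests it does not). The sample strings are made up for illustration; the patterns are the ones quoted in this thread.

```python
import re

# Raw-string form of the pattern suggested earlier in the thread.
raw = re.compile(r"@param\[(in|out)\] \w+")

# The same pattern in an ordinary string literal: every backslash is doubled.
escaped = re.compile("@param\\[(in|out)\\] \\w+")

# The character-class variant, which also accepts unintended input.
loose = re.compile(r"@param\[[i|o][n|u]t*\]\w+")

# Hypothetical sample strings, not taken from the original tool.
samples = ["@param[in] width", "@param[out] result", "@param[iutt]xxx"]

for text in samples:
    print(text, bool(raw.match(text)), bool(loose.match(text)))
```

Both literals compile to the same pattern text, so choosing raw strings is purely a matter of readability.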

HTH,

-tkc


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread John Masters

On 15:25 Thu 04 Oct , Robert Dailey wrote:
 I am not a regex expert, I simply assumed regex was standardized to follow
 specific guidelines.

There are as many different regex flavours as there are Linux distros.
Each follows the basic rules but implements them slightly differently
and adds their own 'extensions'. 

 I also made the assumption that this was a good place
 to pose the question since regular expressions are a feature of Python.

The best place to pose a regex question is in the sphere of usage, i.e.
Perl regexes differ hugely in implementation from OO langs like Python
or Java, while shells like bash or zsh use regexes slightly differently,
as do shell scripting languages like awk or sed. 

 The question concerned regular expressions in general, not really the
 application. However, now that I know that regex can be different, I'll try
 to contact the author directly to find out the dialect and then find the
 appropriate location for my question from there. I do appreciate everyone's
 help. I've tried the various suggestions offered here, however none of them
 work. I can only assume at this point that this regex is drastically
 different or the application reading the regex is just broken.

If you care to PM me with details of the language/context I will try to
help but I am no expert.

Regards, John
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-21 Thread Ivo

crybaby wrote:
 On Sep 20, 4:12 pm, Tobiah [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 I need to extract the number on each td tags from a html file.
 i.e 49.950 from the following:
 td align=right width=80font size=2 face=New Times
 Roman,Times,Serifnbsp;49.950nbsp;/font/td
 The actual number between: nbsp;49.950nbsp; can be any number of
 digits before decimal and after decimal.
 td align=right width=80font size=2 face=New Times
 Roman,Times,Serifnbsp;##.nbsp;/font/td
 How can I just extract the real/integer number using regex?
 '[0-9]*\.[0-9]*'

 --
 Posted via a free Usenet account fromhttp://www.teranews.com
 
 I am trying to use BeautifulSoup:
 
 soup = BeautifulSoup(page)
 
 td_tags = soup.findAll('td')
 i=0
 for td in td_tags:
     i = i+1
     print "td: ", td
     # re.search('[0-9]*\.[0-9]*', td)
     price = re.compile('[0-9]*\.[0-9]*').search(td)
 
 I am getting an error:
 
price= re.compile('[0-9]*\.[0-9]*').search(td)
 TypeError: expected string or buffer
 
 Does beautiful soup returns array of objects? If so, how do I pass
 td instance as string to re.search?  What is the different between
 re.search vs re.compile().search?
 

I don't know anything about BeautifulSoup, but to the other questions:

var = re.compile(regexpr) compiles the expression, and after that you can 
use var as the reference to that compiled expression (costs less).

re.search(expr, string) compiles and searches every time. This can 
potentially be more expensive in computing power, especially if you 
have to use the expression many times.

The way you use it, it doesn't matter.

do:
pattern = re.compile('[0-9]*\.[0-9]*')
result = pattern.findall("your text here")

Now you can reuse pattern.

Cheers,
Ivo.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-21 Thread David

 re.search(expr, string) compiles and searches every time. This can
 potentially be more expensive in calculating power. especially if you
 have to use the expression a lot of times.

The re module-level helper functions cache expressions and their
compiled form in a dict. They are only compiled once. The main
overhead would be for repeated dict lookups.

See sre.py (included from re.py) for more details. /usr/lib/python2.4/sre.py
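
A small sketch comparing the two call styles. It only shows that they return the same result; the caching described above is an implementation detail of the re module and is not measured here. The names NUMBER and text are made up for this example.

```python
import re

NUMBER = r"[0-9]+\.[0-9]+"
text = "49.950 and 12.5"

# Module-level helper: the pattern string is passed on every call.
via_module = re.findall(NUMBER, text)

# Explicitly compiled pattern: compiled once, then reused directly.
pattern = re.compile(NUMBER)
via_compiled = pattern.findall(text)

print(via_module)
print(via_compiled)
```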
-- 
http://mail.python.org/mailman/listinfo/python-list

Python Regex Question

2007-09-20 Thread joemystery123

I need to extract the number on each td tags from a html file.

i.e 49.950 from the following:

<td align="right" width="80"><font size="2" face="New Times
Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>

The actual number between: &nbsp;49.950&nbsp; can be any number of
digits before decimal and after decimal.

<td align="right" width="80"><font size="2" face="New Times
Roman,Times,Serif">&nbsp;##.&nbsp;</font></td>

How can I just extract the real/integer number using regex?

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-20 Thread Tobiah

[EMAIL PROTECTED] wrote:
 I need to extract the number on each td tags from a html file.
 
 i.e 49.950 from the following:
 
 td align=right width=80font size=2 face=New Times
 Roman,Times,Serifnbsp;49.950nbsp;/font/td
 
 The actual number between: nbsp;49.950nbsp; can be any number of
 digits before decimal and after decimal.
 
 td align=right width=80font size=2 face=New Times
 Roman,Times,Serifnbsp;##.nbsp;/font/td
 
 How can I just extract the real/integer number using regex?
 


'[0-9]*\.[0-9]*'
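
A hedged sketch of how this pattern behaves, and of a stricter alternative. The HTML string is reconstructed from the question (its attribute quoting is an assumption), and the stricter pattern is only one possible choice, not something proposed in the thread.

```python
import re

# Sample cell reconstructed from the question; attribute quoting is assumed.
html = ('<td align="right" width="80"><font size="2" '
        'face="New Times Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>')

# The suggested pattern allows zero digits on both sides, so a lone "." matches.
loose = re.findall(r"[0-9]*\.[0-9]*", "price 49.950. done")

# Stricter sketch: require digits and anchor on the surrounding &nbsp; entities.
strict = re.findall(r"&nbsp;([0-9]+(?:\.[0-9]+)?)&nbsp;", html)

print(loose)
print(strict)
```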

-- 
Posted via a free Usenet account from http://www.teranews.com

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-20 Thread Gerardo Herzig

[EMAIL PROTECTED] wrote:

I need to extract the number on each td tags from a html file.

i.e 49.950 from the following:

td align=right width=80font size=2 face=New Times
Roman,Times,Serifnbsp;49.950nbsp;/font/td

The actual number between: nbsp;49.950nbsp; can be any number of
digits before decimal and after decimal.

td align=right width=80font size=2 face=New Times
Roman,Times,Serifnbsp;##.nbsp;/font/td

How can I just extract the real/integer number using regex?

  

If all the td's content has the &nbsp;[value_to_extract]&nbsp; pattern, 
things get simpler

[untested]

/<td.*&nbsp;([^&]*)&nbsp;/

the parentheses will be used to group() the result (and extract what you 
really want)

Cheers
Gerardo
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-20 Thread crybaby

On Sep 20, 4:12 pm, Tobiah [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
  I need to extract the number on each td tags from a html file.

  i.e 49.950 from the following:

  td align=right width=80font size=2 face=New Times
  Roman,Times,Serifnbsp;49.950nbsp;/font/td

  The actual number between: nbsp;49.950nbsp; can be any number of
  digits before decimal and after decimal.

  td align=right width=80font size=2 face=New Times
  Roman,Times,Serifnbsp;##.nbsp;/font/td

  How can I just extract the real/integer number using regex?

 '[0-9]*\.[0-9]*'

 --
 Posted via a free Usenet account fromhttp://www.teranews.com

I am trying to use BeautifulSoup:

soup = BeautifulSoup(page)

td_tags = soup.findAll('td')
i=0
for td in td_tags:
    i = i+1
    print "td: ", td
    # re.search('[0-9]*\.[0-9]*', td)
    price = re.compile('[0-9]*\.[0-9]*').search(td)

I am getting an error:

   price= re.compile('[0-9]*\.[0-9]*').search(td)
TypeError: expected string or buffer

Does Beautiful Soup return an array of objects? If so, how do I pass
the td instance as a string to re.search?  What is the difference between
re.search and re.compile().search?

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-12 Thread James T. Dennis

johnny [EMAIL PROTECTED] wrote:
 I need to get the content inside the bracket.

 eg. some characters before bracket (3.12345).
 I need to get whatever inside the (), in this case 3.12345.
 How do you do this with python regular expression?

 I'm going to presume that you mean something like:

I want to extract floating point numerics from parentheses
embedded in other, arbitrary, text.

 Something like:

 given='adfasdfafd(3.14159265)asdfasdfadsfasf'
 import re
 mymatch = re.search(r'\(([0-9.]+)\)', given).groups()[0]
 mymatch
'3.14159265'
 

 Of course, as with any time you're contemplating the use of regular
 expressions, there are lots of questions to consider about the exact
 requirements here.  What if there is more than one such pattern?  Do you
 only want the first match per line (or other string)?  (That's all my
 example will give you).  What if there are no matches?  My example
 will raise an AttributeError (since the re.search will return the
 None object rather than a match object; and naturally the None
 object has no '.groups()' method).

 The following might work better:

 mymatches = re.findall(r'\(([0-9.]+)\)', given)
 if len(mymatches):
     ...

 ... and, of course, you might be better with a compiled regexp if
 you're going to repeat the search on many strings:

num_extractor = re.compile(r'\(([0-9.]+)\)')
for line in myfile:
    for num in num_extractor.findall(line):
        pass
        # do whatever with all these numbers
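
A runnable sketch of the same idea, with a list of made-up lines standing in for myfile. The helper name numbers_in_parens is hypothetical.

```python
import re

NUM_IN_PARENS = re.compile(r"\(([0-9.]+)\)")

def numbers_in_parens(text):
    """Return every parenthesised numeric string found in text."""
    return NUM_IN_PARENS.findall(text)

# Hypothetical input lines standing in for `myfile`.
lines = ["adfasdfafd(3.14159265)asdf", "no brackets here", "a(1.5)b(2)c"]

for line in lines:
    # findall returns an empty list when nothing matches, so no None check.
    print(numbers_in_parens(line))
```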


-- 
Jim Dennis,
Starshine: Signed, Sealed, Delivered

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-11 Thread Gary Herron

johnny wrote:
 I need to get the content inside the bracket.

 eg. some characters before bracket (3.12345).

 I need to get whatever inside the (), in this case 3.12345.

 How do you do this with python regular expression?
   

>>> import re
>>> x = re.search("[0-9.]+", "(3.12345)")
>>> print x.group(0)
3.12345

There's a lot more to the re module, of course.  I'd suggest reading the
manual, but this should get you started.


Gary Herron

-- 
http://mail.python.org/mailman/listinfo/python-list

Simple Python REGEX Question

2007-05-11 Thread johnny

I need to get the content inside the bracket.

eg. some characters before bracket (3.12345).

I need to get whatever inside the (), in this case 3.12345.

How do you do this with python regular expression?

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-11 Thread John Machin

On May 12, 2:21 am, Gary Herron [EMAIL PROTECTED] wrote:
 johnny wrote:
  I need to get the content inside the bracket.

  eg. some characters before bracket (3.12345).

  I need to get whatever inside the (), in this case 3.12345.

  How do you do this with python regular expression?

  import re
  x = re.search("[0-9.]+", "(3.12345)")
  print x.group(0)

 3.12345

 There's a lot more to the re module, of course.  I'd suggest reading the
 manual, but this should get you started.


>>> s = "some chars like 987 before the bracket (3.12345) etc"
>>> x = re.search("[0-9.]+", s)
>>> x.group(0)
'987'

OP sez: "I need to get the content inside the bracket"
OP sez: "I need to get whatever inside the ()"

My interpretation:

>>> for s in ['foo(123)bar', 'foo(123))bar', 'foo()bar', 'foobar']:
...     x = re.search(r"\([^)]*\)", s)
...     print repr(x and x.group(0)[1:-1])
...
'123'
'123'
''
None




-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-11 Thread Steven D'Aprano

On Fri, 11 May 2007 08:54:31 -0700, johnny wrote:

 I need to get the content inside the bracket.
 
 eg. some characters before bracket (3.12345).
 
 I need to get whatever inside the (), in this case 3.12345.
 
 How do you do this with python regular expression?

Why would you bother? If you know your string is a bracketed expression,
all you need is:

s = "(3.12345)"
contents = s[1:-1] # ignore the first and last characters

If your string is more complex:

s = "lots of things here (3.12345) and some more things here"

then the task is harder. In general, you can't use regular expressions for
that, you need a proper parser, because brackets can be nested.

But if you don't care about nested brackets, then something like this is
easy:

def get_bracket(s):
    p, q = s.find('('), s.find(')')
    if p == -1 or q == -1: raise ValueError("Missing bracket")
    if p > q: raise ValueError("Close bracket before open bracket")
    return s[p+1:q]

Or as a one liner with no error checking:

s[s.find('(')+1:s.find(')')]
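
A short check of the slicing approach next to a regex equivalent. This is a sketch, assuming the slice is meant to stop just before the closing bracket; the nested-bracket input is made up to show the limitation mentioned above.

```python
import re

def get_bracket(s):
    """Return the text between the first '(' and the first ')'."""
    p, q = s.find("("), s.find(")")
    if p == -1 or q == -1:
        raise ValueError("Missing bracket")
    if p > q:
        raise ValueError("Close bracket before open bracket")
    return s[p + 1:q]

s = "lots of things here (3.12345) and some more things here"
by_slice = get_bracket(s)
by_regex = re.search(r"\(([^)]*)\)", s).group(1)
print(by_slice, by_regex)

# Nested brackets defeat this simple approach: it stops at the first ')'.
nested = get_bracket("f(a(b)c)")
print(nested)
```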


-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-29 Thread proctor

On Apr 27, 8:26 am, Michael Hoffman [EMAIL PROTECTED] wrote:
 proctor wrote:
  On Apr 27, 1:33 am, Paul McGuire [EMAIL PROTECTED] wrote:
  On Apr 27, 1:33 am, proctor [EMAIL PROTECTED] wrote:
  rx_test = re.compile('/x([^x])*x/')
  s = '/xabcx/'
  if rx_test.findall(s):
  print rx_test.findall(s)
  
  i expect the output to be ['abc'] however it gives me only the last
  single character in the group: ['c']

  As Josiah already pointed out, the * needs to be inside the grouping
  parens.
  so my question remains, why doesn't the star quantifier seem to grab
  all the data.

 Because you didn't use it *inside* the group, as has been said twice.
 Let's take a simpler example:

 >>> import re
 >>> text = "xabc"
 >>> re_test1 = re.compile("x([^x])*")
 >>> re_test2 = re.compile("x([^x]*)")
 >>> re_test1.match(text).groups()
 ('c',)
 >>> re_test2.match(text).groups()
 ('abc',)

 There are three places that match ([^x]) in text. But each time you find
 one you overwrite the previous example.

  isn't findall() intended to return all matches?

 It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used
 a grouping parenthesis in there, it only returns one group from each
 pattern.

 Back to my example:

 >>> re_test1.findall("xabcxaaaxabc")
 ['c', 'a', 'c']

 Here it finds multiple matches, but only because the x occurs multiple
 times as well. In your example there is only one match.

  i would expect either 'abc' or 'a', 'b', 'c' or at least just
  'a' (because that would be the first match).

 You are essentially doing this:

 group1 = "a"
 group1 = "b"
 group1 = "c"

 After those three statements, you wouldn't expect group1 to be "abc" or
 "a". You'd expect it to be "c".
 --
 Michael Hoffman

thank you all again for helping to clarify this for me.  of course you
were exactly right, and the problem lay not with python or the text,
but with me.  i mistakenly understood the text to be attempting to
capture the C style comment, when in fact it was merely matching it.

apologies.

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

regex question

2007-04-27 Thread proctor

hello,

i have a regex:  rx_test = re.compile('/x([^x])*x/')

which is part of this test program:



import re

rx_test = re.compile('/x([^x])*x/')

s = '/xabcx/'

if rx_test.findall(s):
    print rx_test.findall(s)



i expect the output to be ['abc'] however it gives me only the last
single character in the group: ['c']

C:\test>python retest.py
['c']

can anyone point out why this is occurring?  i can capture the entire
group by doing this:

rx_test = re.compile('/x([^x]+)*x/')
but why isn't the 'star' grabbing the whole group?  and why isn't each
letter 'a', 'b', and 'c' present, either individually, or as a group
(group is expected)?

any clarification is appreciated!

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Josiah Carlson

proctor wrote:
 i have a regex:  rx_test = re.compile('/x([^x])*x/')

You probably want...

rx_test = re.compile('/x([^x]*)x/')


  - Josiah
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Paul McGuire

On Apr 27, 1:33 am, proctor [EMAIL PROTECTED] wrote:
 hello,

 i have a regex:  rx_test = re.compile('/x([^x])*x/')

 which is part of this test program:

 

 import re

 rx_test = re.compile('/x([^x])*x/')

 s = '/xabcx/'

 if rx_test.findall(s):
 print rx_test.findall(s)

 

 i expect the output to be ['abc'] however it gives me only the last
 single character in the group: ['c']

 C:\testpython retest.py
 ['c']

 can anyone point out why this is occurring?  i can capture the entire
 group by doing this:

 rx_test = re.compile('/x([^x]+)*x/')
 but why isn't the 'star' grabbing the whole group?  and why isn't each
 letter 'a', 'b', and 'c' present, either individually, or as a group
 (group is expected)?

 any clarification is appreciated!

 sincerely,
 proctor

As Josiah already pointed out, the * needs to be inside the grouping
parens.

Since re's do lookahead/backtracking, you can also write:

rx_test = re.compile('/x(.*?)x/')

The '?' is there to make sure the .* repetition stops at the first
occurrence of x/.
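
A small sketch of the greedy versus non-greedy difference, using a made-up string with two "comments" so the effect is visible.

```python
import re

# Hypothetical input with two x-delimited comments.
text = "/xabcx/ /xdefx/"

# Greedy: .* runs to the last "x/" in the string.
greedy = re.findall(r"/x(.*)x/", text)

# Non-greedy: .*? stops at the first "x/" it can.
lazy = re.findall(r"/x(.*?)x/", text)

print(greedy)
print(lazy)
```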

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 1:33 am, Paul McGuire [EMAIL PROTECTED] wrote:
 On Apr 27, 1:33 am, proctor [EMAIL PROTECTED] wrote:



  hello,

  i have a regex:  rx_test = re.compile('/x([^x])*x/')

  which is part of this test program:

  

  import re

  rx_test = re.compile('/x([^x])*x/')

  s = '/xabcx/'

  if rx_test.findall(s):
  print rx_test.findall(s)

  

  i expect the output to be ['abc'] however it gives me only the last
  single character in the group: ['c']

  C:\testpython retest.py
  ['c']

  can anyone point out why this is occurring?  i can capture the entire
  group by doing this:

  rx_test = re.compile('/x([^x]+)*x/')
  but why isn't the 'star' grabbing the whole group?  and why isn't each
  letter 'a', 'b', and 'c' present, either individually, or as a group
  (group is expected)?

  any clarification is appreciated!

  sincerely,
  proctor

 As Josiah already pointed out, the * needs to be inside the grouping
 parens.

 Since re's do lookahead/backtracking, you can also write:

 rx_test = re.compile('/x(.*?)x/')

 The '?' is there to make sure the .* repetition stops at the first
 occurrence of x/.

 -- Paul

i am working through an example from the oreilly book mastering
regular expressions (2nd edition) by jeffrey friedl.  my post was a
snippet from a regex to match C comments.   every 'x' in the regex
represents a 'star' in actual usage, so that backslash escaping is not
needed in the example (on page 275).  it looks like this:

===

/x([^x]|x+[^/x])*x+/

it is supposed to match '/x', the opening delimiter, then

(
either anything that is 'not x',

or,

'x' one or more times, 'not followed by a slash or an x'
) any number of times (the 'star')

followed finally by the closing delimiter.

===
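
For reference, a sketch of the same pattern with real '*' characters. The escaping and the non-capturing group (?:...) are additions made for this example, and the source string is made up.

```python
import re

# The alternation described above, with '*' escaped and the repeated
# group made non-capturing so findall returns the whole comment.
C_COMMENT = re.compile(r"/\*(?:[^*]|\*+[^/*])*\*+/")

# Hypothetical C source fragment.
source = "int a; /* first */ int b; /** second **/ int c; /* a * b */"

comments = C_COMMENT.findall(source)
print(comments)
```

With no capturing group in the pattern, findall reports the full text of each match rather than the last repetition of a group.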

this does not seem to work in python the way i understand it should
from the book, and i simplified the example in my first post to
concentrate on just one part of the alternation that i felt was not
acting as expected.

so my question remains, why doesn't the star quantifier seem to grab
all the data.  isn't findall() intended to return all matches?  i
would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match).  why does it give only
one letter, and at that, the /last/ letter in the sequence??

thanks again for replying!

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Michael Hoffman

proctor wrote:
 On Apr 27, 1:33 am, Paul McGuire [EMAIL PROTECTED] wrote:
 On Apr 27, 1:33 am, proctor [EMAIL PROTECTED] wrote:

 rx_test = re.compile('/x([^x])*x/')
 s = '/xabcx/'
 if rx_test.findall(s):
 print rx_test.findall(s)
 
 i expect the output to be ['abc'] however it gives me only the last
 single character in the group: ['c']

 As Josiah already pointed out, the * needs to be inside the grouping
 parens.

 so my question remains, why doesn't the star quantifier seem to grab
 all the data.

Because you didn't use it *inside* the group, as has been said twice. 
Let's take a simpler example:

>>> import re
>>> text = "xabc"
>>> re_test1 = re.compile("x([^x])*")
>>> re_test2 = re.compile("x([^x]*)")
>>> re_test1.match(text).groups()
('c',)
>>> re_test2.match(text).groups()
('abc',)

There are three places that match ([^x]) in text. But each time you find 
one you overwrite the previous example.

 isn't findall() intended to return all matches?

It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used 
a grouping parenthesis in there, it only returns one group from each 
pattern.

Back to my example:

>>> re_test1.findall("xabcxaaaxabc")
['c', 'a', 'c']

Here it finds multiple matches, but only because the x occurs multiple 
times as well. In your example there is only one match.

 i would expect either 'abc' or 'a', 'b', 'c' or at least just
 'a' (because that would be the first match).

You are essentially doing this:

group1 = "a"
group1 = "b"
group1 = "c"

After those three statements, you wouldn't expect group1 to be "abc" or 
"a". You'd expect it to be "c".
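
The overwrite behaviour can be seen directly. This is a sketch using the same two patterns; splitting the captured run into characters is just one way to recover the individual letters.

```python
import re

text = "xabc"

# Star outside the group: the group is re-captured on each repetition,
# so only the final character survives.
last_only = re.match(r"x([^x])*", text).group(1)

# Star inside the group: one capture covering the whole run.
whole_run = re.match(r"x([^x]*)", text).group(1)

# To get each repeated character separately, capture the run and split it.
each_char = list(whole_run)

print(last_only, whole_run, each_char)
```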
-- 
Michael Hoffman
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Duncan Booth

proctor [EMAIL PROTECTED] wrote:

 so my question remains, why doesn't the star quantifier seem to grab
 all the data.  isn't findall() intended to return all matches?  i
 would expect either 'abc' or 'a', 'b', 'c' or at least just
 'a' (because that would be the first match).  why does it give only
 one letter, and at that, the /last/ letter in the sequence??
 
findall returns the matched groups. You get one group for each 
parenthesised sub-expression, and (the important bit) if a single 
parenthesised expression matches more than once the group only contains 
the last string which matched it.

Putting a star after a subexpression means that subexpression can match 
zero or more times, but each time it only matches a single character 
which is why your findall only returned the last character it matched.

You need to move the * inside the parentheses used to define the group, 
then the group will match only once but will include everything that it 
matched.

Consider:

>>> re.findall('(.)', 'abc')
['a', 'b', 'c']
>>> re.findall('(.)*', 'abc')
['c', '']
>>> re.findall('(.*)', 'abc')
['abc', '']

The first pattern finds a single character which findall manages to 
match 3 times.

The second pattern finds a group with a single character zero or more 
times in the pattern, so the first time it matches each of a,b,c in turn 
and returns the c, and then next time around we get an empty string when 
group matched zero times.

In the third pattern we are looking for a group with any number of 
characters in it. First time we get all of the string, then we get 
another empty match.
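
A sketch that shows where each match sits, using finditer to expose the spans that findall hides. It assumes Python 3.7 or later; note that finditer reports a group that never took part in a match as None, where findall shows an empty string.

```python
import re

report = {}
for pattern in ["(.)", "(.)*", "(.*)"]:
    # Record the (start, end) span and the group for every match.
    report[pattern] = [(m.span(), m.group(1))
                       for m in re.finditer(pattern, "abc")]
    print(pattern, re.findall(pattern, "abc"), report[pattern])
```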
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Paul McGuire

On Apr 27, 9:10 am, proctor [EMAIL PROTECTED] wrote:
 On Apr 27, 1:33 am, Paul McGuire [EMAIL PROTECTED] wrote:





  On Apr 27, 1:33 am, proctor [EMAIL PROTECTED] wrote:

   hello,

   i have a regex:  rx_test = re.compile('/x([^x])*x/')

   which is part of this test program:

   

   import re

   rx_test = re.compile('/x([^x])*x/')

   s = '/xabcx/'

   if rx_test.findall(s):
   print rx_test.findall(s)

   

   i expect the output to be ['abc'] however it gives me only the last
   single character in the group: ['c']

   C:\testpython retest.py
   ['c']

   can anyone point out why this is occurring?  i can capture the entire
   group by doing this:

   rx_test = re.compile('/x([^x]+)*x/')
   but why isn't the 'star' grabbing the whole group?  and why isn't each
   letter 'a', 'b', and 'c' present, either individually, or as a group
   (group is expected)?

   any clarification is appreciated!

   sincerely,
   proctor

  As Josiah already pointed out, the * needs to be inside the grouping
  parens.

  Since re's do lookahead/backtracking, you can also write:

  rx_test = re.compile('/x(.*?)x/')

  The '?' is there to make sure the .* repetition stops at the first
  occurrence of x/.

  -- Paul

 i am working through an example from the oreilly book mastering
 regular expressions (2nd edition) by jeffrey friedl.  my post was a
 snippet from a regex to match C comments.   every 'x' in the regex
 represents a 'star' in actual usage, so that backslash escaping is not
 needed in the example (on page 275).  it looks like this:

 ===

 /x([^x]|x+[^/x])*x+/

 it is supposed to match '/x', the opening delimiter, then

 (
 either anything that is 'not x',

 or,

 'x' one or more times, 'not followed by a slash or an x'
 ) any number of times (the 'star')

 followed finally by the closing delimiter.

 ===

 this does not seem to work in python the way i understand it should
 from the book, and i simplified the example in my first post to
 concentrate on just one part of the alternation that i felt was not
 acting as expected.

 so my question remains, why doesn't the star quantifier seem to grab
 all the data.  isn't findall() intended to return all matches?  i
 would expect either 'abc' or 'a', 'b', 'c' or at least just
 'a' (because that would be the first match).  why does it give only
 one letter, and at that, the /last/ letter in the sequence??

 thanks again for replying!

 sincerely,
 proctor

Again, I'll repeat some earlier advice:  you need to move the '*'
inside the parens - you are still leaving it outside.  Also, get in
the habit of using raw literal notation (that is r"slkjdfljf" instead
of "lsjdlfkjs") when defining re strings - you don't have backslash
issues yet, but you will as soon as you start putting real '*'
characters in your expression.

However, when I test this,

restr = r'/x(([^x]|x+[^/])*)x+/'
re_ = re.compile(restr)
print re_.findall("/xabxxcx/ /x123xxx/")

findall now starts to give a tuple for each comment,

[('abxxc', 'xxc'), ('123xx', 'xx')]

so you have gone beyond my limited re skill, and will need help from
someone else.

But I suggest you add some tests with multiple consecutive 'x'
characters in the middle of your comment, and multiple consecutive 'x'
characters before the trailing comment.  In fact, from my
recollections of trying to implement this type of comment recognizer
by hand a long time ago in a job far, far away, test with both even
and odd numbers of 'x' characters.
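
A sketch of such a test, with 'x' still standing in for '*'. BOOK uses the alternation from the book as quoted in this thread, and LOOSE is the variant tested in this message; both names and the non-capturing inner group are choices made for this example.

```python
import re

# Book's alternation; the inner group is non-capturing.
BOOK = re.compile(r"/x((?:[^x]|x+[^/x])*)x+/")
# Variant from this message, which allows an 'x' after a run of x's.
LOOSE = re.compile(r"/x((?:[^x]|x+[^/])*)x+/")

results = []
for inner in range(1, 5):          # run of x's inside the comment body
    for trailing in range(1, 5):   # run of x's before the closing '/'
        body = "ab" + "x" * inner + "c"
        comment = "/x" + body + "x" * trailing + "/"
        m = BOOK.fullmatch(comment)
        results.append(m is not None and m.group(1) == body)

print(all(results))
book_body = BOOK.search("/x123xxx/").group(1)
loose_body = LOOSE.search("/x123xxx/").group(1)
print(book_body, loose_body)
```

The second print shows why the captured text differed above: without the 'x' in the negated class, the trailing run is partly absorbed into the body.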

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 8:26 am, Michael Hoffman [EMAIL PROTECTED] wrote:
 proctor wrote:
  On Apr 27, 1:33 am, Paul McGuire [EMAIL PROTECTED] wrote:
  On Apr 27, 1:33 am, proctor [EMAIL PROTECTED] wrote:
  rx_test = re.compile('/x([^x])*x/')
  s = '/xabcx/'
  if rx_test.findall(s):
  print rx_test.findall(s)
  
  i expect the output to be ['abc'] however it gives me only the last
  single character in the group: ['c']

  As Josiah already pointed out, the * needs to be inside the grouping
  parens.
  so my question remains, why doesn't the star quantifier seem to grab
  all the data.

 Because you didn't use it *inside* the group, as has been said twice.
 Let's take a simpler example:

 >>> import re
 >>> text = "xabc"
 >>> re_test1 = re.compile("x([^x])*")
 >>> re_test2 = re.compile("x([^x]*)")
 >>> re_test1.match(text).groups()
 ('c',)
 >>> re_test2.match(text).groups()
 ('abc',)

 There are three places that match ([^x]) in text. But each time you find
 one you overwrite the previous example.

  isn't findall() intended to return all matches?

 It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used
 a grouping parenthesis in there, it only returns one group from each
 pattern.

 Back to my example:

 >>> re_test1.findall("xabcxaaaxabc")
 ['c', 'a', 'c']

 Here it finds multiple matches, but only because the x occurs multiple
 times as well. In your example there is only one match.

  i would expect either 'abc' or 'a', 'b', 'c' or at least just
  'a' (because that would be the first match).

 You are essentially doing this:

 group1 = "a"
 group1 = "b"
 group1 = "c"

 After those three statements, you wouldn't expect group1 to be "abc" or
 "a". You'd expect it to be "c".
 --
 Michael Hoffman

ok, thanks michael.

so i am now assuming that either the book's example assumes perl, and
perl is different from python in this regard, or, that the book's
example is faulty.  i understand all the examples given since my
question, and i know what i need to do to make it work.  i am raising
the question because the book says one thing, but the example is not
working for me.  i am searching for the source of the discrepancy.

i will try to research the differences between perl's and python's
regex engines.

thanks again,

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 8:37 am, Duncan Booth [EMAIL PROTECTED] wrote:
 proctor [EMAIL PROTECTED] wrote:
  so my question remains, why doesn't the star quantifier seem to grab
  all the data.  isn't findall() intended to return all matches?  i
  would expect either 'abc' or 'a', 'b', 'c' or at least just
  'a' (because that would be the first match).  why does it give only
  one letter, and at that, the /last/ letter in the sequence??

 findall returns the matched groups. You get one group for each
 parenthesised sub-expression, and (the important bit) if a single
 parenthesised expression matches more than once the group only contains
 the last string which matched it.

 Putting a star after a subexpression means that subexpression can match
 zero or more times, but each time it only matches a single character
 which is why your findall only returned the last character it matched.

 You need to move the * inside the parentheses used to define the group,
 then the group will match only once but will include everything that it
 matched.

 Consider:

  re.findall('(.)', 'abc')
 ['a', 'b', 'c']
  re.findall('(.)*', 'abc')
 ['c', '']
  re.findall('(.*)', 'abc')

 ['abc', '']

 The first pattern finds a single character which findall manages to
 match 3 times.

 The second pattern finds a group with a single character zero or more
 times in the pattern, so the first time it matches each of a,b,c in turn
 and returns the c, and then next time around we get an empty string when
 group matched zero times.

 In the third pattern we are looking for a group with any number of
 characters in it. First time we get all of the string, then we get
 another empty match.

thank you this is interesting.  in the second example, where does the
'nothingness' match, at the end?  why does the regex 'run again' when
it has already matched everything?  and if it reports an empty match
along with a non-empty match, why only the two?

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Duncan Booth

proctor [EMAIL PROTECTED] wrote:

  re.findall('(.)*', 'abc')
 ['c', '']

 thank you this is interesting.  in the second example, where does the
 'nothingness' match, at the end?  why does the regex 'run again' when
 it has already matched everything?  and if it reports an empty match
 along with a non-empty match, why only the two?
 

There are 4 possible starting points for a regular expression to match in a 
three character string. The regular expression would match at any starting 
point so in theory you could find 4 possible matches in the string. In this 
case they would be 'abc', 'bc', 'c', ''.

However findall won't get any overlapping matches, so there are only two
possible matches and it returns both of them: 'abc' and '' (or rather it
returns the matching group within the match, so you only see the 'c'
although it matched 'abc').

If you use a regex which doesn't match an empty string (e.g. '/x(.*?)x/')
then you won't get the empty match.
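
One way to see the two matches described above is to walk them with re.finditer and look at each match's span, the whole match, and group 1. This is only a sketch for Python 3; the names pattern, text and seen are made up for illustration.

```python
import re

pattern = re.compile(r'(.)*')
text = 'abc'

# Each entry: (span, whole match, contents of group 1).
seen = [(m.span(), m.group(0), m.group(1)) for m in pattern.finditer(text)]
for entry in seen:
    print(entry)

# findall reports only group 1; a group that took no part shows up as ''.
print(pattern.findall(text))

# A pattern that cannot match the empty string gives no trailing empty result.
print(re.findall(r'/x(.*?)x/', '/xabcx/'))
```

The first match covers the whole string but group 1 keeps only its last repetition; the second match is the empty one at the end of the string.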
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 8:50 am, Paul McGuire [EMAIL PROTECTED] wrote:
 On Apr 27, 9:10 am, proctor [EMAIL PROTECTED] wrote:



  On Apr 27, 1:33 am, Paul McGuire [EMAIL PROTECTED] wrote:

   On Apr 27, 1:33 am, proctor [EMAIL PROTECTED] wrote:

hello,

i have a regex:  rx_test = re.compile('/x([^x])*x/')

which is part of this test program:



import re

rx_test = re.compile('/x([^x])*x/')

s = '/xabcx/'

if rx_test.findall(s):
    print rx_test.findall(s)



i expect the output to be ['abc'] however it gives me only the last
single character in the group: ['c']

C:\testpython retest.py
['c']

can anyone point out why this is occurring?  i can capture the entire
group by doing this:

rx_test = re.compile('/x([^x]+)*x/')
but why isn't the 'star' grabbing the whole group?  and why isn't each
letter 'a', 'b', and 'c' present, either individually, or as a group
(group is expected)?

any clarification is appreciated!

sincerely,
proctor

   As Josiah already pointed out, the * needs to be inside the grouping
   parens.

   Since re's do lookahead/backtracking, you can also write:

   rx_test = re.compile('/x(.*?)x/')

   The '?' is there to make sure the .* repetition stops at the first
   occurrence of x/.
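
The difference the '?' makes can be checked on a string holding two delimited sections. A small sketch in Python 3, with a sample string invented for the purpose:

```python
import re

s = '/xabcx/ /xdefx/'  # made-up sample with two delimited sections

greedy = re.findall(r'/x(.*)x/', s)   # .* runs on to the last 'x/' in the string
lazy = re.findall(r'/x(.*?)x/', s)    # .*? stops at the first 'x/' it can reach
print(greedy)
print(lazy)
```

The greedy form swallows both sections as one match, while the lazy form returns them separately.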

   -- Paul

  i am working through an example from the oreilly book mastering
  regular expressions (2nd edition) by jeffrey friedl.  my post was a
  snippet from a regex to match C comments.   every 'x' in the regex
  represents a 'star' in actual usage, so that backslash escaping is not
  needed in the example (on page 275).  it looks like this:

  ===

  /x([^x]|x+[^/x])*x+/

  it is supposed to match '/x', the opening delimiter, then

  (
  either anything that is 'not x',

  or,

  'x' one or more times, 'not followed by a slash or an x'
  ) any number of times (the 'star')

  followed finally by the closing delimiter.

  ===

  this does not seem to work in python the way i understand it should
  from the book, and i simplified the example in my first post to
  concentrate on just one part of the alternation that i felt was not
  acting as expected.
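
For reference, here is roughly how the pattern described above might look in Python 3 with real '*' characters in place of 'x'. It is a sketch rather than the book's exact listing: the group is written as non-capturing, (?:...), so that findall returns each whole comment, and the sample source string is invented.

```python
import re

# Pattern with real '*' characters; (?:...) groups without capturing, so
# findall returns each whole match instead of a group's last repetition.
c_comment = re.compile(r'/\*(?:[^*]|\*+[^/*])*\*+/')

source = 'a = 1; /* one */ b = 2; /** two **/ c = 3;'
comments = c_comment.findall(source)
print(comments)
```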

  so my question remains, why doesn't the star quantifier seem to grab
  all the data.  isn't findall() intended to return all matches?  i
  would expect either 'abc' or 'a', 'b', 'c' or at least just
  'a' (because that would be the first match).  why does it give only
  one letter, and at that, the /last/ letter in the sequence??

  thanks again for replying!

  sincerely,
  proctor

 Again, I'll repeat some earlier advice:  you need to move the '*'
 inside the parens - you are still leaving it outside.  Also, get in
 the habit of using raw literal notation (that is r"slkjdfljf" instead
 of "lsjdlfkjs") when defining re strings - you don't have backslash
 issues yet, but you will as soon as you start putting real '*'
 characters in your expression.

 However, when I test this,

 restr = r'/x(([^x]|x+[^/])*)x+/'
 re_ = re.compile(restr)
 print re_.findall("/xabxxcx/ /x123xxx/")

 findall now starts to give a tuple for each comment,

 [('abxxc', 'xxc'), ('123xx', 'xx')]

 so you have gone beyond my limited re skill, and will need help from
 someone else.

 But I suggest you add some tests with multiple consecutive 'x'
 characters in the middle of your comment, and multiple consecutive 'x'
 characters before the trailing comment.  In fact, from my
 recollections of trying to implement this type of comment recognizer
 by hand a long time ago in a job far, far away, test with both even
 and odd numbers of 'x' characters.

 -- Paul

thanks paul,

the reason the regex now gives tuples is that there are now 2 groups,
the inner and outer parens.  so group 1 matches with the star, and
group 2 matches without the star.
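
If only the outer group is wanted, one option is to make the inner parentheses non-capturing with (?:...), so findall goes back to returning plain strings. A sketch in Python 3 using the same test string as above; the names two_groups and one_group are just for illustration.

```python
import re

s = '/xabxxcx/ /x123xxx/'

# Two capturing groups: findall returns a tuple per match.
two_groups = re.findall(r'/x(([^x]|x+[^/])*)x+/', s)

# Inner group made non-capturing: findall returns one string per match.
one_group = re.findall(r'/x((?:[^x]|x+[^/])*)x+/', s)

print(two_groups)
print(one_group)
```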

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-18 Thread Bill Mill

Gabriel Genellina wrote:
 At Tuesday 16/1/2007 16:36, Bill  Mill wrote:

   py> import re
   py> rgx = re.compile('(1)?')
   py> rgx.search('a1').groups()
   (None,)
   py> rgx = re.compile('(1)+')
   py> rgx.search('a1').groups()
   ('1',)
 
 But shouldn't the ? be greedy, and thus prefer the one match to the
 zero? This is my sticking point - I've seen that plus works, and this
 just confuses me more.

 Perhaps you have misunderstood what search does.
 search( pattern, string[, flags])
  Scan through string looking for a location where the regular
 expression pattern produces a match

 '1?' means 0 or 1 times '1', i.e., nothing or a single '1'.
 At the start of the target string, 'a1', we have nothing, so the re
 matches, and returns that occurrence. It doesn't matter that a few
 characters later there is *another* match, even if it is longer; once
 a match is found, the scan is done.
 If you want the longest match of all possible matches along the
 string, you should collect them all with findall() and pick the
 longest, instead of using search().


That is exactly what I misunderstood. Thank you very much.

-Bill Mill
bill.mill at gmail.com

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-17 Thread Gabriel Genellina


At Tuesday 16/1/2007 16:36, Bill  Mill wrote:


 py> import re
 py> rgx = re.compile('(1)?')
 py> rgx.search('a1').groups()
 (None,)
 py> rgx = re.compile('(1)+')
 py> rgx.search('a1').groups()
 ('1',)

But shouldn't the ? be greedy, and thus prefer the one match to the
zero? This is my sticking point - I've seen that plus works, and this
just confuses me more.


Perhaps you have misunderstood what search does.
search( pattern, string[, flags])
Scan through string looking for a location where the regular 
expression pattern produces a match


'1?' means 0 or 1 times '1', i.e., nothing or a single '1'.
At the start of the target string, 'a1', we have nothing, so the re 
matches, and returns that occurrence. It doesn't matter that a few
characters later there is *another* match, even if it is longer; once
a match is found, the scan is done.
If you want the longest match of all possible matches along the
string, you should collect them all with findall() and pick the
longest, instead of using search().
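
To make the difference concrete, the sketch below (Python 3, illustrative names only) shows where search() stops and then gathers every non-overlapping match with finditer() so the longest can be picked.

```python
import re

pattern = re.compile(r'1?')
text = 'a1'

# search() returns at the first location where the pattern can match,
# which here is the empty match at position 0.
first = pattern.search(text)
print(first.span(), repr(first.group(0)))

# Collect every non-overlapping match, then choose the longest one.
all_matches = [m.group(0) for m in pattern.finditer(text)]
longest = max(all_matches, key=len)
print(repr(longest))
```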



--
Gabriel Genellina
Softlab SRL

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-16 Thread Bill Mill

James Stroud wrote:
 Bill Mill wrote:
  Hello all,
 
  I've got a test script:
 
   start python code =
 
  import re

  tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
            "item1: alpha; item3 - gamma--"]

  def test_re(regex):
      r = re.compile(regex, re.MULTILINE)
      for test in tests2:
          res = r.search(test)
          if res:
              print res.groups()
          else:
              print "Failed"
 
   end python code 
 
  And a simple question:
 
  Why does the first regex that follows successfully grab beta, while
  the second one doesn't?
 
  In [131]: test_re(r"(?:item2: (.*?)\.)")
  ('beta',)
  Failed

  In [132]: test_re(r"(?:item2: (.*?)\.)?")
  (None,)
  (None,)
 
  Shouldn't the '?' greedily grab the group match?
 
  Thanks
  Bill Mill
  bill.mill at gmail.com

 The question-mark matches at zero or one. The first match will be a
 group with nothing in it, which satisfies the zero condition. Perhaps
 you mean +?

 e.g.

 py> import re
 py> rgx = re.compile('(1)?')
 py> rgx.search('a1').groups()
 (None,)
 py> rgx = re.compile('(1)+')
 py> rgx.search('a1').groups()
 ('1',)

But shouldn't the ? be greedy, and thus prefer the one match to the
zero? This is my sticking point - I've seen that plus works, and this
just confuses me more.

-Bill Mill
bill.mill at gmail.com

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-12 Thread James Stroud

Bill Mill wrote:
 Hello all,
 
 I've got a test script:
 
  start python code =
 
 import re

 tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
           "item1: alpha; item3 - gamma--"]

 def test_re(regex):
     r = re.compile(regex, re.MULTILINE)
     for test in tests2:
         res = r.search(test)
         if res:
             print res.groups()
         else:
             print "Failed"
 
  end python code 
 
 And a simple question:
 
 Why does the first regex that follows successfully grab beta, while
 the second one doesn't?
 
 In [131]: test_re(r"(?:item2: (.*?)\.)")
 ('beta',)
 Failed

 In [132]: test_re(r"(?:item2: (.*?)\.)?")
 (None,)
 (None,)
 
 Shouldn't the '?' greedily grab the group match?
 
 Thanks
 Bill Mill
 bill.mill at gmail.com

The question-mark matches at zero or one. The first match will be a 
group with nothing in it, which satisfies the zero condition. Perhaps 
you mean +?

e.g.

py> import re
py> rgx = re.compile('(1)?')
py> rgx.search('a1').groups()
(None,)
py> rgx = re.compile('(1)+')
py> rgx.search('a1').groups()
('1',)
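
For the two test strings in the question, one possible way to keep the group optional and still pick up 'beta' is to put a lazy '.*?' inside the optional group, so the group itself can reach 'item2:' from position 0. This is only a sketch in Python 3 and an assumption about what is wanted; the name optional_item2 is made up.

```python
import re

tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
          "item1: alpha; item3 - gamma--"]

# The lazy '.*?' sits inside the optional group, so the group can skip
# ahead to 'item2:' while the match still starts at position 0. The
# greedy '?' then prefers one repetition of the group over zero.
optional_item2 = re.compile(r'(?:.*?item2: (.*?)\.)?')

results = [optional_item2.search(t).groups() for t in tests2]
print(results)
```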

James
-- 
http://mail.python.org/mailman/listinfo/python-list

Regex Question

2007-01-10 Thread Bill Mill

Hello all,

I've got a test script:

 start python code =

import re

tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
          "item1: alpha; item3 - gamma--"]

def test_re(regex):
    r = re.compile(regex, re.MULTILINE)
    for test in tests2:
        res = r.search(test)
        if res:
            print res.groups()
        else:
            print "Failed"

 end python code 

And a simple question:

Why does the first regex that follows successfully grab beta, while
the second one doesn't?

In [131]: test_re(r"(?:item2: (.*?)\.)")
('beta',)
Failed

In [132]: test_re(r"(?:item2: (.*?)\.)?")
(None,)
(None,)

Shouldn't the '?' greedily grab the group match?

Thanks
Bill Mill
bill.mill at gmail.com
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-01-08 Thread proctor


Paul McGuire wrote:
 proctor [EMAIL PROTECTED] wrote in message
 news:[EMAIL PROTECTED]...
  hello,
 
  i hope this is the correct place...
 
  i have an issue with some regex code i wonder if you have any insight:
 
  

 There's nothing actually *wrong* with your regex.  The problem is your
 misunderstanding of raw string notation.  In building up your regex, do not
 start the string with "r'" and end it with a "'".

 def makeRE(w):
     print w + " length = " + str(len(w))
     # reString = "r'" + w[:1]
     reString = w[:1]
     w = w[1:]
     if len(w) > 0:
         for c in (w):
             reString += "|" + c
     # reString += "'"
     print "reString = " + reString
     return reString

 Or even better:

 def makeRE(w):
     print w + " length = " + str(len(w))
     reString = "|".join(list(w))
     return reString

 Raw string notation is intended to be used when the string literal is in
 your Python code itself, for example, this is a typical use for raw strings:

 ipAddrRe = r'\d{1,3}(\.\d{1,3}){3}'

 If I didn't have raw string notation to use, I'd have to double up all the
 backslashes, as:

 ipAddrRe = '\\d{1,3}(\\.\\d{1,3}){3}'

 But no matter which way I create the string, it does not actually start with
 r' and end with ', those are just notations for literals that are part
 of your Python source.
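
The point that both spellings produce the same string object contents can be checked directly. A short sketch in Python 3; the sample address is made up.

```python
import re

raw_form = r'\d{1,3}(\.\d{1,3}){3}'
escaped_form = '\\d{1,3}(\\.\\d{1,3}){3}'

# Both literals denote exactly the same characters.
print(raw_form == escaped_form)

# The string starts with a backslash, not with the letter r or a quote.
print(raw_form[0])

# Either one works as a pattern (sample address invented for the test).
print(bool(re.fullmatch(raw_form, '42.120.161.7')))
```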

 Does this give you a better idea of what is happening?

 -- Paul

yes!  thanks so much.

it does work now...however, one more question:  when i type:

rx_a = re.compile(r'a|b|c')
it works correctly!

shouldn't:
rx_a = re.compile(makeRE(test))
give the same result since makeRE(test) returns the string r'a|b|c'

are you saying that the r' and ' are being interpreted differently
in the second case than in the first?  if so, how would i go about
using raw string notation in such a circumstance (perhaps if i need to
escape \b or the like)?  do i have to double up in this case?

proctor.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-01-08 Thread Steven D'Aprano

On Sun, 07 Jan 2007 23:57:00 -0800, proctor wrote:

 it does work now...however, one more question:  when i type:
 
 rx_a = re.compile(r'a|b|c')
 it works correctly!
 
 shouldn't:
 rx_a = re.compile(makeRE(test))
 give the same result since makeRE(test) returns the string r'a|b|c'

Those two strings are NOT the same.

>>> s1 = r'a|b|c'
>>> s2 = "r'a|b|c'"
>>> print s1, len(s1)
a|b|c 5
>>> print s2, len(s2)
r'a|b|c' 8

A string with a leading r *outside* the quotation marks is a raw-string.
The r is not part of the string, but part of the delimiter.

A string with a leading r *inside* the quotation marks is just a string
with a leading r. It has no special meaning.
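
As for building a pattern at run time: raw notation only affects literals in source code, so a string assembled by a function needs no r at all. If the pieces might contain regex metacharacters, re.escape can take care of the backslashes. A minimal sketch in Python 3; make_alternation is a hypothetical helper, not something from the thread.

```python
import re

def make_alternation(chars):
    # Hypothetical helper: escape each character, then join with '|'.
    return '|'.join(re.escape(c) for c in chars)

pattern = make_alternation('abc')
print(pattern, len(pattern))

rx_a = re.compile(pattern)
found = rx_a.findall('cabbage')
print(found)

# A metacharacter such as '.' comes back with a real backslash in front.
dotted = make_alternation('a.b')
print(dotted)
```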



-- 
Steven D'Aprano 

-- 
http://mail.python.org/mailman/listinfo/python-list
