Re: string encoding regex problem

2014-08-23 Thread Philipp Kraus

Hi,

On 2014-08-16 09:01:57 +, Peter Otten said:


Philipp Kraus wrote:

The code worked correctly until last week; I didn't change the pattern.


Websites' contents and structure change sometimes.

My question is, can it be a problem with string encoding? 


Your regex is all-ascii. So an encoding problem is very unlikely.

found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
                   data)



Did I mask the question mark and quotes
correctly?


Yes.

A quick check...

>>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
>>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z:  released on 2014-08-14 16:35:00 UTC">']


...reveals that the matching link has /boost-docs/ in its title, so the
 site contents probably did change. 


I have created a short script:

-
#!/usr/bin/env python

import re, urllib2


def URLReader(url) :
   f = urllib2.urlopen(url)
   data = f.read()
   f.close()
   return data


print re.match( \small\ \.*\\/small\, 
URLReader(http://sourceforge.net/projects/boost/;) )

-

Within the data the string <small>boost_1_56_0.tar.gz</small> should
be matched, but re.match always returns None, and re.search
returns None as well.
I have tested the regex at http://regex101.com/ with the HTML code,
and there the regex matches.


Can you please help me fix the problem? I don't understand why the
match returns None.


Thanks

Phil

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: string encoding regex problem

2014-08-23 Thread Peter Otten
Philipp Kraus wrote:

 I have created a short script:
 
 -
 #!/usr/bin/env python
 
 import re, urllib2
 
 
 def URLReader(url) :
 f = urllib2.urlopen(url)
 data = f.read()
 f.close()
 return data
 
 
 print re.match( \small\ \.*\\/small\,
 URLReader(http://sourceforge.net/projects/boost/;) )
 -
 
 Within the data the string <small>boost_1_56_0.tar.gz</small> should
 be matched, but re.match always returns None, and re.search
 returns None as well.

 help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning
a match object, or None if no match was found.

As the string doesn't start with a match for your regex, re.match() is clearly
the wrong function, but re.search() works for me:

 import re, urllib2
 
 
 def URLReader(url) :
... f = urllib2.urlopen(url)
... data = f.read()
... f.close()
... return data
... 
 data = URLReader(http://sourceforge.net/projects/boost/;)
 re.search(\small\ \.*\\/small\, data)
_sre.SRE_Match object at 0x7f282dd58718
 _.group()
'small boost_1_56_pdf.7z/small'
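The difference between the two functions can be reproduced without touching the network. The snippet below is only a sketch in Python 3 syntax: the `html` string is a made-up stand-in for the downloaded page, and the pattern is a simplified one, not the poster's exact regex.

```python
import re

# Hypothetical stand-in for the downloaded page (the real HTML is not shown in the thread).
html = '<html><body><small>boost_1_56_0.tar.gz</small></body></html>'
pattern = r'<small>.*?</small>'

at_start = re.match(pattern, html)    # only tries to match at position 0
anywhere = re.search(pattern, html)   # scans forward through the whole string

print(at_start)           # None, because the page does not begin with <small>
print(anywhere.group())   # <small>boost_1_56_0.tar.gz</small>
```

re.match() is only the right choice when the pattern is meant to match from the very first character of the string.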


 I have tested the regex under http://regex101.com/ with the HTML code
 and on the page the regex is matched.
 
 Can you help me please to fix the problem, I don't understand that the
 match returns None




Re: string encoding regex problem

2014-08-16 Thread Peter Otten
Philipp Kraus wrote:

 The code worked correctly until last week; I didn't change the pattern.

Websites' contents and structure change sometimes.

 My question is, can it be a problem with string encoding? 

Your regex is all-ascii. So an encoding problem is very unlikely.

 found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
                    data)

 Did I mask the question mark and quotes
 correctly?

Yes.

A quick check...

>>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
>>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z:  released on 2014-08-14 16:35:00 UTC">']

...reveals that the matching link has /boost-docs/ in its title, so the
 site contents probably did change. 
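The same check can be run offline. This is a sketch in Python 3 syntax; `page` is a one-line reconstruction based on the output quoted above (the real page is much larger), and the names `strict` and `loose` are illustrative only.

```python
import re

# One-line reconstruction of the relevant link, based on the output shown in the thread.
page = ('<a href="/projects/boost/files/latest/download?source=files" '
        'title="/boost-docs/1.56.0/boost_1_56_pdf.7z:  released on 2014-08-14 16:35:00 UTC">')

# The original pattern insists on title="/boost/ ...
strict = r'<a href="/projects/boost/files/latest/download\?source=files" title="/boost/(.*)'
# ... while a looser pattern shows what is actually on the page.
loose = r'/projects/boost/files/latest/download\?source=files[^>]*'

strict_hit = re.search(strict, page)
hits = re.findall(loose, page)

print(strict_hit)   # None: the title now starts with /boost-docs/, not /boost/
print(hits)
```

Loosening a pattern until it matches, and then reading what it matched, is a quick way to see which part of the strict pattern no longer fits the data.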




string encoding regex problem

2014-08-15 Thread Philipp Kraus

Hello,

I have defined a function with:

def URLReader(url) :
    try :
        f = urllib2.urlopen(url)
        data = f.read()
        f.close()
    except Exception, e :
        raise MyError.StopError(e)
    return data

which gets the HTML source code from a URL. I use this to extract part of
an HTML document without any HTML parsing, so I call (I would like to
get the download link of the boost library):


found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
    Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
)

if found == None :
    raise MyError.StopError("Boost Download URL not found")

But found is always None, so I cannot get the correct match. I didn't 
find the error in my code.


Thanks for your help

Phil


Re: string encoding regex problem

2014-08-15 Thread Roy Smith
In article lsm8ic$j90$1...@online.de,
 Philipp Kraus philipp.kr...@flashpixx.de wrote:

 found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
     Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
 )
 if found == None :
     raise MyError.StopError("Boost Download URL not found")
 
 But found is always None, so I cannot get the correct match. I didn't 
 find the error in my code.

I would start by breaking this down into pieces.  Something like:

 data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
 print data
 found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
                    data)
 if found == None :
     raise MyError.StopError("Boost Download URL not found")

Now at least you get to look at what URLReader() returned.  Did it 
return what you expected?  If not, then there might be something wrong 
in your URLReader() function.  If it is what you expected, then I would 
start looking at the pattern to see if it's correct.  Either way, you've 
managed to halve the size of the problem.


Re: string encoding regex problem

2014-08-15 Thread Philipp Kraus

On 2014-08-16 00:48:46 +, Roy Smith said:


In article lsm8ic$j90$1...@online.de,
 Philipp Kraus philipp.kr...@flashpixx.de wrote:


found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
    Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
)
if found == None :
    raise MyError.StopError("Boost Download URL not found")

But found is always None, so I cannot get the correct match. I didn't
find the error in my code.


I would start by breaking this down into pieces.  Something like:

data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
print data
found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
                   data)
if found == None :
    raise MyError.StopError("Boost Download URL not found")


Now at least you get to look at what URLReader() returned.  Did it
return what you expected?  If not, then there might be something wrong
in your URLReader() function.


I have checked the result of the URLReader call (sorry, I forgot to mention
this in my first post). The URLReader
returns the HTML code of the URL, so this seems to work correctly.


 If it is what you expected, then I would
start looking at the pattern to see if it's correct.  Either way, you've
managed to halve the size of the problem.


The code worked correctly until last week, and I didn't change the pattern. My
question is: can it be
a problem with string encoding? Did I mask the question mark and quotes
correctly?


Phil




Re: string encoding regex problem

2014-08-15 Thread Roy Smith
In article lsmeej$49n$1...@online.de,
 Philipp Kraus philipp.kr...@flashpixx.de wrote:

 The code worked correctly until last week; I didn't change the pattern.

OK, so what did you change?  Can you go back to last week's code and 
compare it to what you have now to see what changed?

 My question is, can it be a problem with string encoding? Did I mask 
 the question mark and quotes correctly?

The best thing to do with regular expressions is to use raw strings, 
i.e. r'this is a string'.  The nice thing about that is backslashes are 
not special.  It makes it about 1000% easier to write complicated 
regular expressions.  Simple ones are only 500% easier.
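A small illustration of the point about raw strings, written as a sketch in Python 3 syntax. The sample text is made up; only the `download\?source=files` fragment comes from the thread.

```python
import re

# In a plain literal every regex backslash has to be doubled ...
plain = "download\\?source=files"
# ... while a raw literal passes the backslash through unchanged.
raw = r"download\?source=files"
same = (plain == raw)
print(same)   # True

hit = re.search(raw, "latest/download?source=files")
print(hit.group())   # download?source=files

# Where it really matters: "\b" is a backspace character in a plain string,
# but the two characters backslash + b (a regex word boundary) in a raw one.
lengths = (len("\b"), len(r"\b"))
print(lengths)   # (1, 2)
```

With raw strings the pattern on the screen is the pattern the re module sees, which removes one whole layer of escaping from the debugging.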


Re: string encoding regex problem

2014-08-15 Thread Steven D'Aprano
Philipp Kraus wrote:

 The code worked correctly until last week, and I didn't change the pattern. My
 question is: can it be
 a problem with string encoding? Did I mask the question mark and quotes
 correctly?

If you didn't change the code, how could the *exact same code* not mask the
question mark last week, but this week suddenly start masking it, despite
not changing?

There are three things that can cause a change in behaviour:

- the re module has changed;

- the pattern has changed;

- the text you are searching has changed.

Have you removed the re module and replaced it with a different one? Did you
update Python to a new version?

Have you changed the regex search pattern?

Has the text you are searching changed? Websites upgrade their HTML quite
frequently. Perhaps the Boost website has changed enough to break your
regex.


-- 
Steven



Re: regex (?!..) problem

2009-10-06 Thread Hans Mulder

Stefan Behnel wrote:

Wolfgang Rohdewald wrote:

I want to match a string only if a word (C1 in this example) appears
at most once in it.


def match(s):
    if s.count("C1") > 1:
        return None
    return s

If this doesn't fit your requirements, you may want to provide some more
details.


That will return a false value if s is the empty string.

How about:

def match(s):
    return s.count("C1") <= 1

-- HansM


Re: regex (?!..) problem

2009-10-05 Thread Carl Banks
On Oct 4, 9:34 pm, Wolfgang Rohdewald wolfg...@rohdewald.de wrote:
 Hi,

 I want to match a string only if a word (C1 in this example) appears
 at most once in it. This is what I tried:

 >>> re.match(r'(.*?C1)((?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
 ('C1b1b1b1 b3b3b3b3 C1', '')
 >>> re.match(r'(.*?C1)','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
 ('C1',)

 but this should not have matched. Why is the .*? behaving greedy
 if followed by (?!.*C1)?

It's not.

 I would have expected that re first
 evaluates (.*?C1) before proceeding at all.

It does.

What you're not realizing is that if a regexp search comes to a dead
end, it won't simply return no match.  Instead it'll throw away part
of the match, and backtrack to a previously-matched variable-length
subexpression, such as .*?, and try again with a different length.

That's what happened above.  At first the group (.*?C1) non-greedily
matched the substring C1, but it couldn't find a match under those
circumstances, so it backtracked to the .*? and looked for a longer
match, which it found.
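The backtracking can be watched directly with the poster's own test string. This is a sketch in Python 3 syntax; the variable names are illustrative.

```python
import re

text = 'C1b1b1b1 b3b3b3b3 C1C2C3'

# Without the lookahead, the lazy group stops at the first C1.
m_plain = re.match(r'(.*?C1)', text)
# With the lookahead, the first attempt fails (another C1 follows),
# so the engine backtracks and lets .*? consume more text.
m_look = re.match(r'(.*?C1)(?!.*C1)', text)

print(m_plain.group(1))   # C1
print(m_look.group(1))    # C1b1b1b1 b3b3b3b3 C1
```

The lazy quantifier only decides which length is tried first; when the rest of the pattern fails, longer lengths are still tried.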

Here's something to keep in mind: except for a few corner cases,
greedy versus non-greedy will not affect the substring matched, it'll
only affect the groups.


 I also tried:

 >>> re.search(r'(.*?C1(?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3C4').groups()
 ('C1b1b1b1 b3b3b3b3 C1',)

 with the same problem.

 How could this be done?

Can't be done with regexps.

How you would do this kind of depends on your overall goals, but your
first look should be toward the string methods.  If you share details
with us we can help you choose a better strategy.


Carl Banks


Re: regex (?!..) problem

2009-10-05 Thread Wolfgang Rohdewald
On Monday 05 October 2009, Carl Banks wrote:
 What you're not realizing is that if a regexp search comes to a
  dead end, it won't simply return no match.  Instead it'll throw
  away part of the match, and backtrack to a previously-matched
  variable-length subexpression, such as .*?, and try again with a
  different length.

well, that explains it. This is contrary to what the documentation
says, though. Should I file a bug report?
http://docs.python.org/library/re.html

Now back to my original problem: Would you have any idea how
to solve it?

count() is no solution in my case, I need re.search to either
return None or a match.

-- 
Wolfgang


Re: regex (?!..) problem

2009-10-05 Thread Stefan Behnel
Wolfgang Rohdewald wrote:
 I want to match a string only if a word (C1 in this example) appears
 at most once in it.

def match(s):
    if s.count("C1") > 1:
        return None
    return s

If this doesn't fit your requirements, you may want to provide some more
details.

Stefan


Re: regex (?!..) problem

2009-10-05 Thread Carl Banks
On Oct 4, 11:17 pm, Wolfgang Rohdewald wolfg...@rohdewald.de wrote:
 On Monday 05 October 2009, Carl Banks wrote:

  What you're not realizing is that if a regexp search comes to a
   dead end, it won't simply return no match.  Instead it'll throw
   away part of the match, and backtrack to a previously-matched
   variable-length subexpression, such as .*?, and try again with a
   different length.

 well, that explains it. This is contrary to what the documentation
 says, though. Should I file a bug report?
 http://docs.python.org/library/re.html

If you're referring to the section where it explains greedy
qualifiers, it is not wrong per se.  re.match does exactly what the
documentation says: it matches as few characters as possible to the
non-greedy pattern.

However, since it's easy to misconstrue that if you don't know about
regexp backtracking, perhaps a little mention of backtracking is
warranted.  IMO it's not a documentation bug, so if you want to file a
bug report I'd recommend filing as a wishlist item.

I will mention that my followup contained an error (which you didn't
quote).  I said greedy versus non-greedy doesn't affect the substring
matched.  That was wrong, it does affect the substring matched; what
it doesn't affect is whether there is a match found.
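A minimal sketch of that corrected statement, in Python 3 syntax with a made-up sample string: both variants find a match, but they match different substrings.

```python
import re

text = 'aXbYb'

greedy = re.search(r'a.*b', text).group()    # takes as much as possible
lazy = re.search(r'a.*?b', text).group()     # takes as little as possible

print(greedy)   # aXbYb
print(lazy)     # aXb
```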


 Now back to my original problem: Would you have any idea how
 to solve it?

 count() is no solution in my case, I need re.search to either
 return None or a match.

Why do you have to use a regexp at all?

In Python we recommend using string operations and methods whenever
reasonable, and avoiding regexps unless you specifically need their
extra power.  String operations can easily do the examples you posted,
so I see no reason to use regexps.

Depending on what you want to do with the result, one of the following
functions should be close to what you need.  (I am using "word" to
refer to the string being matched against, "token" to be the thing you
don't want to appear more than once.)


def token_appears_once(word,token):
    return word.count(token) == 1

def parts(word,token):
    head,sep,tail = word.partition(token)
    if sep == "" or token in tail:
        return None
    return (head,sep,tail)


If you really need a match object, you should do a search, and then
call the .count method on the matched substring to see if there is
more than one occurrence, like this:

def match_only_if_token_appears_once(pattern,word,token):
    m = re.search(pattern,word)
    # guard against "no match" before inspecting the matched text
    if m is not None and m.group(0).count(token) != 1:
        m = None
    return m


Carl Banks


Re: regex (?!..) problem

2009-10-05 Thread Wolfgang Rohdewald
On Monday 05 October 2009, Stefan Behnel wrote:
 Wolfgang Rohdewald wrote:
  I want to match a string only if a word (C1 in this example)
  appears at most once in it.
 
 def match(s):
     if s.count("C1") > 1:
         return None
     return s
 
 If this doesn't fit your requirements, you may want to provide some
  more details.

Well - the details are simple and already given: I need re.search
to either return None or a match. But I will try to state it
differently:

I have a string representing the results for a player of a board
game (Mah Jongg - not the solitaire but the real one, played by
4 players), and I have a list of scoring rules. Those rules
can be modified by the user, he can also add new rules. Mah Jongg
is played with very different rulesets worldwide.

The rules are written as regular expressions. Since what they
do varies greatly I do not want to treat some of them in a special
way. That would theoretically be possible but it would really
complicate things.

For each rule I simply need to check whether it applies or not.
I do that by calling re.search(rule, gamestring) and by checking
the result against None.

Here you can look at all rules I currently have.
http://websvn.kde.org/trunk/playground/games/kmj/src/predefined.py?view=markup
The rule I want to rewrite is called "Robbing the Kong". Of
course it is more complicated than my example with C1.

Here you can find the documentation for the gamestring:
http://websvn.kde.org/trunk/playground/games/doc/kmj/index.docbook?revision=1030476&view=markup
(get HTML files with meinproc index.docbook)

-- 
Wolfgang


Re: regex (?!..) problem

2009-10-05 Thread Wolfgang Rohdewald
On Monday 05 October 2009, MRAB wrote:
 (?!.*?(C1).*?\1) will succeed only if .*?(C1).*?\1 has failed,
  in which case the group (group 1) will be undefined (no capture).

I see. 

I should have moved the (C1) out of this expression anyway:

>>> re.match(r'L(?P<tile>..)(?!.*?(?P=tile).*?(?P=tile))(.*?(?P=tile))','LC1 C1B1B1B1 b3b3b3b3 C2C2C3').groups()
('C1', ' C1')

this solves my problem, thank you!
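For the simplified question from the start of the thread (a token may appear at most once in the whole string), a single negative lookahead anchored at the start also does the job with re.search. This is an alternative sketch in Python 3 syntax, not something posted in the thread; the helper name `at_most_once` is hypothetical.

```python
import re

def at_most_once(token):
    """Compile a rule that matches only strings containing token 0 or 1 times."""
    # (?:.*TOKEN){2} succeeds when the token can be found twice;
    # the negative lookahead at the start rejects exactly those strings.
    return re.compile(r'^(?!(?:.*%s){2})' % re.escape(token), re.DOTALL)

rule = at_most_once('C1')

once = rule.search('C1b1b1b1 b3b3b3b3 C2C3')
twice = rule.search('C1b1b1b1 b3b3b3b3 C1C2C3')

print(once is not None)   # True
print(twice)              # None
```

Because the result is either a match object or None, such a rule fits the "re.search(rule, gamestring)" scheme described earlier without special-casing.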


-- 
Wolfgang


regex (?!..) problem

2009-10-04 Thread Wolfgang Rohdewald
Hi,

I want to match a string only if a word (C1 in this example) appears
at most once in it. This is what I tried:

>>> re.match(r'(.*?C1)((?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
('C1b1b1b1 b3b3b3b3 C1', '')
>>> re.match(r'(.*?C1)','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
('C1',)

but this should not have matched. Why is the .*? behaving greedy
if followed by (?!.*C1)? I would have expected that re first 
evaluates (.*?C1) before proceeding at all.

I also tried:

>>> re.search(r'(.*?C1(?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3C4').groups()
('C1b1b1b1 b3b3b3b3 C1',)

with the same problem.

How could this be done?

-- 
Wolfgang


Re: regex (?!..) problem

2009-10-04 Thread n00m
Why not check it simply by count()?

>>> s = '1234C156789'
>>> s.count('C1')
1



Re: regex problem ..

2008-12-17 Thread Steve Holden
Analog Kid wrote:
 Hi guys:
 Thanks for your responses. Points taken. Basically, I am looking for a
 combination of the following ...
 [^\w] and %(?!20) ... How do I do this in a single RE?
 
 Thanks for all your help.
 Regards,
 AK
 
 On Mon, Dec 15, 2008 at 10:54 PM, Steve Holden st...@holdenweb.com
 mailto:st...@holdenweb.com wrote:
 
 Analog Kid wrote:
  Hi All:
  I am new to regular expressions in general, and not just re in python.
  So, apologies if you find my question stupid :) I need some help with
  forming a regex. Here is my scenario ...
  I have strings coming in from a list, each of which I want to check
  against a regular expression and see whether or not it qualifies. By
  that I mean I have a certain set of characters that are
 permissible and
  if the string has characters which are not permissible, I need to flag
  that string ... here is a snip ...
 
  flagged = list()
  strs = ['HELLO', 'Hi%20There', '123...@#@']
  p =  re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
  for s in strs:
      if len(p.findall(s)) > 0:
          flagged.append(s)
 
  print flagged
 
  my question is ... if I wanted to allow '%20' but not '%', how would my
  current regex (r"[^a-zA-Z0-9]") be modified?
 
 The essence of the approach is to observe that each element is a
 sequence of zero or more character, where character is either
 letter/digit or escape. So you would use a pattern like
 
 ([a-zA-Z0-9]|%[0-9a-f][0-9a-f])+
 
 
Did you *try* the above pattern?
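For the literal question of putting the two fragments into one expression, an alternation is one possibility. The sketch below (Python 3 syntax) is not from the thread; the name `bad` and the sample strings are illustrative, and note that in Python 3 `\w` also accepts the underscore and non-ASCII letters.

```python
import re

# A '%' not followed by '20', or any character that is neither a word character nor '%'.
bad = re.compile(r'%(?!20)|[^\w%]')

strs = ['HELLO', 'Hi%20There', 'Hi%There', '50%', 'a b']
flagged = [s for s in strs if bad.search(s)]

print(flagged)   # ['Hi%There', '50%', 'a b']
```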

regards
 Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: regex problem ..

2008-12-17 Thread Analog Kid
Hi guys:
Thanks for your responses. Points taken. Basically, I am looking for a
combination of the following ...
[^\w] and %(?!20) ... How do I do this in a single RE?

Thanks for all your help.
Regards,
AK

On Mon, Dec 15, 2008 at 10:54 PM, Steve Holden st...@holdenweb.com wrote:

 Analog Kid wrote:
  Hi All:
  I am new to regular expressions in general, and not just re in python.
  So, apologies if you find my question stupid :) I need some help with
  forming a regex. Here is my scenario ...
  I have strings coming in from a list, each of which I want to check
  against a regular expression and see whether or not it qualifies. By
  that I mean I have a certain set of characters that are permissible and
  if the string has characters which are not permissible, I need to flag
  that string ... here is a snip ...
 
  flagged = list()
  strs = ['HELLO', 'Hi%20There', '123...@#@']
  p =  re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
  for s in strs:
      if len(p.findall(s)) > 0:
          flagged.append(s)
 
  print flagged
 
  my question is ... if I wanted to allow '%20' but not '%', how would my
  current regex (r"[^a-zA-Z0-9]") be modified?
 
 The essence of the approach is to observe that each element is a
 sequence of zero or more character, where character is either
 letter/digit or escape. So you would use a pattern like

 ([a-zA-Z0-9]|%[0-9a-f][0-9a-f])+


 regards
  Steve
 --
 Steve Holden+1 571 484 6266   +1 800 494 3119
 Holden Web LLC  http://www.holdenweb.com/



Re: regex problem ..

2008-12-15 Thread Tino Wildenhain

Analog Kid wrote:

Hi All:
I am new to regular expressions in general, and not just re in python. 
So, apologies if you find my question stupid :) I need some help with 
forming a regex. Here is my scenario ...
I have strings coming in from a list, each of which I want to check 
against a regular expression and see whether or not it qualifies. By 
that I mean I have a certain set of characters that are permissible and 
if the string has characters which are not permissible, I need to flag 
that string ... here is a snip ...


flagged = list()
strs = ['HELLO', 'Hi%20There', '123...@#@']
p =  re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
for s in strs:
    if len(p.findall(s)) > 0:
        flagged.append(s)

print flagged

my question is ... if I wanted to allow '%20' but not '%', how would my 
current regex (r"[^a-zA-Z0-9]") be modified?


You might want to normalize before checking, e.g.

import re
from urllib import unquote

p=re.compile("[^a-zA-Z0-9 ]")
flagged=[]

for s in strs:
    if p.search(unquote(s)):
        flagged.append(s)

Be careful, however, if you want to show the
flagged ones back to the user. It is always best to
quote/unquote at the boundaries as appropriate.
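The same normalize-then-check idea, as a sketch for Python 3, where `unquote` lives in `urllib.parse` rather than `urllib`. The sample strings are made up for illustration.

```python
import re
from urllib.parse import unquote

# After unquoting, only letters, digits and the space are allowed.
p = re.compile(r'[^a-zA-Z0-9 ]')

strs = ['HELLO', 'Hi%20There', 'a%b', 'semi%3Bcolon']
flagged = [s for s in strs if p.search(unquote(s))]

print(flagged)   # ['a%b', 'semi%3Bcolon']
```

Unlike a pattern that special-cases `%20`, this accepts every escape that decodes to an allowed character and rejects escapes that decode to something else, such as `%3B` (a semicolon).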

Regards
Tino






regex problem ..

2008-12-15 Thread Analog Kid
Hi All:
I am new to regular expressions in general, and not just re in python. So,
apologies if you find my question stupid :) I need some help with forming a
regex. Here is my scenario ...
I have strings coming in from a list, each of which I want to check against
a regular expression and see whether or not it qualifies. By that I mean I
have a certain set of characters that are permissible and if the string has
characters which are not permissible, I need to flag that string ... here is
a snip ...

flagged = list()
strs = ['HELLO', 'Hi%20There', '123...@#@']
p =  re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
for s in strs:
    if len(p.findall(s)) > 0:
        flagged.append(s)

print flagged

my question is ... if I wanted to allow '%20' but not '%', how would my
current regex (r"[^a-zA-Z0-9]") be modified?

TIA,
AK


Re: regex problem ..

2008-12-15 Thread Steve Holden
Analog Kid wrote:
 Hi All:
 I am new to regular expressions in general, and not just re in python.
 So, apologies if you find my question stupid :) I need some help with
 forming a regex. Here is my scenario ...
 I have strings coming in from a list, each of which I want to check
 against a regular expression and see whether or not it qualifies. By
 that I mean I have a certain set of characters that are permissible and
 if the string has characters which are not permissible, I need to flag
 that string ... here is a snip ...
 
 flagged = list()
 strs = ['HELLO', 'Hi%20There', '123...@#@']
 p =  re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
 for s in strs:
     if len(p.findall(s)) > 0:
         flagged.append(s)
 
 print flagged
 
 my question is ... if I wanted to allow '%20' but not '%', how would my
 current regex (r"[^a-zA-Z0-9]") be modified?
 
The essence of the approach is to observe that each element is a
sequence of zero or more "characters", where a "character" is either
a letter/digit or an escape. So you would use a pattern like

([a-zA-Z0-9]|%[0-9a-f][0-9a-f])+
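Applied to the flagging loop, the pattern has to cover the whole string, so the sketch below uses `fullmatch` (Python 3.4+). The sample strings are made up, and the non-capturing group is a small variation on the pattern above.

```python
import re

# Every "character" is a letter/digit or a %xx escape with lowercase hex digits.
allowed = re.compile(r'(?:[a-zA-Z0-9]|%[0-9a-f][0-9a-f])+')

strs = ['HELLO', 'Hi%20There', 'bad%value', '100%']
flagged = [s for s in strs if not allowed.fullmatch(s)]

print(flagged)   # ['bad%value', '100%']
```

As written, the pattern accepts any two-digit lowercase-hex escape, not only `%20`; if only `%20` should pass, the second alternative can be replaced by the literal `%20`. Because of the `+`, an empty string is flagged as well.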


regards
 Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: regex problem with re and fnmatch

2007-11-21 Thread Fabian Braennstroem
Hi John,

John Machin schrieb am 11/20/2007 09:40 PM:
 On Nov 21, 8:05 am, Fabian Braennstroem [EMAIL PROTECTED] wrote:
 Hi,

 I would like to use re to search for lines in a files with
 the word README_x.org, where x is any number.
 E.g. the structure would look like this:
 [[file:~/pfm_v99/README_1.org]]

 I tried to use these kind of matchings:
 #org_files='.*README\_1.org]]'
 org_files='.*README\_*.org]]'
 if re.match(org_files,line):
 
 First tip is to drop the leading '.*' and use search() instead of
 match(). The second tip is to use raw strings always for your
 patterns.
 
 Unfortunately, it matches all entries with README.org, but
 not the wanted number!?
 
 \_* matches 0 or more occurrences of _ (the \ is redundant). You need
 to specify one or more digits -- use \d+ or [0-9]+
 
 The . in .org matches ANY character except a newline. You need to
 escape it with a \.
 
 >>> pat = r'README_\d+\.org'
 >>> re.search(pat, 'README.org')
 >>> re.search(pat, 'README_.org')
 >>> re.search(pat, 'README_1.org')
 <_sre.SRE_Match object at 0x00B899C0>
 >>> re.search(pat, 'README_.org')
 <_sre.SRE_Match object at 0x00B899F8>
 >>> re.search(pat, 'README_Zorg')


Thanks a lot, works really nice!

 After some splitting and replacing I am able to check, if
 the above file exists. If it does not, I start to search for
 it using the 'walk' procedure:
 
 I presume that you mean something like: .. check if the above file
 exists in some directory. If it does not, I start to search for  it
 somewhere else ...
 
 for root, dirs, files in os.walk("/home/fab/org"):
 
 for name in dirs:
 dirs=os.path.join(root, name) + '/'
 
 The above looks rather suspicious ...
 for thing in container:
 container = something_else
 
 What are you trying to do?
 
 
 for name in files:
  files=os.path.join(root, name)
 
 and again 
 
 if fnmatch.fnmatch(str(files), "README*"):
 
 Why str(name) ?
 
 print "File Found"
 print str(files)
 break
 
 
 fnmatch is not as capable as re; in particular it can't express one
 or more digits. To search a directory tree for the first file whose
 name matches a pattern, you need something like this:
 def find_one(top, pat):
     for root, dirs, files in os.walk(top):
         for fname in files:
             if re.match(pat + '$', fname):
                 return os.path.join(root, fname)
 
 
 As soon as it finds the file,
 
 the file or a file???
 
 Ummm ... aren't you trying to locate a file whose EXACT name you found
 in the first exercise??
 
 def find_it(top, required):
     for root, dirs, files in os.walk(top):
         if required in files:
             return os.path.join(root, required)

Great :-) Thanks a lot for your help... it can be so easy :-)
Fabian




regex problem with re and fnmatch

2007-11-20 Thread Fabian Braennstroem
Hi,

I would like to use re to search for lines in a file that contain
the word README_x.org, where x is any number.
E.g. the structure would look like this:
[[file:~/pfm_v99/README_1.org]]

I tried to use this kind of matching:
#org_files='.*README\_1.org]]'
org_files='.*README\_*.org]]'
if re.match(org_files,line):

Unfortunately, it matches all entries with README.org, but
not the wanted number!?

After some splitting and replacing I am able to check, if
the above file exists. If it does not, I start to search for
it using the 'walk' procedure:

for root, dirs, files in os.walk("/home/fab/org"):
    for name in dirs:
        dirs=os.path.join(root, name) + '/'
    for name in files:
        files=os.path.join(root, name)
        if fnmatch.fnmatch(str(files), "README*"):
            print "File Found"
            print str(files)
            break

As soon as it finds the file, it should stop the search;
but there is the same matching problem as above.
Does anyone have any suggestions about the regex problem?
Greetings!
Fabian



Re: regex problem with re and fnmatch

2007-11-20 Thread John Machin
On Nov 21, 8:05 am, Fabian Braennstroem [EMAIL PROTECTED] wrote:
 Hi,

 I would like to use re to search for lines in a files with
 the word README_x.org, where x is any number.
 E.g. the structure would look like this:
 [[file:~/pfm_v99/README_1.org]]

 I tried to use these kind of matchings:
 #org_files='.*README\_1.org]]'
 org_files='.*README\_*.org]]'
 if re.match(org_files,line):

First tip is to drop the leading '.*' and use search() instead of
match(). The second tip is to use raw strings always for your
patterns.


 Unfortunately, it matches all entries with README.org, but
 not the wanted number!?

\_* matches 0 or more occurrences of _ (the \ is redundant). You need
to specify one or more digits -- use \d+ or [0-9]+

The . in .org matches ANY character except a newline. You need to
escape it with a \.

>>> pat = r'README_\d+\.org'
>>> re.search(pat, 'README.org')
>>> re.search(pat, 'README_.org')
>>> re.search(pat, 'README_1.org')
<_sre.SRE_Match object at 0x00B899C0>
>>> re.search(pat, 'README_.org')
<_sre.SRE_Match object at 0x00B899F8>
>>> re.search(pat, 'README_Zorg')
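The two tips can be checked side by side. This is a sketch in Python 3 syntax; the file names in `names` are made up, and `old` is the poster's pattern with the redundant backslash before the underscore dropped.

```python
import re

pat = re.compile(r'README_\d+\.org')

names = ['README.org', 'README_.org', 'README_1.org', 'README_42.org', 'README_Zorg']
matching = [n for n in names if pat.search(n)]
print(matching)   # ['README_1.org', 'README_42.org']

# The original pattern: "_*" allows zero underscores and "." matches any character.
old = re.compile(r'.*README_*.org]]')
old_plain = bool(old.match('[[file:~/pfm_v99/README.org]]'))
old_numbered = bool(old.match('[[file:~/pfm_v99/README_1.org]]'))
print(old_plain, old_numbered)   # True False
```

This reproduces the reported symptom: the old pattern accepts README.org but not the numbered file.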



 After some splitting and replacing I am able to check, if
 the above file exists. If it does not, I start to search for
 it using the 'walk' procedure:

I presume that you mean something like: .. check if the above file
exists in some directory. If it does not, I start to search for  it
somewhere else ...


 for root, dirs, files in os.walk("/home/fab/org"):

 for name in dirs:
 dirs=os.path.join(root, name) + '/'

The above looks rather suspicious ...
for thing in container:
container = something_else

What are you trying to do?


 for name in files:
  files=os.path.join(root, name)

and again 

 if fnmatch.fnmatch(str(files), "README*"):

Why str(name) ?

 print "File Found"
 print str(files)
 break


fnmatch is not as capable as re; in particular it can't express one
or more digits. To search a directory tree for the first file whose
name matches a pattern, you need something like this:
def find_one(top, pat):
    for root, dirs, files in os.walk(top):
        for fname in files:
            if re.match(pat + '$', fname):
                return os.path.join(root, fname)


 As soon as it finds the file,

the file or a file???

Ummm ... aren't you trying to locate a file whose EXACT name you found
in the first exercise??

def find_it(top, required):
    for root, dirs, files in os.walk(top):
        if required in files:
            return os.path.join(root, required)
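A self-contained way to try the tree search is to build a throwaway directory first. The sketch below is in Python 3 syntax and uses `fullmatch` in place of `re.match(pat + '$', ...)`; the directory layout and file names are invented for the demonstration.

```python
import os
import re
import tempfile

def find_one(top, pat):
    """Return the path of the first file under top whose whole name matches pat, else None."""
    regex = re.compile(pat)
    for root, dirs, files in os.walk(top):
        for fname in files:
            if regex.fullmatch(fname):
                return os.path.join(root, fname)
    return None

with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: one sub-directory holding two README variants.
    sub = os.path.join(top, 'notes')
    os.makedirs(sub)
    for name in ('README.org', 'README_7.org'):
        open(os.path.join(sub, name), 'w').close()

    found = find_one(top, r'README_\d+\.org')
    missing = find_one(top, r'README_\d+\.txt')

print(os.path.basename(found))   # README_7.org
print(missing)                   # None
```

Returning from inside the loop is what stops the walk at the first hit, which is the behaviour the original `break` was meant to achieve.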


 it should stop the searching
 process; but there is the same matching problem like above.

HTH,
John
-- 
http://mail.python.org/mailman/listinfo/python-list


newb: Simple regex problem headache

2007-09-21 Thread crybaby
import re

s1 = '&nbsp;25000&nbsp;'
s2 = '&nbsp;5.5910&nbsp;'

mypat = re.compile('[0-9]*(\.[0-9]*|$)')
rate= mypat.search(s1)
print rate.group()

rate=mypat.search(s2)
print rate.group()
rate = mypat.search(s1)
price = float(rate.group())
print price

I get an error when it hits a whole number, that is, one in this format:
s1 = '&nbsp;25000&nbsp;'
For the whole number in s1, mypat catches an empty string.  I want it to
give me 25000.
I am getting this error:

price = float(rate.group())
ValueError: empty string for float()

Does anyone know how I can get 25000 out of s1 = '&nbsp;25000&nbsp;'
using the regex pattern mypat = re.compile('[0-9]*(\.[0-9]*|$)')?  mypat
works fine for real numbers, but doesn't work for whole numbers.

thanks



Re: newb: Simple regex problem headache

2007-09-21 Thread chris . monsanto
Your pattern matches the empty string a bit too well, if you know what
I mean!

Changing the regex slightly to '[0-9]+(\.[0-9]+)?' yields the results
you want:

25000
5.5910
25000.0
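
For anyone wondering where the empty string comes from, here is a short Python 3 sketch. It assumes the input strings carry HTML &nbsp; entities around the number; the two patterns are the ones from this thread.

```python
import re

s1 = "&nbsp;25000&nbsp;"
s2 = "&nbsp;5.5910&nbsp;"

old_pat = re.compile(r"[0-9]*(\.[0-9]*|$)")   # pattern from the question
new_pat = re.compile(r"[0-9]+(\.[0-9]+)?")    # pattern from this reply

# The old pattern needs either a '.' or the end of the string after the
# optional digits. In s1 the digits are followed by '&', so the only
# place it can succeed is at the very end, with zero digits.
old_hit = old_pat.search(s1)
print(repr(old_hit.group()), old_hit.start() == len(s1))   # '' True

# The new pattern insists on at least one digit and makes the
# fractional part optional.
whole = new_pat.search(s1).group()
real = new_pat.search(s2).group()
print(whole, real, float(whole))   # 25000 5.5910 25000.0
```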




Re: newb: Simple regex problem headache

2007-09-21 Thread Ian Clark
Try this:
>>> import re
>>> s1 = '&nbsp;25000&nbsp;'
>>> s2 = '&nbsp;5.5910&nbsp;'
>>> num_pat = re.compile(r'([0-9]+(\.[0-9]+)?)')
>>> num_pat.search(s1).group(1)
'25000'
>>> num_pat.search(s2).group(1)
'5.5910'

Ian



Re: newb: Simple regex problem headache

2007-09-21 Thread Erik Jones

I'm not sure what you just said makes a lot of sense, but if all you're
looking for is a regex that will match number strings with or without
a decimal point, try '\d*\.?\d*'
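
One caveat when that pattern is used with search(): every part of it is optional, so it can succeed without consuming anything. The Python 3 sketch below shows this and one possible variant; the variant and the &nbsp;-wrapped sample strings are assumptions for illustration, not part of the suggestion above.

```python
import re

s1 = "&nbsp;25000&nbsp;"
s2 = "&nbsp;5.5910&nbsp;"

all_optional = re.compile(r"\d*\.?\d*")   # pattern suggested above
needs_digit = re.compile(r"\d+\.?\d*")    # assumed variant: at least one leading digit

# search() stops at the first position where the pattern can match,
# and an all-optional pattern can match the empty string at position 0.
empty_hit = all_optional.search(s1)
print(repr(empty_hit.group()), empty_hit.start())   # '' 0

print(needs_digit.search(s1).group())   # 25000
print(needs_digit.search(s2).group())   # 5.5910
```

Note that the variant would also accept a trailing dot, as in "5."; whether that matters depends on the data.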

Erik Jones



regex problem

2006-11-22 Thread km

Hi all,

The line I am trying to match is
1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra29.90.00011   1

regex i have written is
re.compile
(r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}\s{3}(\d+?\.\d)\s+?(\d\.\d+?)')

I am trying to extract 0.0011 value from the above line.
why doesnt it match the group(4) item of the match ?

any idea whats wrong  with it ?

regards,
KM

Re: regex problem

2006-11-22 Thread Tim Chase
 line is am trying to match is
 1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra29.90.00011   1
 
 regex i have written is
 re.compile
 (r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}\s{3}(\d+?\.\d)\s+?(\d\.\d+?)')
 
 I am trying to extract 0.0011 value from the above line.
 why doesnt it match the group(4) item of the match ?
 
 any idea whats wrong  with it ?

Well, your ".{25}\s{3}" portion only gets you to one space short
of your "29.9", so your "(\d+..." fails to match " 29.9" because
there's an extra space there.  My guess (from only one datum, so
this could be /way/ off base) would be that you mean "\s{4}" or
possibly "\s{3,4}"

It seems like a very overconstrained regexp, but it might be just
what you need to isolate the single line (or class of line)
amongst the chaff of thousands of others of similar form.
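
A reduced Python 3 sketch of the whitespace-count problem. The sample line is made up (the real record's column widths are not reproduced here), and the patterns are cut down to the part that matters.

```python
import re

# Made-up sample: four spaces before 29.9, four more before 0.00011.
line = "outer membra    29.9    0.00011"

# Exactly three whitespace characters: one space is left over, so the
# digit group has nothing to start on and the search fails.
three_only = re.search(r"membra\s{3}(\d+\.\d)", line)

# Allowing three or four whitespace characters lets the match through.
three_or_four = re.search(r"membra\s{3,4}(\d+\.\d)\s+(\d\.\d+)", line)

# Side observation, not raised in this thread: a lazy '\d+?' at the very
# end of a pattern stops after a single digit.
lazy_tail = re.search(r"membra\s{3,4}(\d+\.\d)\s+(\d\.\d+?)", line)

print(three_only)               # None
print(three_or_four.groups())   # ('29.9', '0.00011')
print(lazy_tail.group(2))       # 0.0
```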

-tkc







Re: regex problem

2006-11-22 Thread km

HI Tim,

oof!
thats true!

thanks a lot.
Is there any tool to simplify building the regex  ?

regards,
KM


Re: regex problem

2006-11-22 Thread bearophileHUGS
  line is am trying to match is
  1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra29.90.00011   1
 
  regex i have written is
  re.compile
  (r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}\s{3}(\d+?\.\d)\s+?(\d\.\d+?)')
 
  I am trying to extract 0.0011 value from the above line.
  why doesnt it match the group(4) item of the match ?
 
  any idea whats wrong  with it ?

I am no expert on REs yet, but I suggest you use re.VERBOSE
and split your RE into parts, like this:

example = re.compile(r"""
    ^ \s*            # must start at the beginning + optional whitespaces
    ( [\[\(] )       # Group 1: opening bracket
    \s*              # optional whitespaces
    ( [-+]? \d+ )    # Group 2: first number
    \s* , \s*        # optional space + comma + optional whitespaces
    ( [-+]? \d+ )    # Group 3: second number
    \s*              # optional whitespaces
    ( [\)\]] )       # Group 4: closing bracket
    \s* $            # optional whitespaces + must end at the end
    """, flags=re.VERBOSE)

This way you can debug and maintain it much better.
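
A runnable Python 3 sketch of a verbose pattern of this kind in use. The pattern name and the test strings are made up for the demonstration.

```python
import re

# In VERBOSE mode, whitespace outside character classes is ignored and
# '#' starts a comment that runs to the end of the line.
pair = re.compile(r"""
    ^ \s*            # start of string, optional whitespace
    ( [\[\(] )       # group 1: opening bracket
    \s*
    ( [-+]? \d+ )    # group 2: first number
    \s* , \s*        # comma with optional whitespace around it
    ( [-+]? \d+ )    # group 3: second number
    \s*
    ( [\)\]] )       # group 4: closing bracket
    \s* $            # optional whitespace, end of string
    """, flags=re.VERBOSE)

m = pair.match(" [ -3, +14 ) ")
groups = m.groups()
print(groups)        # ('[', '-3', '+14', ')')

# A semicolon instead of a comma does not match.
no_match = pair.match("[1; 2]")
print(no_match)      # None
```

When a match fails unexpectedly, commenting out one line of the pattern at a time is a quick way to find the part that is too strict.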

Bye,
bearophile



Re: regex problem

2005-07-27 Thread Odd-R.
On 2005-07-26, Duncan Booth [EMAIL PROTECTED] wrote:
 rx1=re.compile(r"\b\d{4}(?:-\d{4})?,")
 rx1.findall("1234,-,4567,")
 ['1234,', '-,', '4567,']

Thanks all for good advice. However this last expression
also matches the first four digits when the input is more
than four digits. To resolve this problem, I first do a 
match of this,

regex=re.compile(r"\A(\b\d{4},|\d{4}-\d{4},)*(\b\d{4}|\d{4}-\d{4})\Z")

If this turns out ok, I do a find all with your expression, and then I get
the desired result.


-- 
If you have a fridge, if you have a TV,
then you have all you need to live

-Jokke & Valentinerne


regex problem

2005-07-26 Thread Odd-R.
Input is a string of four digit sequences, possibly
separated by a -, for instance like this

1234,-,4567,

My regular expression is like this:

rx1=re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")

When running rx1.findall("1234,-,4567,")

I only get the last match as the result. Isn't
findall supposed to return all the matches?

Thanks in advance.




Re: regex problem

2005-07-26 Thread Thomas Guettler
On Tue, 26 Jul 2005 09:57:23 +, Odd-R. wrote:

 Input is a string of four digit sequences, possibly
 separated by a -, for instance like this
 
 1234,-,4567,
 
 My regular expression is like this:
 
 rx1=re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")

Hi,

try it without \A and \Z

import re
rx1=re.compile(r"(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)")
print rx1.findall("1234,-,4567,")
# --> ['1234,', '-,', '4567,']

 Thomas

-- 
Thomas Güttler, http://www.thomas-guettler.de/




Re: regex problem

2005-07-26 Thread John Machin
Odd-R. wrote:
 Input is a string of four digit sequences, possibly
 separated by a -, for instance like this
 
 1234,-,4567,
 
 My regular expression is like this:
 
 rx1=re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")
 
 When running rx1.findall("1234,-,4567,")
 
 I only get the last match as the result. Isn't
 findall suppose to return all the matches?

For a start, an expression that starts with \A and ends with \Z will 
match the whole string (or not match at all). You have only one match.

Secondly, as you have a group in your expression, findall returns what
the group matches. Your expression matches zero or more of what your
group matches, provided there is nothing else at the start/end of the
string. The "zero or more" makes the re engine waltz about a bit; when
the music stopped, the group was matching "4567,".

Thirdly, findall should be thought of as merely a wrapper around a loop 
using the search method -- it finds all non-overlapping matches of a 
pattern. So the clue to get from this is that you need a really simple 
pattern, like the following. You *don't* have to write an expression 
that does the looping.

So here's the mean lean no-flab version -- you don't even need the 
parentheses (sorry, Thomas).

>>> rx1=re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
>>> rx1.findall("1234,-,4567,")
['1234,', '-,', '4567,']
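
The three points above can be checked with a short Python 3 sketch. The input string is an assumption made up for the demonstration (a four-digit item, a dash-separated pair, another four-digit item); the two patterns are the ones discussed in this thread.

```python
import re

text = "1234,5555-6666,4567,"   # hypothetical input

# Anchored at both ends with a repeated group: there is only one match
# (the whole string), and findall reports what the group held last.
anchored = re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")

# No anchors, no group: findall does the looping and returns each
# non-overlapping match in full.
simple = re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")

anchored_result = anchored.findall(text)
simple_result = simple.findall(text)

print(anchored_result)   # ['4567,']
print(simple_result)     # ['1234,', '5555-6666,', '4567,']
```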

HTH,
John


Re: regex problem

2005-07-26 Thread Duncan Booth
John Machin wrote:

 So here's the mean lean no-flab version -- you don't even need the 
 parentheses (sorry, Thomas).
 
  rx1=re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
  rx1.findall("1234,-,4567,")
 ['1234,', '-,', '4567,']

No flab? What about all that repetition of \d? A less flabby version:

>>> rx1=re.compile(r"\b\d{4}(?:-\d{4})?,")
>>> rx1.findall("1234,-,4567,")
['1234,', '-,', '4567,']
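
A quick Python 3 check that the compact form behaves like the long one, and why the group is written as non-capturing. The input string is made up for the demonstration.

```python
import re

text = "1234,5555-6666,4567,"   # hypothetical input

long_form = re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
compact = re.compile(r"\b\d{4}(?:-\d{4})?,")

# Same pattern with a capturing group, for comparison only.
capturing = re.compile(r"\b\d{4}(-\d{4})?,")

same = long_form.findall(text) == compact.findall(text)
print(same)                      # True
print(compact.findall(text))     # ['1234,', '5555-6666,', '4567,']

# With a capturing group, findall returns the group instead of the
# whole match, and an empty string where the group did not take part.
print(capturing.findall(text))   # ['', '-6666', '']
```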




Re: regex problem

2005-07-26 Thread John Machin
Duncan Booth wrote:
 John Machin wrote:
 
  So here's the mean lean no-flab version -- you don't even need the
  parentheses (sorry, Thomas).
 
  rx1=re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
  rx1.findall("1234,-,4567,")
  ['1234,', '-,', '4567,']
 
 No flab? What about all that repetition of \d? A less flabby version:
 
 rx1=re.compile(r"\b\d{4}(?:-\d{4})?,")
 rx1.findall("1234,-,4567,")
 ['1234,', '-,', '4567,']


OK, good idea to factor out the prefix and follow it by an optional "-1234".
However, optimising re engines do common prefix factoring, *and* they
rewrite stuff like x{4} as xxxx.
Cheers,
John