Re: [Tutor] regular expression query

2019-06-09 Thread Cameron Simpson

On 08Jun2019 22:27, Sean Murphy  wrote:

Windows 10 OS, Python 3.6


Thanks for this.

I have a couple of  queries  in relation to extracting content using 
regular expressions. I understand [...the regexp syntax...]

The challenge I am finding is getting a pattern to
extract specific word(s). Trying to identify the best method to use and how
to use the \1 when using forward and backward search pattern (Hoping I am
using the right term). Basically I am trying to extract specific phrases or
digits to place in a dictionary within categories. Thus if "ROYaL_BANK
123123123" is found, it is placed in a category called transfer funds. Other
might be a store name which likewise is placed in the store category.


I'll tackle your specific examples lower down, and make some 
suggestions.


Note, I have found a logic error with "ROYAL_BANK 123123123", but that 
isn't a concern. The extraction of the text is.


Line examples:
Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299
Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299
PAYMENT TO SARWARS-123123123
ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}
EFTPOS Amazon
PAY/SALARY FROM foo bar 123123123
PAYMENT TO Tax Man  666


Thanks.

Assuming the below is a cut/paste accident from some code:

 result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ', 
'ROYAL_BANK ', line)
 r'ROYAL_BANK INTERNET BANKING TRANSFER Mouth in foot


And other similar structures. Below is the function I am currently using.
Not sure if the sub, match or search is going to be the best method. The
reason why I am using a sub is to delete the unwanted text. The
searchmatch/findall  could do the same if I use a group. Also I have not
used any tests in the below and logically I think I should. As the code will
override the results if not found in the later tests. If there is a more
elegant  way to do it then having:

if line.startswith('text string to match'):
   regular expression
elif line.startswith('text string to match'):
   regular expression
return result


There is. How far you take it depends on how variable your input is.
Banking statement data I would expect to have relatively few formats
(unless the banking/finance industry is every bit as fragmented as I
sometimes believe, in which case the structure might be less driven by
_your_ bank and instead arbitrarily garbled according to the various
other entities due to getting ad hoc junk as the description).


I would like to know. The different regular expressions I have used 
are:


# this sometimes matches and sometimes does not. I want all the text up to
the from or to, to be replaced with "ROYAL_BANK". Ending up with ROYAL_BANK
123123123

   result= re.sub(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ',
'ROYAL_BANK ', line)


Looks superficially ok. Got an example input line where it fails? Note
that the above is case sensitive, so if "to" etc can be in lower case
(as in your example text earlier) this will fail. See the re.I modifier.
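
A minimal sketch of the re.I suggestion, applied to two of the sample lines quoted earlier in this message. The helper name strip_transfer_prefix is made up for illustration; the pattern is the one from the message with the flag added.

```python
import re

# Pattern from the message above, compiled with re.I so that "to"/"TO"
# and "Royal_bank"/"ROYAL_BANK" all match.
TRANSFER = re.compile(
    r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ', re.I)


def strip_transfer_prefix(line):
    # Replace everything up to and including TO/FROM with "ROYAL_BANK ".
    return TRANSFER.sub('ROYAL_BANK ', line)


print(strip_transfer_prefix(
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299'))
print(strip_transfer_prefix(
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299'))
```

Without re.I neither sample line would be changed, because "Royal_bank" and "to" differ in case from the pattern.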



# the below  returns from STARWARS and it shouldn't. I should just get
STARWARS.

   result = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)


Well, STARWARS seems misseplt above. And you should get a "match" 
object, with "STARWARS" in .group(1).


So earlier you're getting a str in result, and here you're getting an 
re.match object (or None for a failed match).
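
A small sketch of pulling the captured text out of the match object, using the sample line from earlier. Two things here are assumptions rather than part of the original code: the group is widened to \w+ so the payee name is not hard-coded, and the trailing space after \d+ is dropped, since a pattern ending in a space cannot match a line that ends right after the digits.

```python
import re

line = 'PAYMENT TO SARWARS-123123123'

# No trailing space after \d+ here: the sample line ends with the digits.
m = re.match(r'PAYMENT TO (\w+)-\d+', line)
if m:
    payee = m.group(1)   # the text captured by the parentheses
else:
    payee = None         # re.match returns None when nothing matches
print(payee)
```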


# the below should (doesn't work the last time I tested it) should 
return the words between the (.)


   result = re.match(r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$', '\1', 
line)


"should" what? It would help to see the input line you expect this to 
match. And re.match is not an re.sub - it looks like you have these 
confused here, based on the following '\1', line parameters.
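
To make the difference concrete, here is a sketch of both calls on the BPAY sample line from earlier. Note the r prefix on the replacement: without it, '\1' in a normal Python string is the control character chr(1), not a reference to group 1.

```python
import re

line = 'ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}'
pattern = r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$'

# re.sub takes (pattern, replacement, string); r'\1' is the first group.
via_sub = re.sub(pattern, r'\1', line)

# re.match takes (pattern, string) and returns a match object or None.
m = re.match(pattern, line)
via_match = m.group(1) if m else None

print(via_sub)
print(via_match)
```

Both approaches give the store name here; the match form also tells you when the line did not match at all, whereas re.sub silently returns the line unchanged.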



# the below patterns should remove the text at the beginning of the string
   result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ', 
'ROYAL_BANK ', line)
   result = re.sub(r'ROYAL_BANK INTERNET BANKING TRANSFER ', '', line)
   result = re.sub(r'EFTPOS ', '', line)


Sure. Got an example line where this does not happen?

# The below does not work and I am trying to use the back or forward 
search feature. Is this syntax wrong or the pattern wrong? I cannot work it out

from the information I have read.

result = re.sub(r'PAY/SALARY FROM (*.) \d+$', '\1', line)
   result = re.sub(r'PAYMENT TO (*.) \d+', '\1', line)


You've got "*." You probably mean ".*"
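
A quick sketch of what happens with each spelling. The corrected pattern below also uses a lazy group, \s+ and a $ anchor; those are my additions to cope with the double space in the "Tax Man" sample line, not something from the original code.

```python
import re

# "(*.)" asks to repeat nothing, so the pattern does not even compile.
try:
    re.compile(r'PAYMENT TO (*.) \d+')
except re.error as exc:
    error = str(exc)
else:
    error = None
print(error)

# ".*" (here the lazy ".*?") is the intended "any run of characters".
fixed = re.sub(r'PAYMENT TO (.*?)\s+\d+$', r'\1', 'PAYMENT TO Tax Man  666')
print(fixed)
```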

Main issues:

1: Your input data seems to be mixed case, but all your regexps are case
sensitive. They will not match if the case is different eg "Royal_Bank"
vs "ROYAL_BANK", "to" vs "TO", etc. Use the re.I modifier to make your
regexps case insensitive.


2: You're using re.sub a lot. I'd be inclined to always use re.match and 
to pull information from the match object you get back. Untested example 
sketch:


 m = re.match('(ROYAL_BANK|COMMONER_CREDIT_UNION) INTERNET BANKING FUNDS TFER 
TRANSFER (\d+) TO 
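
The example above is cut off in the archive. As a separate, hypothetical illustration of the match-and-extract idea from point 2 (not a reconstruction of the missing code), one could keep a table of (category, pattern) pairs and take the first pattern that matches. The category names and the exact patterns below are guesses based on the sample lines.

```python
import re

# Hypothetical rule table: (category, compiled pattern). The category
# names and patterns are illustrative guesses based on the sample lines.
RULES = [
    ('transfer funds', re.compile(
        r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (?:TO|FROM) (\d+)', re.I)),
    ('store', re.compile(
        r'ROYAL_BANK INTERNET BANKING BPAY (.*?) \{\d+\}', re.I)),
    ('store', re.compile(r'EFTPOS (.+)', re.I)),
    ('salary', re.compile(r'PAY/SALARY FROM (.*?)\s+\d+$', re.I)),
]


def categorise(line):
    # Return (category, extracted text) for the first rule that matches,
    # or (None, None) if no rule applies.
    for category, pattern in RULES:
        m = pattern.match(line)
        if m:
            return category, m.group(1)
    return None, None


results = {}
for sample in [
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299',
    'ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}',
    'EFTPOS Amazon',
    'PAY/SALARY FROM foo bar 123123123',
]:
    category, value = categorise(sample)
    results.setdefault(category, []).append(value)
print(results)
```

Because the first matching rule wins, later rules never overwrite an earlier result, which avoids the override problem mentioned in the question.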

Re: [Tutor] (regular expression)

2016-12-10 Thread Martin A. Brown

Hello Isaac,

This second posting you have made has provided more information 
about what you are trying to accomplish and how (and also was 
readable, where the first one looked like it got mangled by your 
mail user agent; it's best to try to post only plain text messages 
to this sort of mailing list).

I suspect that we can help you a bit more, now.

If we knew even more about what you were looking to do, we might be 
able to help you further (with all of the usual remarks about how we 
won't do your homework for you, but all of us volunteers will gladly 
help you understand the tools, the systems, the world of Python and 
anything else we can suggest in the realm of computers, computer 
science and problem solving).

I will credit the person who assigned this task for you, as this is 
not dissimilar from the sort of problem that one often has when 
facing a new practical computing problem.  Often (and in your case) 
there is opaque structure and hidden assumptions in the question 
which need to be understood.  See further below

These were your four lines of code:

>with urllib.request.urlopen("https://www.sdstate.edu/electrical-engineering-and-computer-science") as cs:
>    cs_page = cs.read()
>    soup = BeautifulSoup(cs_page, "html.parser")
>    print(len(soup.body.find_all(string = ["Engineering","engineering"])))

The fourth line is an impressive attempt at compressing all of the 
searching, finding, counting and reporting steps into a single line.  

Your task (I think), is more complicated than that single line can 
express.  So, that will need to be expanded to a few more lines of 
code.

You may have heard these aphorisms before:

  * brevity is the soul of wit
  * fewer lines of code are better
  * prefer a short elegant solution

But, when complexity intrudes into brevity, the human mind 
struggles.  As a practitioner, I will say that I spend more of my 
time reading and understanding code than writing it, so writing 
simple, self-contained and understandable units of code leads to 
intelligibility for humans and composability for systems.

Try this at a Python console [1].

  import this

>i used control + f on the link in the code and i get 11 for ctrl + 
>f and 3 for the code

Applause!  Look at the raw data!  Study the raw data!  That is an 
excellent way to start to try to understand the raw data.  You must 
always go back to the raw input data and then consider whether your 
tooling or the data model in your program matches what you are 
trying to extract/compute/transform.

The answer (for number of occurrences of the word 'engineering', 
case-insensitive) that I get is close to your answer when searching 
with control + f, but is a bit larger than 11.

Anyway, here are my thoughts.  I will start with some tips that are 
relevant to your 4-line pasted program:

  * BeautifulSoup is wonderfully convenient, but also remember it 
is another high-level tool; it is often forgiving where other 
tools are more rigorous, however it is excellent for learning 
and (I hope you see below) that it is a great tool for the 
problem you are trying to solve

  * in your code, soup.body is a handle that points to the <body>
tag of the HTML document you have fetched; so why can't you
simply find_all of the strings "Engineering" and "engineering"
in the text and count them?

  - find_all is a method that returns all of the tags in the
structured document below (in this case) soup.body

  - your intent is not to count tags with the string
'engineering' but rather, you are looking for that string
in the text (I think)

  * it is almost always a mistake to try to process HTML with 
regular expressions, however, it seems that you are trying to 
find all matches of the (case-insensitive) word 'engineering' in 
the text of this document; that is something tailor-made for 
regular expressions, so there's the Python regular expression 
library, too:  'import re'

  * and on a minor note, since you are using urllib.request.urlopen()
in a with statement (using contexts this way is wonderful), you
could collect the data from the network socket, then drop out of
the 'with' block to allow the context to close, so if your block
worked as you wanted, you could adjust it as follows:

  with urllib.request.urlopen(uri) as cs:
      cs_page = cs.read()
  soup = BeautifulSoup(cs_page, "html.parser")
  print(len(soup.body.find_all(string = ["Engineering","engineering"])))

  * On a much more minor point, I'll mention that urllib / urllib2 
are available with the main Python releases but there are other 
libraries for handling fetching; I often recommend the 
third-party requests [0] library, as it is both very Pythonic, 
reasonably high-level and frightfully flexible
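
The regular-expression tip in the list above can be sketched as follows. To keep the example self-contained it counts matches in a plain string; with BeautifulSoup that string might come from something like soup.body.get_text(), which is my assumption and not part of the original four lines. As far as I can tell, find_all(string=[...]) only returns text nodes whose entire content equals one of the listed strings, which would explain a count as low as 3.

```python
import re


def count_word(text, word):
    # Count case-insensitive occurrences of `word` anywhere in `text`,
    # including inside longer words.
    return len(re.findall(re.escape(word), text, flags=re.IGNORECASE))


# Stand-in for the page text (made-up sample, not the real page).
sample = ("Electrical Engineering and Computer Science. "
          "Our engineering programs prepare ENGINEERING students.")
print(count_word(sample, "engineering"))
```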

So, connecting the Zen of Python [1] to your problem, I would 
suggest making shorter, simpler lines and separating the 

Re: [Tutor] (regular expression)

2016-12-10 Thread isaac tetteh
this is the real code


with urllib.request.urlopen("https://www.sdstate.edu/electrical-engineering-and-computer-science") as cs:
    cs_page = cs.read()
    soup = BeautifulSoup(cs_page, "html.parser")
    print(len(soup.body.find_all(string = ["Engineering","engineering"])))

i used control + f on the link in the code and i get 11 for ctrl + f and 3 for
the code

Thanks





From: Tutor  on behalf of Bob 
Gailer 
Sent: Saturday, December 10, 2016 7:54 PM
To: Tetteh, Isaac - SDSU Student
Cc: Python Tutor
Subject: Re: [Tutor] (no subject)

On Dec 10, 2016 12:15 PM, "Tetteh, Isaac - SDSU Student" <
isaac.tet...@jacks.sdstate.edu> wrote:
>
> Hello,
>
> I am trying to find the number of times a word occurs on a webpage so I
used bs4 code below
>
> Let assume html contains the "html code"
> soup = BeautifulSoup(html, "html.parser")
> print(len(soup.find_all(string=["Engineering","engineering"])))
> But the result is different from when i use control + f on my keyboard to
find
>
> Please help me understand why it's different results. Thanks
> I am using Python 3.5
>
What is the URL of the web page?
To what are you applying control-f?
What are the two different counts you're getting?
Is it possible that the page is being dynamically altered after it's loaded?
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor





Re: [Tutor] Regular expression on python

2015-04-15 Thread Alan Gauld

On 15/04/15 09:24, Peter Otten wrote:


function call. I've never seen (or noticed?) the embedded form,
and don't see it described in the docs anywhere


Quoting https://docs.python.org/dev/library/re.html:


(?aiLmsux)
(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The
group matches the empty string; the letters set the corresponding flags:


Aha. The trick is knowing the correct search string... I tried 'flag' 
and 'verbose' but missed this entry.



Again, where is that described?



(?#...)
A comment; the contents of the parentheses are simply ignored.



OK, I missed that too.
Maybe I just wasn't awake enough this morning! :-)

Thanks Peter.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] Regular expression on python

2015-04-15 Thread Alex Kleider

On 2015-04-14 16:49, Alan Gauld wrote:


New one on me. Where does one find out about verbose mode?
I don't see it in the re docs?


This is where I go whenever I find myself having to (re)learn the 
details of regex:

https://docs.python.org/3/howto/regex.html

I believe a '2' can be substituted for the '3' but I've not found any 
difference between the two.


(I submit this not so much for Alan (tutor) as for those like me who are 
learning.)




Re: [Tutor] Regular expression on python

2015-04-15 Thread Alan Gauld

On 15/04/15 02:02, Steven D'Aprano wrote:

New one on me. Where does one find out about verbose mode?
I don't see it in the re docs?




or embed the flag in the pattern. The flags that I know of are:

(?x) re.X re.VERBOSE

The flag can appear anywhere in the pattern and applies to the whole
pattern, but it is good practice to put them at the front, and in the
future it may be an error to put the flags elsewhere.


I've always applied flags as separate params at the end of the
function call. I've never seen (or noticed?) the embedded form,
and don't see it described in the docs anywhere (although it
probably is). But the re module descriptions of the flags only give the
re.X/re.VERBOSE options, no mention of the embedded form.

Maybe you are just supposed to infer the (?x) form from the re.X...

However, that still doesn't explain the difference in your comment
syntax.

The docs say the verbose syntax looks like:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

Whereas your syntax is like:

a = re.compile(r"""(?x)  (?# turn on verbose mode)
                   \d +  (?# the integral part)
                   \.    (?# the decimal point)
                   \d *  (?# some fractional digits)""")

Again, where is that described?

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] Regular expression on python

2015-04-15 Thread Albert-Jan Roskam

On Tue, 4/14/15, Peter Otten __pete...@web.de wrote:

 Subject: Re: [Tutor] Regular expression on python
 To: tutor@python.org
 Date: Tuesday, April 14, 2015, 4:37 PM
 
 Steven D'Aprano wrote:
 
  On Tue, Apr 14, 2015 at 10:00:47AM +0200, Peter Otten wrote:
  Steven D'Aprano wrote:
  
   I swear that Perl has been a blight on an entire generation of
   programmers. All they know is regular expressions, so they turn every
   data processing problem into a regular expression. Or at least they
   *try* to. As you have learned, regular expressions are hard to read,
   hard to write, and hard to get correct.
   
   Let's write some Python code instead.
  [...]
  
  The tempter took possession of me and dictated:
  
  >>> pprint.pprint(
  ... [(k, int(v)) for k, v in
  ... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
  [('Input Read Pairs', 2127436),
   ('Both Surviving', 1795091),
   ('Forward Only Surviving', 17315),
   ('Reverse Only Surviving', 6413),
   ('Dropped', 308617)]
  
  Nicely done :-)
  


Yes, nice, but why do you use 
re.compile(regex).findall(line) 
and not
re.findall(regex, line)

I know what re.compile is for. I often use it outside a loop and then actually
use the compiled regex inside a loop, I just haven't seen the way you use it
before.
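
For readers comparing the two spellings, a minimal sketch using the regex from this thread. The name PAIR and the two sample lines are chosen here for illustration only.

```python
import re

# Compile once, outside the loop...
PAIR = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*")

lines = [
    "Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%)",
    "Dropped: 308617 (14.51%)",
]

# ...and reuse the compiled pattern for every line.
for line in lines:
    print(PAIR.findall(line))

# The module-level shortcut returns the same matches for the same pattern.
same = [re.findall(PAIR.pattern, line) == PAIR.findall(line) for line in lines]
print(same)
```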



  I didn't say that it *couldn't* be done with a regex. 
 
 I didn't claim that.
 
  Only that it is
  harder to read, write, etc. Regexes are good tools, but they aren't the
  only tool and as a beginner, which would you rather debug? The extract()
  function I wrote, or r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*" ?
 
 I know a rhetorical question when I see one ;)
 
  Oh, and for the record, your solution is roughly 4-5 times faster than
  the extract() function on my computer. 
 
 I wouldn't be bothered by that. See below if you are.
 
  If I knew the requirements were
  not likely to change (that is, the maintenance burden was likely to be
  low), I'd be quite happy to use your regex solution in production code,
  although I would probably want to write it out in verbose mode just in
  case the requirements did change:
  
  
  r"""(?x)    (?# verbose mode)

personally, I prefer to be verbose about being verbose, ie use the re.VERBOSE 
flag. But perhaps that's just a matter of taste. Are there any use cases when 
the ?iLmsux operators are clearly a better choice than the equivalent flag? For 
me, the mental burden of a regex is big enough already without these operators. 


      (.+?):  (?# capture one or more character, followed by a colon)
      \s+     (?# one or more whitespace)
      (\d+)   (?# capture one or more digits)
      (?:     (?# don't capture ... )
        \s+   (?# one or more whitespace)
        \(.*?\)   (?# anything inside round brackets)
        )?    (?# ... and optional)
      \s*     (?# ignore trailing spaces)
      """

snip


Re: [Tutor] Regular expression on python

2015-04-15 Thread Peter Otten
Alan Gauld wrote:

 On 15/04/15 02:02, Steven D'Aprano wrote:
 New one on me. Where does one find out about verbose mode?
 I don't see it in the re docs?

 
 or embed the flag in the pattern. The flags that I know of are:

 (?x) re.X re.VERBOSE

 The flag can appear anywhere in the pattern and applies to the whole
 pattern, but it is good practice to put them at the front, and in the
 future it may be an error to put the flags elsewhere.
 
 I've always applied flags as separate params at the end of the
 function call. I've never seen (or noticed?) the embedded form,
 and don't see it described in the docs anywhere (although it
 probably is). 

Quoting https://docs.python.org/dev/library/re.html:


(?aiLmsux)
(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The 
group matches the empty string; the letters set the corresponding flags: 
re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), 
re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the 
entire regular expression. (The flags are described in Module Contents.) 
This is useful if you wish to include the flags as part of the regular 
expression, instead of passing a flag argument to the re.compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be 
used first in the expression string, or after one or more whitespace 
characters. If there are non-whitespace characters before the flag, the 
results are undefined.


 But the re module descriptions of the flags only give the
 re.X/re.VERBOSE options, no mention of the embedded form.
 Maybe you are just supposed to infer the (?x) form from the re.X...
 
 However, that still doesn't explain the difference in your comment
 syntax.
 
 The docs say the verbose syntax looks like:
 
 a = re.compile(r"""\d +  # the integral part
                    \.    # the decimal point
                    \d *  # some fractional digits""", re.X)
 
 Whereas your syntax is like:
 
 a = re.compile(r"""(?x)  (?# turn on verbose mode)
                    \d +  (?# the integral part)
                    \.    (?# the decimal point)
                    \d *  (?# some fractional digits)""")
 
 Again, where is that described?


(?#...)
A comment; the contents of the parentheses are simply ignored.


Let's try it out:

>>> re.compile(r"\d+(?# sequence of digits)").findall("alpha 123 beta 456")
['123', '456']
>>> re.compile(r"\d+# sequence of digits").findall("alpha 123 beta 456")
[]
>>> re.compile(r"\d+# sequence of digits", re.VERBOSE).findall("alpha 123 beta 456")
['123', '456']

So (?#...)-style comments work in non-verbose mode, too, and Steven is 
wearing belt and braces (almost, the verbose flag is still necessary to 
ignore the extra whitespace).



Re: [Tutor] Regular expression on python

2015-04-15 Thread Peter Otten
Albert-Jan Roskam wrote:

 On Tue, 4/14/15, Peter Otten __pete...@web.de wrote:

 >>> pprint.pprint(
 ... [(k, int(v)) for k, v in
 ... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
 [('Input Read Pairs', 2127436),
  ('Both Surviving', 1795091),
  ('Forward Only Surviving', 17315),
  ('Reverse Only Surviving', 6413),
  ('Dropped', 308617)]

 Yes, nice, but why do you use
 re.compile(regex).findall(line)
 and not
 re.findall(regex, line)
 
 I know what re.compile is for. I often use it outside a loop and then
 actually use the compiled regex inside a loop, I just haven't seen the way
 you use it before.

What you describe here is how I use regular expressions most of the time.
Also, re.compile() behaves the same over different Python versions while the 
shortcuts for the pattern methods changed signature over time. 
Finally, some have a gotcha. Compare:

>>> re.compile("a", re.IGNORECASE).sub("b", "aAAaa")
'bbbbb'
>>> re.sub("a", "b", "aAAaa", re.IGNORECASE)
'bAAba'

Did you expect that? Congrats for thorough reading of the docs ;)
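
For anyone who did not expect it, a short sketch of why the two results differ: the fourth positional parameter of re.sub is count, and re.IGNORECASE happens to have the integer value 2. Passing the flag by keyword avoids the trap. Newer Python versions may emit a warning for the positional form.

```python
import re

# Fourth positional argument of re.sub is `count`, not `flags`;
# re.IGNORECASE equals 2, so this replaces at most 2 matches,
# and does so case-sensitively.
as_count = re.sub("a", "b", "aAAaa", re.IGNORECASE)

# Passing the flag by keyword does what was intended.
as_flag = re.sub("a", "b", "aAAaa", flags=re.IGNORECASE)

print(as_count)
print(as_flag)
```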

 personally, I prefer to be verbose about being verbose, ie use the
 re.VERBOSE flag. But perhaps that's just a matter of taste. Are there any
 use cases when the ?iLmsux operators are clearly a better choice than the
 equivalent flag? For me, the mental burden of a regex is big enough
 already without these operators. 

I pass flags separately myself, but

>>> re.sub("(?i)a", "b", "aAAaa")
'bbbbb'

might serve as an argument for inlined flags.



Re: [Tutor] Regular expression on python

2015-04-14 Thread Peter Otten
Steven D'Aprano wrote:

 On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod...@libero.it wrote:
 Dear all.
 I would like to extract from some file some data.
 The line I'm interested is this:
 
 Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
 Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
 Dropped: 308617 (14.51%)
 
 
 Some people, when confronted with a problem, think "I know, I'll
 use regular expressions." Now they have two problems.
 -- Jamie Zawinski
 
 I swear that Perl has been a blight on an entire generation of
 programmers. All they know is regular expressions, so they turn every
 data processing problem into a regular expression. Or at least they
 *try* to. As you have learned, regular expressions are hard to read,
 hard to write, and hard to get correct.
 
 Let's write some Python code instead.
 
 
 def extract(line):
     # Extract key:number values from the string.
     line = line.strip()  # Remove leading and trailing whitespace.
     words = line.split()
     accumulator = []  # Collect parts of the string we care about.
     for word in words:
         if word.startswith('(') and word.endswith('%)'):
             # We don't care about percentages in brackets.
             continue
         try:
             n = int(word)
         except ValueError:
             accumulator.append(word)
         else:
             accumulator.append(n)
     # Now accumulator will be a list of strings and ints:
     # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
     # Collect consecutive strings as the key, int to be the value.
     results = {}
     keyparts = []
     for item in accumulator:
         if isinstance(item, int):
             key = ' '.join(keyparts)
             keyparts = []
             if key.endswith(':'):
                 key = key[:-1]
             results[key] = item
         else:
             keyparts.append(item)
     # When we have finished processing, the keyparts list should be empty.
     if keyparts:
         extra = ' '.join(keyparts)
         print('Warning: found extra text at end of line %s.' % extra)
     return results
 
 
 
 Now let me test it:
 
 py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
 ...         ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
 ...         ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
 ...         ' 308617 (14.51%)\n')
 py>
 py> print(line)
 Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
 Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
 Dropped: 308617 (14.51%)
 
 py> extract(line)
 {'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving':
 6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}
 
 
 Remember that dicts are unordered. All the data is there, but in
 arbitrary order. Now that you have a nice function to extract the data,
 you can apply it to the lines of a data file in a simple loop:
 
 with open("255.trim.log") as p:
     for line in p:
         if line.startswith("Input "):
             d = extract(line)
             print(d)  # or process it somehow

The tempter took possession of me and dictated:

>>> pprint.pprint(
... [(k, int(v)) for k, v in
... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
[('Input Read Pairs', 2127436),
 ('Both Surviving', 1795091),
 ('Forward Only Surviving', 17315),
 ('Reverse Only Surviving', 6413),
 ('Dropped', 308617)]




Re: [Tutor] Regular expression on python

2015-04-14 Thread Steven D'Aprano
On Tue, Apr 14, 2015 at 10:00:47AM +0200, Peter Otten wrote:
 Steven D'Aprano wrote:

  I swear that Perl has been a blight on an entire generation of
  programmers. All they know is regular expressions, so they turn every
  data processing problem into a regular expression. Or at least they
  *try* to. As you have learned, regular expressions are hard to read,
  hard to write, and hard to get correct.
  
  Let's write some Python code instead.
[...]

 The tempter took possession of me and dictated:
 
 >>> pprint.pprint(
 ... [(k, int(v)) for k, v in
 ... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
 [('Input Read Pairs', 2127436),
  ('Both Surviving', 1795091),
  ('Forward Only Surviving', 17315),
  ('Reverse Only Surviving', 6413),
  ('Dropped', 308617)]

Nicely done :-)

I didn't say that it *couldn't* be done with a regex. Only that it is
harder to read, write, etc. Regexes are good tools, but they aren't the
only tool and as a beginner, which would you rather debug? The extract()
function I wrote, or r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*" ?

Oh, and for the record, your solution is roughly 4-5 times faster than 
the extract() function on my computer. If I knew the requirements were 
not likely to change (that is, the maintenance burden was likely to be 
low), I'd be quite happy to use your regex solution in production code, 
although I would probably want to write it out in verbose mode just in 
case the requirements did change:


r"""(?x)    (?# verbose mode)
    (.+?):  (?# capture one or more character, followed by a colon)
    \s+     (?# one or more whitespace)
    (\d+)   (?# capture one or more digits)
    (?:     (?# don't capture ... )
      \s+   (?# one or more whitespace)
      \(.*?\)   (?# anything inside round brackets)
      )?    (?# ... and optional)
    \s*     (?# ignore trailing spaces)
    """



That's a hint to people learning regular expressions: start in verbose 
mode, then de-verbose it if you must.
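
As a runnable illustration of that hint, here is the same regex written with the re.VERBOSE flag and ordinary # comments, which is a variation on the (?x) and (?#...) style shown above rather than a copy of it. The name PAIR is made up; the input line is the one from the original question.

```python
import re

PAIR = re.compile(r"""
    (.+?):          # capture one or more characters, followed by a colon
    \s+             # one or more whitespace
    (\d+)           # capture one or more digits
    (?:             # don't capture ...
        \s+         #   one or more whitespace
        \(.*?\)     #   anything inside round brackets
    )?              # ... and optional
    \s*             # ignore trailing spaces
    """, re.VERBOSE)

line = ("Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) Dropped: 308617 (14.51%)")

pairs = [(k, int(v)) for k, v in PAIR.findall(line)]
print(pairs)
```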


-- 
Steve


Re: [Tutor] Regular expression on python

2015-04-14 Thread Peter Otten
Steven D'Aprano wrote:

 On Tue, Apr 14, 2015 at 10:00:47AM +0200, Peter Otten wrote:
 Steven D'Aprano wrote:
 
  I swear that Perl has been a blight on an entire generation of
  programmers. All they know is regular expressions, so they turn every
  data processing problem into a regular expression. Or at least they
  *try* to. As you have learned, regular expressions are hard to read,
  hard to write, and hard to get correct.
  
  Let's write some Python code instead.
 [...]
 
 The tempter took possession of me and dictated:
 
 >>> pprint.pprint(
 ... [(k, int(v)) for k, v in
 ... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
 [('Input Read Pairs', 2127436),
  ('Both Surviving', 1795091),
  ('Forward Only Surviving', 17315),
  ('Reverse Only Surviving', 6413),
  ('Dropped', 308617)]
 
 Nicely done :-)
 
 I didn't say that it *couldn't* be done with a regex. 

I didn't claim that.

 Only that it is
 harder to read, write, etc. Regexes are good tools, but they aren't the
 only tool and as a beginner, which would you rather debug? The extract()
 function I wrote, or r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*" ?

I know a rhetorical question when I see one ;)

 Oh, and for the record, your solution is roughly 4-5 times faster than
 the extract() function on my computer. 

I wouldn't be bothered by that. See below if you are.

 If I knew the requirements were
 not likely to change (that is, the maintenance burden was likely to be
 low), I'd be quite happy to use your regex solution in production code,
 although I would probably want to write it out in verbose mode just in
 case the requirements did change:
 
 
 r"""(?x)    (?# verbose mode)
     (.+?):  (?# capture one or more character, followed by a colon)
     \s+     (?# one or more whitespace)
     (\d+)   (?# capture one or more digits)
     (?:     (?# don't capture ... )
       \s+   (?# one or more whitespace)
       \(.*?\)   (?# anything inside round brackets)
       )?    (?# ... and optional)
     \s*     (?# ignore trailing spaces)
     """
 
 
 
 That's a hint to people learning regular expressions: start in verbose
 mode, then de-verbose it if you must.

Regarding the speed of the Python approach: you can easily improve that by 
relatively minor modifications. The most important one is to avoid the 
exception:

$ python parse_jarod.py
$ python3 parse_jarod.py

The regex for reference:

$ python3 -m timeit -s "from parse_jarod import extract_re as extract" "extract()"
100000 loops, best of 3: 18.6 usec per loop

Steven's original extract():

$ python3 -m timeit -s "from parse_jarod import extract_daprano as extract" "extract()"
10000 loops, best of 3: 92.6 usec per loop

Avoid raising ValueError (This won't work with negative numbers):

$ python3 -m timeit -s "from parse_jarod import extract_daprano2 as extract" "extract()"
10000 loops, best of 3: 44.3 usec per loop
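
The source of extract_daprano2 is not shown in this archive, so the following is only a guess at the kind of change meant: test the word with str.isdigit() instead of catching ValueError. The helper name to_number is hypothetical. It assumes plain ASCII input, since isdigit() also accepts some non-ASCII digit characters that int() rejects.

```python
def to_number(word):
    # str.isdigit() is False for "-5" because of the minus sign, which is
    # why this shortcut does not handle negative numbers.
    return int(word) if word.isdigit() else word


print(to_number("2127436"))
print(to_number("Pairs:"))
print(to_number("-5"))
```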

Collapse the two loops into one, thus avoiding the accumulator list and the 
isinstance() checks:

$ python3 -m timeit -s "from parse_jarod import extract_daprano3 as extract" "extract()"
10000 loops, best of 3: 29.6 usec per loop
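
Again as a hypothetical sketch rather than the actual extract_daprano3 (whose source is not shown here), collapsing the two loops could look like this: the key is assembled and stored as soon as a number turns up, so no intermediate list or isinstance() check is needed. The function name extract_single_pass is made up.

```python
def extract_single_pass(line):
    # One-loop variant: build the key and store the value as soon as a
    # number is seen, without an intermediate accumulator list.
    results = {}
    keyparts = []
    for word in line.split():
        if word.startswith('(') and word.endswith('%)'):
            continue                      # skip percentages in brackets
        if word.isdigit():
            key = ' '.join(keyparts)
            if key.endswith(':'):
                key = key[:-1]
            results[key] = int(word)
            keyparts = []
        else:
            keyparts.append(word)
    return results


sample = ("Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) "
          "Forward Only Surviving: 17315 (0.81%) "
          "Reverse Only Surviving: 6413 (0.30%) Dropped: 308617 (14.51%)")
print(extract_single_pass(sample))
```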

Ok, this is still slower than the regex, a result that I cannot accept. 
Let's try again:

$ python3 -m timeit -s "from parse_jarod import extract_py as extract" "extract()"
100000 loops, best of 3: 15.1 usec per loop

Heureka? The winning code is brittle and probably as hard to understand as 
the regex. You can judge for yourself if you're interested:

$ cat parse_jarod.py   
import re

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")
_findall = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall


def extract_daprano(line=line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found 

Re: [Tutor] Regular expression on python

2015-04-14 Thread Alan Gauld

On 14/04/15 13:21, Steven D'Aprano wrote:


although I would probably want to write it out in verbose mode just in
case the requirements did change:


r"""(?x)(?# verbose mode)
 (.+?):  (?# capture one or more character, followed by a colon)
 \s+ (?# one or more whitespace)
 (\d+)   (?# capture one or more digits)
 (?: (?# don't capture ... )
   \s+   (?# one or more whitespace)
   \(.*?\)   (?# anything inside round brackets)
   )?(?# ... and optional)
 \s* (?# ignore trailing spaces)
 """

That's a hint to people learning regular expressions: start in verbose
mode, then de-verbose it if you must.


New one on me. Where does one find out about verbose mode?
I don't see it in the re docs?

I see an re.X flag but while it seems to be similar in purpose
yet it is different to your style above (no parens for example)?

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Regular expression on python

2015-04-14 Thread Mark Lawrence

On 15/04/2015 00:49, Alan Gauld wrote:

On 14/04/15 13:21, Steven D'Aprano wrote:


although I would probably want to write it out in verbose mode just in
case the requirements did change:


r"""(?x)(?# verbose mode)
 (.+?):  (?# capture one or more character, followed by a colon)
 \s+ (?# one or more whitespace)
 (\d+)   (?# capture one or more digits)
 (?: (?# don't capture ... )
   \s+   (?# one or more whitespace)
   \(.*?\)   (?# anything inside round brackets)
   )?(?# ... and optional)
 \s* (?# ignore trailing spaces)
 """

That's a hint to people learning regular expressions: start in verbose
mode, then de-verbose it if you must.


New one on me. Where does one find out about verbose mode?
I don't see it in the re docs?

I see an re.X flag but while it seems to be similar in purpose
yet it is different to your style above (no parens for example)?



https://docs.python.org/3/library/re.html#module-contents re.X and 
re.VERBOSE are together.
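
For comparison, here is a small sketch of the same pattern written with the re.VERBOSE flag and ordinary "#" comments instead of the inline (?x) and (?#...) forms. The sample string is a shortened version of the line from the original question:

```python
import re

# In verbose mode, unescaped whitespace in the pattern is ignored and
# "#" starts a comment that runs to the end of the line.
pattern = re.compile(r"""
    (.+?):          # capture one or more characters, followed by a colon
    \s+             # one or more whitespace
    (\d+)           # capture one or more digits
    (?:             # don't capture ...
        \s+         #   one or more whitespace
        \(.*?\)     #   anything inside round brackets
    )?              # ... and the whole group is optional
    \s*             # ignore trailing spaces
    """, re.VERBOSE)

sample = "Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%)"
pairs = pattern.findall(sample)
print(pairs)
```

Both styles compile to the same regular expression; the flag form is the one described under re.X / re.VERBOSE in the docs.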


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Re: [Tutor] Regular expression on python

2015-04-14 Thread Steven D'Aprano
On Wed, Apr 15, 2015 at 12:49:26AM +0100, Alan Gauld wrote:

 New one on me. Where does one find out about verbose mode?
 I don't see it in the re docs?
 
 I see an re.X flag but while it seems to be similar in purpose
 yet it is different to your style above (no parens for example)?

I presume it is documented in the main docs, but I actually found this 
in the Python Pocket Reference by Mark Lutz :-)

All of the regex flags have three forms:

- a numeric flag with a long name;
- the same numeric flag with a short name;
- a regular expression pattern.

So you can either do:

re.compile(pattern, flags)

or embed the flag in the pattern. The flags that I know of are:

(?i) re.I re.IGNORECASE
(?L) re.L re.LOCALE
(?m) re.M re.MULTILINE
(?s) re.S re.DOTALL
(?x) re.X re.VERBOSE

The flag can appear anywhere in the pattern and applies to the whole 
pattern, but it is good practice to put them at the front, and in the 
future it may be an error to put the flags elsewhere.

When provided as a separate argument, you can combine flags like this:

re.I|re.X
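
As a small illustration of the two spellings (the pattern and sample text here are invented for the example), the following compiles the same case-insensitive, verbose pattern once with a flags argument and once with the flags embedded at the front:

```python
import re

text = "Input Read Pairs: 2127436"

# Flags passed as a separate argument, combined with |.
by_argument = re.compile(r"input \s+ read", re.I | re.X)

# The same flags embedded at the very start of the pattern.
inline = re.compile(r"(?ix) input \s+ read")

print(by_argument.match(text).group())
print(inline.match(text).group())
```

Both patterns match the same text, so which form to use is mostly a matter of readability.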


-- 
Steve


Re: [Tutor] Regular expression on python

2015-04-13 Thread Alan Gauld

On 13/04/15 13:29, jarod...@libero.it wrote:


Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward Only 
Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%) Dropped: 308617 
(14.51%)


It's not clear where the tabs are in this line.
But if they are after the numbers, like so:

Input Read Pairs: 2127436 \t
Both Surviving: 1795091 (84.38%) \t
Forward Only Surviving: 17315 (0.81%) \t
Reverse Only Surviving: 6413 (0.30%) \t
Dropped: 308617 (14.51%)

Then you may not need to use regular expressions.
Simply split by tab then split by :
And if the 'number' contains parens split again by space
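
A minimal sketch of that idea, assuming the fields really are separated by tabs (the original post does not confirm this) and using a made-up sample line:

```python
# Assumption: each "key: number (percent)" field ends with a tab.
line = ("Input Read Pairs: 2127436\t"
        "Both Surviving: 1795091 (84.38%)\t"
        "Forward Only Surviving: 17315 (0.81%)\t"
        "Reverse Only Surviving: 6413 (0.30%)\t"
        "Dropped: 308617 (14.51%)\n")

counts = {}
for field in line.strip().split("\t"):
    key, value = field.split(":")
    # value looks like " 1795091 (84.38%)"; the first word is the number.
    counts[key.strip()] = int(value.split()[0])

print(counts)
```

If the separators turn out to be plain spaces rather than tabs, this will not work and a different split (or a regex) is needed.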


  with open("255.trim.log","r") as p:
      for i in p:
          lines= i.strip("\t")


lines is a bad name here since it's only a single line. In fact I'd lose 
the 'i' variable and just use


for line in p:


          if lines.startswith("Input"):
              tp = lines.split("\t")
              print re.findall("Input\d",str(tp))


Input is not followed by a number. You need a more powerful pattern.
Which is why I recommend trying to solve it as far as possible
without using regex.


So I started to find : from the row:
  with open("255.trim.log","r") as p:
      for i in p:
          lines= i.strip("\t")
          if lines.startswith("Input"):
              tp = lines.split("\t")
              print re.findall(":",str(tp[0]))


Does finding the colons really help much?
Or at least, does it help any more than splitting by colon would?


And I'm able to find, but when I try to take the number using \d not work.
Someone can explain why?


Because your pattern doesn't match the string.

HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] Regular expression on python

2015-04-13 Thread Alan Gauld

On 13/04/15 19:42, Alan Gauld wrote:


          if lines.startswith("Input"):
              tp = lines.split("\t")
              print re.findall("Input\d",str(tp))


Input is not followed by a number. You need a more powerful pattern.
Which is why I recommend trying to solve it as far as possible
without using regex.


I also just realised that you call split there then take the str() of 
the result. That means you are searching the string representation
of a list, which doesn't seem to make much sense?
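
To see what that means in practice, here is a short sketch with an invented, tab-separated sample line. It shows what str() produces for a list and how searching one element directly avoids the extra brackets, quotes and commas:

```python
import re

line = "Input Read Pairs: 2127436\tBoth Surviving: 1795091 (84.38%)"
tp = line.split("\t")

# str() of a list is its repr: brackets, quotes and commas included.
as_text = str(tp)
print(as_text)

# Searching a single element (a real string) avoids that noise.
numbers = re.findall(r"\d+", tp[0])
print(numbers)
```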


--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] Regular expression on python

2015-04-13 Thread Steven D'Aprano
On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod...@libero.it wrote:
 Dear all.
 I would like to extract from some file some data.
 The line I'm interested is this:
 
 Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward 
 Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%) 
 Dropped: 308617 (14.51%)


Some people, when confronted with a problem, think "I know, I'll 
use regular expressions." Now they have two problems.
-- Jamie Zawinski

I swear that Perl has been a blight on an entire generation of 
programmers. All they know is regular expressions, so they turn every 
data processing problem into a regular expression. Or at least they 
*try* to. As you have learned, regular expressions are hard to read, 
hard to write, and hard to get correct.

Let's write some Python code instead.


def extract(line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line %s.' % extra)
    return results



Now let me test it:

py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
...         ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
...         ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
...         ' 308617 (14.51%)\n')
py>
py> print(line)
Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward 
Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%) 
Dropped: 308617 (14.51%)

py> extract(line)
{'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving': 
6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}


Remember that dicts are unordered. All the data is there, but in 
arbitrary order. Now that you have a nice function to extract the data, 
you can apply it to the lines of a data file in a simple loop:

with open("255.trim.log") as p:
    for line in p:
        if line.startswith("Input "):
            d = extract(line)
            print(d)  # or process it somehow



-- 
Steven


Re: [Tutor] Regular expression

2014-09-23 Thread Steven D'Aprano
On Tue, Sep 23, 2014 at 11:40:25AM +0200, jarod...@libero.it wrote:
 Hi there!!
 
 I need to read this file:
 
 pippo.count :
  10566 ZXDC
2900 ZYG11A
7909 ZYG11B
3584 ZYX
9614 ZZEF1
   17704 ZZZ3

 How can extract only the number and the work in array? Thanks for any help

There is no need for the nuclear-powered bulldozer of regular 
expressions just to crack this peanut.

with open('pippo.count') as f:
    for line in f:
        num, word = line.split()
        num = int(num)
        print num, word


Or, if you prefer the old-fashioned way:

f = open('pippo.count')
for line in f:
    num, word = line.split()
    num = int(num)
    print num, word
f.close()


but the first way with the with-statement is better.
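
Since the question asked for the numbers and words "in array", here is a sketch in Python 3 syntax that collects them into two lists instead of printing them. The sample string stands in for the contents of pippo.count so that the example runs on its own:

```python
# Stand-in for the file contents; with a real file, loop over
# open('pippo.count') instead of sample.splitlines().
sample = """\
  10566 ZXDC
   2900 ZYG11A
   7909 ZYG11B
"""

numbers = []
words = []
for line in sample.splitlines():
    if not line.strip():
        continue  # skip blank lines
    num, word = line.split()
    numbers.append(int(num))
    words.append(word)

print(numbers, words)
```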


-- 
Steven


Re: [Tutor] Regular expression - I

2014-02-18 Thread Santosh Kumar
Steve,

I am trying to understand the r (raw string) notation. Am I understanding it wrong?
Rather than using \, it says we can use the r option.

http://docs.python.org/2/library/re.html

Check the first paragraph for the above link.

Thanks,
santosh






-- 
D. Santosh Kumar
RHCE | SCSA
+91-9703206361


Every task has a unpleasant side .. But you must focus on the end result
you are producing.


Re: [Tutor] Regular expression - I

2014-02-18 Thread Steve Willoughby
The problem is not the use of the raw string, but rather the regular expression 
inside it.

In regular expressions, the * means that whatever appears before it may be 
repeated zero or more times.  So if you say H* that means zero or more H’s in a 
row.  I think you mean an H followed by any number of other characters which 
would be H.*  (the . matches any single character, so .* means zero or more of 
any characters).

On the other hand, H\* means to match an H followed by a literal asterisk 
character.
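
The three patterns can be compared side by side. This is only a sketch; the sample text is invented for the example:

```python
import re

text = "H*test"

# H*  : zero or more H's in a row
zero_or_more_h = re.match(r"H*", text).group()

# H.* : an H followed by any number of other characters
h_then_anything = re.match(r"H.*", text).group()

# H\* : an H followed by a literal asterisk
literal_star = re.match(r"H\*", text).group()

print(zero_or_more_h, h_then_anything, literal_star)
```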

Does that help clarify why one matched and the other doesn’t?

steve



Re: [Tutor] Regular expression - I

2014-02-18 Thread Steve Willoughby
Because the regular expression <H*> means “match an angle-bracket character, 
zero or more H characters, followed by a close angle-bracket character” and 
your string does not match that pattern.

This is why it’s best to check that the match succeeded before going ahead to 
call group() on the result (since in this case there is no result).
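
A minimal sketch of that check, assuming the string in question was '<H*>test<H*>' (the angle brackets are inferred from the description of the pattern above, not shown in this archive):

```python
import re

string = "<H*>test<H*>"

# Unescaped *: "<", zero or more H, then ">". This does not match.
match = re.match(r"<H*>", string)
if match is None:
    result = "no match"
else:
    result = match.group()
print(result)

# Escaped *: "<H", a literal asterisk, then ">". This does match.
escaped = re.match(r"<H\*>", string)
if escaped is not None:
    print(escaped.group())
```

Testing for None first turns the AttributeError into an ordinary branch of the program.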


On 18-Feb-2014, at 09:52, Santosh Kumar rhce@gmail.com wrote:

 
 Hi All,
 
 If you notice the below example, case I is working as expected.
 
 Case I:
 In [41]: string = "<H*>test<H*>"
 
 In [42]: re.match('<H\*>',string).group()
 Out[42]: '<H*>'
 
 But why is the raw string 'r' not working as expected ?
 
 Case II:
 
 In [43]: re.match(r'<H*>',string).group()
 ---------------------------------------------------------------------------
 AttributeError                            Traceback (most recent call last)
 <ipython-input-43-d66b47f01f1c> in <module>()
 ----> 1 re.match(r'<H*>',string).group()
 
 AttributeError: 'NoneType' object has no attribute 'group'
 
 In [44]: re.match(r'<H*>',string)
 
 
 
 Thanks,
 santosh
 


Re: [Tutor] Regular expression - I

2014-02-18 Thread Zachary Ware
Hi Santosh,

On Tue, Feb 18, 2014 at 9:52 AM, Santosh Kumar rhce@gmail.com wrote:

 Hi All,

 If you notice the below example, case I is working as expected.

 Case I:
 In [41]: string = "<H*>test<H*>"

 In [42]: re.match('<H\*>',string).group()
 Out[42]: '<H*>'

 But why is the raw string 'r' not working as expected ?

 Case II:

 In [43]: re.match(r'<H*>',string).group()
 ---------------------------------------------------------------------------
 AttributeError                            Traceback (most recent call last)
 <ipython-input-43-d66b47f01f1c> in <module>()
 ----> 1 re.match(r'<H*>',string).group()

 AttributeError: 'NoneType' object has no attribute 'group'

 In [44]: re.match(r'<H*>',string)

It is working as expected, but you're not expecting the right thing
;).  Raw strings don't escape anything, they just prevent backslash
escapes from expanding.  Case I works because \* is not a special
character to Python (like \n or \t), so it leaves the backslash in
place:

>>> '<H\*>'
'<H\*>'

The equivalent raw string is exactly the same in this case:

>>> r'<H\*>'
'<H\*>'

The raw string you provided doesn't have the backslash, and Python
will not add backslashes for you:

>>> r'<H*>'
'<H*>'

The purpose of raw strings is to prevent Python from recognizing
backslash escapes.  For example:

>>> path = 'C:\temp\new\dir' # Windows paths are notorious...
>>> path   # it looks mostly ok... [1]
'C:\temp\new\\dir'
>>> print(path)  # until you try to use it
C:  emp
ew\dir
>>> path = r'C:\temp\new\dir'  # now try a raw string
>>> path   # Now it looks like it's stuffed full of backslashes [2]
'C:\\temp\\new\\dir'
>>> print(path)  # but it works properly!
C:\temp\new\dir

[1] Count the backslashes in the repr of 'path'.  Notice that there is
only one before the 't' and the 'n', but two before the 'd'.  \d is
not a special character, so Python didn't do anything to it.  There
are two backslashes in the repr of \d, because that's the only way
to distinguish a real backslash; the \t and \n are actually the
TAB and LINE FEED characters, as seen when printing 'path'.

[2] Because they are all real backslashes now, so they have to be
shown escaped (\\) in the repr.
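
One way to convince yourself of the difference is to count characters. The following sketch uses a shortened version of the path above:

```python
normal = 'C:\temp\new'   # \t and \n become TAB and LINE FEED
raw = r'C:\temp\new'     # backslashes are kept as they are

print(len(normal))       # the two escapes collapse to one character each
print(len(raw))          # every backslash counts as a character

# A raw string is just another way to spell a string with doubled backslashes.
same = (raw == 'C:\\temp\\new')
print(same)
```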

In your regex, since you're looking for, literally, <H*>, you'll
need to backslash escape the * since it is a special character *in
regular expressions*.  To avoid having to keep track of what's special
to Python as well as regular expressions, you'll need to make sure the
backslash itself is escaped, to make sure the regex sees \*, and the
easiest way to do that is a raw string:

>>> re.match(r'<H\*>', string).group()
'<H*>'

I hope this makes some amount of sense; I've had to write it up
piecemeal and will never get it posted at all if I don't go ahead and
post :).  If you still have questions, I'm happy to try again.  You
may also want to have a look at the Regex HowTo in the Python docs:
http://docs.python.org/3/howto/regex.html

Hope this helps,

-- 
Zach


Re: [Tutor] Regular expression - I

2014-02-18 Thread Mark Lawrence

On 18/02/2014 18:03, Steve Willoughby wrote:

Because the regular expression <H*> means “match an angle-bracket character, 
zero or more H characters, followed by a close angle-bracket character” and your 
string does not match that pattern.

This is why it’s best to check that the match succeeded before going ahead to 
call group() on the result (since in this case there is no result).




Please do not top post on this list.

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Tutor] Regular expression - I

2014-02-18 Thread Zachary Ware
On Tue, Feb 18, 2014 at 11:39 AM, Zachary Ware
zachary.ware+py...@gmail.com wrote:
<snip>
 >>> '<H\*>'
 '<H\*>'

 The equivalent raw string is exactly the same in this case:

 >>> r'<H\*>'
 '<H\*>'

Oops, I mistyped both of these.  The repr should be '<H\\*>' in both cases.

Sorry for the confusion!

-- 
Zach


Re: [Tutor] Regular expression - I

2014-02-18 Thread S Tareq
Does anyone know how to use the 2to3 program to convert 2.7 code to 3.x? 
Please, I need help, sorry.





Re: [Tutor] Regular expression - I

2014-02-18 Thread Albert-Jan Roskam


_
 From: Steve Willoughby st...@alchemy.com
To: Santosh Kumar rhce@gmail.com 
Cc: python mail list tutor@python.org 
Sent: Tuesday, February 18, 2014 7:03 PM
Subject: Re: [Tutor] Regular expression - I
 

Because the regular expression <H*> means “match an angle-bracket character, 
zero or more H characters, followed by a close angle-bracket character” and 
your string does not match that pattern.

This is why it’s best to check that the match succeeded before going ahead to 
call group() on the result (since in this case there is no result).


On 18-Feb-2014, at 09:52, Santosh Kumar rhce@gmail.com wrote:


You also might want to consider making it a non-greedy match. The explanation 
http://docs.python.org/2/howto/regex.html covers an example almost identical to 
yours:

Greedy versus Non-Greedy

When repeating a regular expression, as in a*, the resulting action is to
consume as much of the pattern as possible.  This fact often bites you when
you’re trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag.  The naive pattern for matching a single HTML tag
doesn’t work because of the greedy nature of .*.

>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>

The RE matches the '<' in <html>, and the .* consumes the rest of
the string.  There’s still more left in the RE, though, and the > can’t
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the >.   The
final match extends from the '<' in <html> to the '>' in </title>, which isn’t 
what you want.

In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or 
{m,n}?, which match as little text as possible.  In the above
example, the '>' is tried immediately after the first '<' matches, and
when it fails, the engine advances a character at a time, retrying the '>' at 
every step.  This produces just the right result:

>>> print re.match('<.*?>', s).group()
<html>
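
The same comparison in Python 3 syntax, as a short self-contained sketch using the string from the HOWTO:

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: .* grabs as much as it can, then backtracks to the last ">".
greedy = re.match(r'<.*>', s).group()

# Non-greedy: .*? grabs as little as it can, stopping at the first ">".
non_greedy = re.match(r'<.*?>', s).group()

print(greedy)
print(non_greedy)
```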


Re: [Tutor] Regular expression - I

2014-02-18 Thread Emile van Sebille

On 2/18/2014 11:42 AM, Mark Lawrence wrote:

On 18/02/2014 18:03, Steve Willoughby wrote:

Because the regular expression <H*> means “match an angle-bracket


<SNIP>


Please do not top post on this list.


Appropriate trimming is also appreciated.

Emile






Re: [Tutor] Regular expression - I

2014-02-18 Thread spir


In addition to all this:
* You may be confusing raw strings with regex escaping (re.escape(), a helper 
function that escapes special regex characters for you).
* For simplicity, always use raw strings for regex patterns (as in your second 
example); this does not spare you from escaping special regex characters, but 
you only have to do it once!
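
A brief sketch of regex escaping with re.escape(), assuming the string from the original question was '<H*>test<H*>' (the angle brackets are an assumption; they do not appear in this archive):

```python
import re

string = "<H*>test<H*>"
literal = "<H*>"   # the exact text we want to find

# re.escape() backslash-escapes the characters that are special in a regex,
# so the result matches the literal text.
pattern = re.escape(literal)
found = re.match(pattern, string).group()
print(found)
```

This is handy when the text to search for comes from user input or a file rather than being typed into the pattern by hand.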


d


Re: [Tutor] Regular expression - I

2014-02-18 Thread Santosh Kumar
Thank you all. I got it. :)
I need to read more between the lines.


On Wed, Feb 19, 2014 at 4:25 AM, spir denis.s...@gmail.com wrote:

 On 02/18/2014 08:39 PM, Zachary Ware wrote:

 Hi Santosh,

 On Tue, Feb 18, 2014 at 9:52 AM, Santosh Kumar rhce@gmail.com
 wrote:


 Hi All,

 If you notice the below example, case I is working as expected.

 Case I:
 In [41]: string = H*testH*

 In [42]: re.match('H\*',string).group()
 Out[42]: 'H*'

 But why is the raw string 'r' not working as expected ?

 Case II:

 In [43]: re.match(r'H*',string).group()
 
 ---
 AttributeErrorTraceback (most recent call
 last)
 ipython-input-43-d66b47f01f1c in module()
  1 re.match(r'H*',string).group()

 AttributeError: 'NoneType' object has no attribute 'group'

 In [44]: re.match(r'H*',string)


 It is working as expected, but you're not expecting the right thing
 ;).  Raw strings don't escape anything, they just prevent backslash
 escapes from expanding.  Case I works because \* is not a special
 character to Python (like \n or \t), so it leaves the backslash in
 place:

  'H\*'
 'H\*'

 The equivalent raw string is exactly the same in this case:

  r'H\*'
 'H\*'

 The raw string you provided doesn't have the backslash, and Python
 will not add backslashes for you:

  r'H*'
 'H*'

 The purpose of raw strings is to prevent Python from recognizing
 backslash escapes.  For example:

  path = 'C:\temp\new\dir' # Windows paths are notorious...
  path   # it looks mostly ok... [1]
 'C:\temp\new\\dir'
  print(path)  # until you try to use it
 C:  emp
 ew\dir
  path = r'C:\temp\new\dir'  # now try a raw string
  path   # Now it looks like it's stuffed full of backslashes [2]
 'C:\\temp\\new\\dir'
  print(path)  # but it works properly!
 C:\temp\new\dir

 [1] Count the backslashes in the repr of 'path'.  Notice that there is
 only one before the 't' and the 'n', but two before the 'd'.  \d is
 not a special character, so Python didn't do anything to it.  There
 are two backslashes in the repr of \d, because that's the only way
 to distinguish a real backslash; the \t and \n are actually the
 TAB and LINE FEED characters, as seen when printing 'path'.

 [2] Because they are all real backslashes now, so they have to be
 shown escaped (\\) in the repr.

 In your regex, since you're looking for, literally, H*, you'll
 need to backslash escape the * since it is a special character *in
 regular expressions*.  To avoid having to keep track of what's special
 to Python as well as regular expressions, you'll need to make sure the
 backslash itself is escaped, to make sure the regex sees \*, and the
 easiest way to do that is a raw string:

  re.match(r'H\*', string).group()
 'H*'

 I hope this makes some amount of sense; I've had to write it up
 piecemeal and will never get it posted at all if I don't go ahead and
 post :).  If you still have questions, I'm happy to try again.  You
 may also want to have a look at the Regex HowTo in the Python docs:
 http://docs.python.org/3/howto/regex.html


 In addition to all this:
 * Do not confuse raw strings with regex escaping (re.escape, a helper
 function that escapes special regex characters for you).
 * For simplicity, always use raw strings for regex patterns (as in your
 second example); this does not spare you from escaping special characters,
 but you only have to do it once!
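
 A minimal sketch of the distinction made in the two points above, written for
 Python 3. The sample text is an assumption made up for illustration; re.escape
 and re.search are the standard library calls.

```python
import re

# A raw string only stops Python from interpreting backslashes;
# the regex engine still treats '*' as a quantifier unless it is escaped.
literal = 'H*'
escaped = re.escape(literal)      # builds the pattern H\* for us

text = 'the H* marker'            # hypothetical sample text
by_hand = re.search(r'H\*', text)       # '*' escaped by hand, in a raw string
by_helper = re.search(escaped, text)    # '*' escaped by re.escape

print(escaped)
print(by_hand.group())
print(by_helper.group())
```

 Both searches find the literal two-character text H*; the raw string and
 re.escape solve different problems and can be combined.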


 d
 ___
 Tutor maillist  -  Tutor@python.org
 To unsubscribe or change subscription options:
 https://mail.python.org/mailman/listinfo/tutor




-- 
D. Santosh Kumar
RHCE | SCSA
+91-9703206361


Every task has a unpleasant side .. But you must focus on the end result
you are producing.


Re: [Tutor] regular expression wildcard search

2012-12-11 Thread Emma Birath
Hi there

Do you want your * to represent a single letter, or what is your intent?

If you want only a single letter between the V and VVP, use \w
instead of *.

re.search(r'V\wVVP', myseq)
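
A small sketch of that suggestion in Python 3. The sequence below is a
shortened copy of the one in the question with the spaces removed, which is
an assumption about the real data. Note that \w also accepts digits and the
underscore; [A-Z] would be stricter for protein letters.

```python
import re

# Shortened sample sequence (assumption: real data has no embedded spaces).
myseq = 'MMSASRLAGTLIPAMAFLSCVRPESWEPCVEVVPNITYQ'

hit = re.search(r'V\wVVP', myseq)        # V, exactly one word character, VVP
miss = re.search(r'V\wVVP', 'AAVVPAA')   # only a bare VVP, so no match

print(hit.group())
print(miss)
```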

Emma

On Tue, Dec 11, 2012 at 8:54 AM, Hs Hs ilhs...@yahoo.com wrote:

 Dear group:

 I have 50 thousand lists. My aim is to search a pattern in the
 alphabetical strings (these are protein sequence strings).


 MMSASRLAGTLIPAMAFLSCVRPESWEPC VEVVP
 NITYQCMELNFYKIPDNLPFSTKNLDLSFNPLRHLGSYSFFSFPELQVLDLSRCEIQTIED

 my aim is to find the list of string that has V*VVP.

 myseq = 'MMSASRLAGTLIPAMAFLSCVRPESWEPC VEVVP
 NITYQCMELNFYKIPDNLPFSTKNLDLSFNPLRHLGSYSFFSFPELQVLDLSRCEIQTIED'

 if re.search('V*VVP',myseq):
 print myseq

 the problem with this is, I am also finding junk with just VVP or VP etc.

 How can I strictly search for V*VVP only.

 Thanks for help.

 Hs



Re: [Tutor] regular expression wildcard search

2012-12-11 Thread Joel Goldstick
On Tue, Dec 11, 2012 at 10:54 AM, Hs Hs ilhs...@yahoo.com wrote:

 Dear group:


Please send mail as plain text.  It is easier to read


 I have 50 thousand lists. My aim is to search a pattern in the
 alphabetical strings (these are protein sequence strings).


 MMSASRLAGTLIPAMAFLSCVRPESWEPC VEVVP
 NITYQCMELNFYKIPDNLPFSTKNLDLSFNPLRHLGSYSFFSFPELQVLDLSRCEIQTIED

 my aim is to find the list of string that has V*VVP.


The asterisk (*) matches 0 or more instances of the previous element.

I am not sure what you want, but I don't think it is this.  Do you want V
then any characters followed by VVP?  In that case perhaps

V.+VVP
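
To see the difference, here is a short Python 3 sketch. The three sample
strings are made up for illustration; the patterns are the original V*VVP,
a single-character wildcard V.VVP, and the V.+VVP form for "V, then one or
more characters, then VVP".

```python
import re

samples = ['VVP', 'VEVVP', 'VAAAVVP']   # hypothetical test strings

# 'V*' may match zero V's, so any string containing VVP is accepted.
star = [bool(re.search(r'V*VVP', s)) for s in samples]
# '.' demands exactly one character between V and VVP.
one = [bool(re.search(r'V.VVP', s)) for s in samples]
# '.+' demands one or more characters between V and VVP.
plus = [bool(re.search(r'V.+VVP', s)) for s in samples]

print(star)
print(one)
print(plus)
```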


There are many tutorials about how to create regular expressions



 myseq = 'MMSASRLAGTLIPAMAFLSCVRPESWEPC VEVVP
 NITYQCMELNFYKIPDNLPFSTKNLDLSFNPLRHLGSYSFFSFPELQVLDLSRCEIQTIED'

 if re.search('V*VVP',myseq):
 print myseq

 the problem with this is, I am also finding junk with just VVP or VP etc.

 How can I strictly search for V*VVP only.

 Thanks for help.

 Hs





-- 
Joel Goldstick


Re: [Tutor] regular expression wildcard search

2012-12-11 Thread Alan Gauld

On 11/12/12 15:54, Hs Hs wrote:


myseq = 'MMSASRLAGTLIPAMAFLSCVRPESWEPC VEVVP
NITYQCMELNFYKIPDNLPFSTKNLDLSFNPLRHLGSYSFFSFPELQVLDLSRCEIQTIED'

if re.search('V*VVP',myseq):
print myseq


I hope this is just a typo but you are printing your original string not 
the things found...
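
A possible way to print what was found rather than the whole string, sketched
in Python 3. The list of sequences and the pattern V\wVVP are assumptions for
illustration; the match object methods group() and span() are standard.

```python
import re

# Hypothetical short sequences standing in for the 50 thousand real ones.
sequences = ['MMSASRLCVEVVPNITYQ', 'AAVVPAA', 'KLVQVVPGG']
pattern = re.compile(r'V\wVVP')

found = []
for seq in sequences:
    m = pattern.search(seq)
    if m:
        # m.group() is the matched text, m.span() its position in seq
        found.append((m.group(), m.span()))
        print(m.group(), m.span())
```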



--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/



Re: [Tutor] Regular expression grouping insert thingy

2010-06-08 Thread Matthew Wood
re.sub(r'(\d+)x', r'\1*x', input_text)
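
A runnable Python 3 sketch of that one-liner, using the string from the
question. The second call uses the equivalent \g<1> spelling of the
backreference, and its input is a made-up example.

```python
import re

input_text = '6443x - 3'

# (\d+) captures the run of digits; \1 in the replacement puts it back,
# followed by the literal '*x'.
result = re.sub(r'(\d+)x', r'\1*x', input_text)

# \g<1> is the unambiguous form of \1, handy when a digit follows the group.
result_all = re.sub(r'(\d+)x', r'\g<1>*x', '2x + 10x')

print(result)
print(result_all)
```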

--

I enjoy haiku
but sometimes they don't make sense;
refrigerator?


On Tue, Jun 8, 2010 at 10:11 PM, Lang Hurst l...@tharin.com wrote:

 This is so trivial (or should be), but I can't figure it out.

 I'm trying to do what in vim is

 :s/\([0-9]\)x/\1*x/

 That is, find a number followed by an x and put a * in between the
 number and the x

 So, if the string is 6443x - 3, I'll get back 6443*x - 3

 I won't write down all the things I've tried, but suffice it to say,
 nothing has done it.  I just found myself figuring out how to call sed and
 realized that this should be a one-liner in python too.  Any ideas?  I've
 read a lot of documentation, but I just can't figure it out.  Thanks.

 --
 There are no stupid questions, just stupid people.



Re: [Tutor] Regular expression grouping insert thingy

2010-06-08 Thread Lang Hurst
Oh.  Crap, I knew it would be something simple, but honestly, I don't 
think that I would have gotten there.  Thank you so much.  Seriously 
saved me more grey hair.


Matthew Wood wrote:

re.sub(r'(\d+)x', r'\1*x', input_text)

--

I enjoy haiku
but sometimes they don't make sense;
refrigerator?


On Tue, Jun 8, 2010 at 10:11 PM, Lang Hurst l...@tharin.com wrote:


This is so trivial (or should be), but I can't figure it out.

I'm trying to do what in vim is

:s/\([0-9]\)x/\1*x/

That is, find a number followed by an x and put a * in between
the number and the x

So, if the string is 6443x - 3, I'll get back 6443*x - 3

I won't write down all the things I've tried, but suffice it to
say, nothing has done it.  I just found myself figuring out how to
call sed and realized that this should be a one-liner in python
too.  Any ideas?  I've read a lot of documentation, but I just
can't figure it out.  Thanks.

-- 
There are no stupid questions, just stupid people.







--
There are no stupid questions, just stupid people.



Re: [Tutor] regular expression question

2009-04-28 Thread Marek Spociński
 Hello,
 
 The following code returns 'abc123abc45abc789jk'. How do I revise the pattern
 so that the return value will be 'abc789jk'? In other words, I want to find the
 pattern 'abc' that is closest to 'jk'. Here the string '123', '45' and '789'
 are just examples. They are actually quite different in the string that I'm
 working with.
 
 import re
 s = 'abc123abc45abc789jk'
 p = r'abc.+jk'
 lst = re.findall(p, s)
 print lst[0]

I suggest using r'abc.+?jk' instead.

The additional ? makes the preceding '.+' non-greedy, so instead of matching as
long a string as it can, it matches as short a string as possible.




Re: [Tutor] regular expression question

2009-04-28 Thread Marek Spociński , Poland
On 28 April 2009 at 11:16, Andre Engels andreeng...@gmail.com wrote:
 2009/4/28 Marek spociń...@go2.pl,Poland :
  Hello,
 
  The following code returns 'abc123abc45abc789jk'. How do I revise the
  pattern so that the return value will be 'abc789jk'? In other words, I want
  to find the pattern 'abc' that is closest to 'jk'. Here the string '123',
  '45' and '789' are just examples. They are actually quite different in the
  string that I'm working with.
 
  import re
  s = 'abc123abc45abc789jk'
  p = r'abc.+jk'
  lst = re.findall(p, s)
  print lst[0]
 
  I suggest using r'abc.+?jk' instead.
 
  the additional ? makes the preceeding '.+' non-greedy so instead of 
  matching as long string as it can it matches as short string as possible.
 
 That was my first idea too, but it does not work for this case,
 because Python will still try to _start_ the match as soon as
 possible. To use .+? one would have to revert the string, then use the
 reverse regular expression on the result, which looks like a rather
 roundabout way of doing things.

I don't have access to Python right now, so I cannot test my ideas,
and I don't want to give you a wrong idea either.


Re: [Tutor] regular expression question

2009-04-28 Thread spir
On Tue, 28 Apr 2009 11:06:16 +0200,
Marek spociń...@go2.pl,  Poland marek...@10g.pl wrote:

  Hello,
  
  The following code returns 'abc123abc45abc789jk'. How do I revise the
  pattern so that the return value will be 'abc789jk'? In other words, I
  want to find the pattern 'abc' that is closest to 'jk'. Here the string
  '123', '45' and '789' are just examples. They are actually quite
  different in the string that I'm working with. 
  
  import re
  s = 'abc123abc45abc789jk'
  p = r'abc.+jk'
  lst = re.findall(p, s)
  print lst[0]
 
 I suggest using r'abc.+?jk' instead.
 
 the additional ? makes the preceeding '.+' non-greedy so instead of
 matching as long string as it can it matches as short string as possible.

Non-greedy repetition will not work in this case, I guess:

from re import compile as Pattern
s = 'abc123abc45abc789jk'
p = Pattern(r'abc.+?jk')
print p.match(s).group()
==
abc123abc45abc789jk

(Someone explain why?)

My solution would be to explicitly exclude 'abc' from the sequence of chars
matched by '.+'. To do this, use negative lookahead (?!...) before '.':
p = Pattern(r'(abc((?!abc).)+jk)')
print p.findall(s)
==
[('abc789jk', '9')]

But it's not exactly what you want. Because the internal () needed to express 
exclusion will be considered by findall as a group to be returned, so that you 
also get the last char matched in there.
To avoid that, use non-grouping parens (?:...). This also avoids the need for 
parens around the whole format:
p = Pattern(r'abc(?:(?!abc).)+jk')
print p.findall(s)
['abc789jk']
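
The same idea as a Python 3 sketch. The first part repeats the pattern above;
the multi-line string and the use of re.DOTALL are assumptions added for the
case where the real text spans several lines, since '.' does not match a
newline by default.

```python
import re

s = 'abc123abc45abc789jk'

# (?!abc). consumes one character only if 'abc' does not start there,
# so the match can never run across a later 'abc'.
tempered = re.compile(r'abc(?:(?!abc).)+jk')
single_line = tempered.findall(s)

# Hypothetical multi-line input: DOTALL lets '.' cross line breaks.
multi = 'abc1\nabc2\nabc3\nx\njk'
without_dotall = tempered.findall(multi)
with_dotall = re.findall(r'abc(?:(?!abc).)+jk', multi, re.DOTALL)

print(single_line)
print(without_dotall)
print(with_dotall)
```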

Denis
--
la vita e estrany


Re: [Tutor] regular expression question

2009-04-28 Thread Kelie
Andre Engels andreengels at gmail.com writes:

 
 2009/4/28 Marek Spociński at go2.pl,Poland marek_sp at 10g.pl:

  I suggest using r'abc.+?jk' instead.
 

 
 That was my first idea too, but it does not work for this case,
 because Python will still try to _start_ the match as soon as
 possible. 

yeah, i tried the '?' as well and realized it would not work.




Re: [Tutor] regular expression question

2009-04-28 Thread Kent Johnson
2009/4/28 Marek spociń...@go2.pl,Poland marek...@10g.pl:

 import re
 s = 'abc123abc45abc789jk'
 p = r'abc.+jk'
 lst = re.findall(p, s)
 print lst[0]

 I suggest using r'abc.+?jk' instead.

 the additional ? makes the preceeding '.+' non-greedy so instead of matching 
 as long string as it can it matches as short string as possible.

Did you try it? It doesn't do what you expect, it still matches at the
beginning of the string.

The re engine searches for a match at a location and returns the first
one it finds. A non-greedy match doesn't mean "find the shortest
possible match anywhere in the string"; it means "find the shortest
possible match starting at this location".
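
A short Python 3 sketch of that behaviour. The first string is the one from
the question; the second, with two 'jk' endings, is made up to show where the
non-greedy form does make a difference.

```python
import re

s = 'abc123abc45abc789jk'

# Both searches start at the first 'abc'; with only one 'jk' available,
# greedy and non-greedy end at the same place.
greedy = re.search(r'abc.+jk', s).group()
lazy = re.search(r'abc.+?jk', s).group()

# Hypothetical string with two endings: same start, different end.
two = 'abc1jk abc2jk'
greedy_two = re.search(r'abc.+jk', two).group()
lazy_two = re.search(r'abc.+?jk', two).group()

print(greedy, lazy)
print(greedy_two, '|', lazy_two)
```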

Kent


Re: [Tutor] regular expression question

2009-04-28 Thread Kent Johnson
On Tue, Apr 28, 2009 at 4:03 AM, Kelie kf9...@gmail.com wrote:
 Hello,

 The following code returns 'abc123abc45abc789jk'. How do I revise the pattern
 so that the return value will be 'abc789jk'? In other words, I want to find the
 pattern 'abc' that is closest to 'jk'. Here the string '123', '45' and '789'
 are just examples. They are actually quite different in the string that I'm
 working with.

 import re
 s = 'abc123abc45abc789jk'
 p = r'abc.+jk'
 lst = re.findall(p, s)
 print lst[0]

re.findall() won't work because it finds non-overlapping matches.

If there is a character in the initial match which cannot occur in the
middle section, change .+ to exclude that character. For example,
r'abc[^a]+jk' works with your example.
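
A quick Python 3 check of the negated character class, with one made-up
counterexample showing its limit: it fails as soon as an 'a' can legitimately
appear in the middle section.

```python
import re

s = 'abc123abc45abc789jk'

# [^a]+ cannot cross the 'a' of a later 'abc', so only the last one matches.
works = re.findall(r'abc[^a]+jk', s)

# Hypothetical input where the middle section itself contains an 'a'.
fails = re.findall(r'abc[^a]+jk', 'abc7a9jk')

print(works)
print(fails)
```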

Another possibility is to look for the match starting at different
locations, something like this:
p = re.compile(r'abc.+jk')
lastMatch = None
i = 0
while i < len(s):
  m = p.search(s, i)   # look for the next match at or after position i
  if m is None:
    break
  lastMatch = m.group()
  i = m.start() + 1    # restart just past the start of this match

print lastMatch

Kent


Re: [Tutor] regular expression question

2009-04-28 Thread Andre Engels
2009/4/28 Marek spociń...@go2.pl,Poland marek...@10g.pl:
 Hello,

 The following code returns 'abc123abc45abc789jk'. How do I revise the
 pattern so that the return value will be 'abc789jk'? In other words, I want
 to find the pattern 'abc' that is closest to 'jk'. Here the string '123',
 '45' and '789' are just examples. They are actually quite different in the
 string that I'm working with.

 import re
 s = 'abc123abc45abc789jk'
 p = r'abc.+jk'
 lst = re.findall(p, s)
 print lst[0]

 I suggest using r'abc.+?jk' instead.

 the additional ? makes the preceeding '.+' non-greedy so instead of matching 
 as long string as it can it matches as short string as possible.

That was my first idea too, but it does not work for this case,
because Python will still try to _start_ the match as soon as
possible. To use .+? one would have to reverse the string, then use the
reversed regular expression on the result, which looks like a rather
roundabout way of doing things.
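
For completeness, the roundabout version can be sketched in Python 3 as
follows. Reversing the pattern by hand ('kj.+?cba') only works here because
the pattern is made of plain literals, which is an assumption about the real
data.

```python
import re

s = 'abc123abc45abc789jk'

# Search the reversed string with the reversed pattern: the non-greedy .+?
# now stops at the first 'cba', i.e. the 'abc' closest to 'jk'.
reversed_match = re.search(r'kj.+?cba', s[::-1])

# Flip the matched text back into its original order.
result = reversed_match.group()[::-1]

print(result)
```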



-- 
André Engels, andreeng...@gmail.com


Re: [Tutor] regular expression question

2009-04-28 Thread Kelie
spir denis.spir at free.fr writes:

 To avoid that, use non-grouping parens (?:...). This also avoids the need for
parens around the whole format:
 p = Pattern(r'abc(?:(?!abc).)+jk')
 print p.findall(s)
 ['abc789jk']
 
 Denis


This one works! Thank you Denis. I'll try it out on the actual much longer
(multiline) string and see what happens.



Re: [Tutor] regular expression problem

2009-04-15 Thread bob gailer

Spencer Parker wrote:

I have a python script that takes a text file as an argument.  It then loops
through the text file pulling out specific lines of text that I want.  I
have a regular expression that evaluates the text to see if it matches a

specific phrase.  Right now I have it writing to another text file that
output.  The problem I am having is that it finds the phrase prints it, but
then it continuously prints the statement.  There is only 1 entries in the

file for the result it finds, but it prints it multiple times...several
hundred before it moves onto the next one.  But it appends the first one to
the next entry...and does this till it finds everything.

http://dpaste.com/33982/


Any Help?
  


dedent the 2nd for loop.

--
Bob Gailer
Chapel Hill NC
919-636-4239


Re: [Tutor] regular expression problem

2009-04-15 Thread Spencer Parker
After he said that...I realized where I was being dumb...

On Wed, Apr 15, 2009 at 10:29 AM, bob gailer bgai...@gmail.com wrote:

 Spencer Parker wrote:

 I have a python script that takes a text file as an argument.  It then
 loops
 through the text file pulling out specific lines of text that I want.  I
 have a regular expression that evaluates the text to see if it matches a

 specific phrase.  Right now I have it writing to another text file that
 output.  The problem I am having is that it finds the phrase prints it,
 but
 then it continuously prints the statement.  There is only 1 entries in the

 file for the result it finds, but it prints it multiple times...several
 hundred before it moves onto the next one.  But it appends the first one
 to
 the next entry...and does this till it finds everything.

 http://dpaste.com/33982/


 Any Help?



 dedent the 2nd for loop.

 --
 Bob Gailer
 Chapel Hill NC
 919-636-4239



Re: [Tutor] Regular expression oddity

2008-11-23 Thread spir

bob gailer wrote:

Emmanuel Ruellan wrote:

Hi tutors!

While trying to write a regular expression that would split a string
the way I want, I noticed a behaviour I didn't expect.

 

re.findall('.?', 'some text')


['s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't', '']

Where does the last string, the empty one, come from?
I find this behaviour rather annoying: I'm getting one group too many.
  
The ? means 0 or 1 occurrence. I think re is matching the null string at 
the end.


Drop the ? and you'll get what you want.

Of course you can get the same thing using list('some text') at lower cost.


I find this fully consistent, for your regex means matching
* either any char
* or no char at all
Logically, you first get n chars, then one 'nothing'. Only after that will 
parsing be stopped because of end of string. Maybe clearer:

print re.findall('.?', '')
==> ['']
print re.findall('.', '')
==> []
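
The same comparison as a Python 3 sketch, using the string from the question.
It also includes the list() alternative mentioned in the quoted reply.

```python
import re

text = 'some text'

optional = re.findall('.?', text)   # '.?' can also match nothing at the end
required = re.findall('.', text)    # '.' must consume a character
as_list = list(text)                # no regex needed for single characters

print(optional)
print(required)
print(required == as_list)
```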
denis



Re: [Tutor] Regular expression oddity

2008-11-22 Thread bob gailer

Emmanuel Ruellan wrote:

Hi tutors!

While trying to write a regular expression that would split a string
the way I want, I noticed a behaviour I didn't expect.

  

re.findall('.?', 'some text')


['s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't', '']

Where does the last string, the empty one, come from?
I find this behaviour rather annoying: I'm getting one group too many.
  
The ? means 0 or 1 occurrence. I think re is matching the null string at 
the end.


Drop the ? and you'll get what you want.

Of course you can get the same thing using list('some text') at lower cost.

--
Bob Gailer
Chapel Hill NC 
919-636-4239




Re: [Tutor] Regular expression to match \f in groff input?

2008-08-21 Thread Danny Yoo
On Thu, Aug 21, 2008 at 1:40 PM, Bill Campbell [EMAIL PROTECTED] wrote:
 I've been beating my head against the wall try to figure out how
 to get regular expressions to match the font change sequences in
 *roff input (e.g. \fB for bold, \fP to revert to previous font).
 The re library maps r'\f' to the single form-feed character (as
 it does other common single-character sequences like
r'\n').


Does this example help?

###
>>> sampleText = r"\This\ has \backslashes\."
>>> print sampleText
\This\ has \backslashes\.
>>> import re
>>> re.findall(r"\\\w+\\", sampleText)
['\\This\\', '\\backslashes\\']
###


Re: [Tutor] Regular expression to match \f in groff input?

2008-08-21 Thread Alan Gauld


Bill Campbell [EMAIL PROTECTED] wrote 


to get regular expressions to match the font change sequences in
*roff input (e.g. \fB for bold, \fP to revert to previous font).
The re library maps r'\f' to the single form-feed character (as
it does other common single-character sequences like r'\n').


I think all you need is an extra \ to escape the \ character in \f


This does not work in Python:

s = re.sub(r'\f[1NR]', '</emphasis>', sinput)


Try

s = re.sub(r'\\f[1NR]', '</emphasis>', sinput)
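
A small Python 3 sketch of why the extra backslash matters. The sample line
and the font letters B, I, P and R in the character class are assumptions for
illustration; adjust the class to the escapes actually present in the input.

```python
import re

# Hypothetical groff input with font-change escapes.
sinput = r'\fBbold\fP and \fIitalic\fR text'

# In a pattern, \f alone means a form feed, so this finds nothing here.
form_feed_only = re.search(r'\f[BIPR]', sinput)

# \\f means a literal backslash followed by the letter f.
plain = re.sub(r'\\f[BIPR]', '', sinput)

print(form_feed_only)
print(plain)
```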

HTH,

--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld



Re: [Tutor] Regular Expression

2007-12-21 Thread Tiger12506
 I need to pull the highlighted data from a similar file and can't seem to get
 my script to work:

 Script:
 import re
 infile = open("filter.txt","r")
 outfile = open("out.txt","w")
 patt = re.compile(r"~02([\d{10}])")

You have to allow for the characters at the beginning and end too.
Try this.
re.compile(r".*~02(\d{10})~.*")

Also outfile.write("%s\n") literally writes %s\n
You need this I believe

outfile.write("%s\n" % m.group(1))


 for line in infile:
   m = patt.match(line)
   if m:
     outfile.write("%s\n")
 infile.close()
 outfile.close()


 File:
 200~02001491~05070
 200~02001777~05070
 200~02001995~05090
 200~02002609~05090
 200~02002789~05070
 200~012~02004169~0
 200~02004247~05090
 200~02008623~05090
 200~02010957~05090
 200~02 011479~05090
 200~0199~02001237~
 200~02011600~05090
 200~012~02 022305~0
 200~02023546~05090
 200~02025427~05090








Re: [Tutor] Regular Expression

2007-12-21 Thread Michael Langford
You need to pass a parameter to the string in the following line:

outfile.write("%s\n" % m.string[m.start():m.end()])

And you need to use patt.search, not patt.match, in the line where you
actually apply the expression to the string:

 m = patt.search(line)
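
A Python 3 sketch of the match/search difference, using one line from the
posted file. The pattern ~(\d+)~ is an assumption, since the highlighting
that marked the wanted field did not survive in the post. The last line shows
why the original [\d{10}] misbehaves: inside square brackets it is a
character class, not a repeat count.

```python
import re

line = '200~02001491~05070'          # one line from the posted file
patt = re.compile(r'~(\d+)~')        # hypothetical pattern: digits between tildes

anchored = patt.match(line)          # match() only tries position 0
anywhere = patt.search(line)         # search() scans the whole line

# [\d{10}] matches ONE character: a digit, '{', '1', '0' or '}'.
class_hits = re.findall(r'[\d{10}]', 'a1{}')

print(anchored)
print(anywhere.group(1))
print(class_hits)
```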

   --Michael

On 12/21/07, Que Prime [EMAIL PROTECTED] wrote:


 I need to pull the highligted data from a similar file and can't seem to get
 my script to work:

 Script:
 import re
 infile = open("filter.txt","r")
 outfile = open("out.txt","w")
 patt = re.compile(r"~02([\d{10}])")
 for line in infile:
   m = patt.match(line)
   if m:
     outfile.write("%s\n")
 infile.close()
 outfile.close()


 File:
 200~02001491~05070
 200~02001777~05070
 200~02001995~05090
 200~02002609~05090
 200~02002789~05070
 200~012~02004169~0
  200~02004247~05090
 200~02008623~05090
 200~02010957~05090
 200~02 011479~05090
 200~0199~02001237~
 200~02011600~05090
 200~012~02 022305~0
 200~02023546~05090
 200~02025427~05090






-- 
Michael Langford
Phone: 404-386-0495
Consulting: http://www.RowdyLabs.com


Re: [Tutor] Regular Expression help - parsing AppleScript Lists as Strings

2007-11-01 Thread Kent Johnson
Andrew Wu wrote:

pattern3 = '''
   ^{
   (
   %s
   | {%s}   # Possible to have 1 level of nested lists
   ,?)* # Items are comma-delimited, except for the last item
   }$
''' % (pattern2, pattern2)

The above doesn't allow comma after the first instance of pattern2 and 
it doesn't allow space after either instance. Here is a version that 
passes your tests:

pattern3 = '''
   ^{
   (
   (%s
   | {%s})   # Possible to have 1 level of nested lists
   ,?\s*)* # Items are comma-delimited, except for the last item
   }$
''' % (pattern2, pattern2)

You might want to look at doing this with pyparsing, I think it will 
make it easier to get the data out vs just recognizing the correct pattern.

Kent

PS Please post in plain text, not HTML.


Re: [Tutor] Regular Expression help - parsing AppleScript Lists as Strings

2007-11-01 Thread Kent Johnson
Kent Johnson wrote:
 You might want to look at doing this with pyparsing, I think it will 
 make it easier to get the data out vs just recognizing the correct pattern.

Here is a pyparsing version that correctly recognizes all of your 
patterns and returns a (possibly nested) Python list in case of a match.

Note that this version will parse lists that are nested arbitrarily 
deeply. If you don't want that you will have to define two kinds of 
lists, a singly-nested list and a non-nested list.

Kent

from pyparsing import *

List = Forward()
T = Literal('true').setParseAction( lambda s,l,t: [ True ] )
F = Literal('false').setParseAction( lambda s,l,t: [ False ] )
String = QuotedString('"')
Number = Word(nums).setParseAction( lambda s,l,t: [ int(t[0]) ] )
List << (Literal('{').suppress() +
         delimitedList(T|F|String|Number|Group(List)) +
         Literal('}').suppress())

def IsASList(s):
    # AppleScript lists are bracketed by curly braces with items separated
    # by commas
    # Each item is an alphanumeric label(?) or a string enclosed by
    # double quotes or a list itself
    # e.g. {2, True, "hello"}
    try:
        parsed = List.parseString(s)
        return parsed.asList()
    except Exception, e:
        return None

sample_strs = [
    '{}',  # Empty list
    '{a}', # Should not match
    '{a, b, c}', # Should not match
    '{"hello"}',
    '{"hello", "kitty"}',
    '{true}',
    '{false}',
    '{true, false}',
    '{9}',
    '{9,10, 11}',
    '{93214, true, false, "hello", "kitty"}',
    '{{1, 2, 3}}',  # This matches
    '{{1, 2, "cat"}, 1}',  # This matches

    # These don't match:
    '{{1,2,3},1,{4,5,6},2}',
    '{1, {2, 3, 4}, 3}',
    '{{1, 2, 3}, {4, 5, 6}, 1}',
    '{1, {1, 2, 3}}',  # Should match but doesn't
    '{93214, true, false, "hello", "kitty", {1, 2, 3}}',  # Should match
                                                          # but doesn't
    '{label: "hello", value: false, num: 2}',  # AppleScript dictionary
                                               # - should not match
]

for sample in sample_strs:
    result = IsASList(sample)
    print 'Is AppleScript List:  %s;   String:  %s' % (bool(result), sample)
    if result:
        print result
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Regular Expression help - parsing AppleScript Lists as Strings

2007-11-01 Thread Andrew Wu
Ah - thanks for the correction!  I missed the extra grouping and the
extra spacing ... doh!  Sorry about the HTML-formatted e-mail ...

Thanks also for the pyparsing variant as well - I didn't know the
module existed before!



Andrew

On 11/1/07, Kent Johnson [EMAIL PROTECTED] wrote:
 Andrew Wu wrote:

 pattern3 = '''
^{
(
%s
| {%s}   # Possible to have 1 level of nested lists
,?)* # Items are comma-delimited, except for the last item
}$
 ''' % (pattern2, pattern2)

 The above doesn't allow comma after the first instance of pattern2 and
 it doesn't allow space after either instance. Here is a version that
 passes your tests:

 pattern3 = '''
^{
(
(%s
| {%s})   # Possible to have 1 level of nested lists
,?\s*)* # Items are comma-delimited, except for the last item
}$
 ''' % (pattern2, pattern2)

 You might want to look at doing this with pyparsing, I think it will
 make it easier to get the data out vs just recognizing the correct pattern.

 Kent

 PS Please post in plain text, not HTML.



Re: [Tutor] Regular Expression help

2007-06-29 Thread Gardner, Dean
Thanks everyone for the replies all worked well, I adopted the string
splitting approach in favour of the regex one as it seemed to miss less
of the edge cases. I would like to thank everyone for their help once
again 
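
For readers wondering what a string-splitting approach might look like, here
is a minimal Python 3 sketch. It is not the code actually adopted; it assumes
the tag contains no spaces and that the last two fields (VR and VM) are
always single tokens, with everything in between being the description.

```python
line = '(0012,0042) Clinical Trial Subject Reading ID LO 1'

# First space separates the tag from the rest of the line.
tag, rest = line.split(' ', 1)
# The last two spaces separate the description from VR and VM.
desc, vr, vm = rest.rsplit(' ', 2)

row = ';'.join([tag, desc, vr, vm])
print(row)
```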




-Original Message-
From: Kent Johnson [mailto:[EMAIL PROTECTED] 
Sent: 27 June 2007 14:55
To: tutor@python.org; Gardner, Dean
Subject: Re: [Tutor] Regular Expression help

Gardner, Dean wrote:
 Hi
 
 I have a text file that I would like to split up so that I can use it 
 in Excel to filter a certain field. However as it is a flat text file 
 I need to do some processing on it so that Excel can correctly import
it.
 
 File Example:
 tag descVR  VM
 (0012,0042) Clinical Trial Subject Reading ID LO 1
 (0012,0050) Clinical Trial Time Point ID LO 1
 (0012,0051) Clinical Trial Time Point Description ST 1
 (0012,0060) Clinical Trial Coordinating Center Name LO 1
 (0018,0010) Contrast/Bolus Agent LO 1
 (0018,0012) Contrast/Bolus Agent Sequence SQ 1
 (0018,0014) Contrast/Bolus Administration Route Sequence SQ 1
 (0018,0015) Body Part Examined CS 1
 
 What I essentially want is to use python to process this file to give 
 me
 
 
 (0012,0042); Clinical Trial Subject Reading ID; LO; 1
 (0012,0050); Clinical Trial Time Point ID; LO; 1
 (0012,0051); Clinical Trial Time Point Description; ST; 1
 (0012,0060); Clinical Trial Coordinating Center Name; LO; 1
 (0018,0010); Contrast/Bolus Agent; LO; 1
 (0018,0012); Contrast/Bolus Agent Sequence; SQ ;1
 (0018,0014); Contrast/Bolus Administration Route Sequence; SQ; 1
 (0018,0015); Body Part Examined; CS; 1
 
 so that I can import to excel using a delimiter.
 
 This file is extremely long and all I essentially want to do is to 
 break it into it 'fields'
 
 Now I suspect that regular expressions are the way to go but I have 
 only basic experience of using these and I have no idea what I should
be doing.

This seems to work:

data = '''\
(0012,0042) Clinical Trial Subject Reading ID LO 1
(0012,0050) Clinical Trial Time Point ID LO 1
(0012,0051) Clinical Trial Time Point Description ST 1
(0012,0060) Clinical Trial Coordinating Center Name LO 1
(0018,0010) Contrast/Bolus Agent LO 1
(0018,0012) Contrast/Bolus Agent Sequence SQ 1
(0018,0014) Contrast/Bolus Administration Route Sequence SQ 1
(0018,0015) Body Part Examined CS 1'''.splitlines()

import re
fieldsRe = re.compile(r'^(\(\d+,\d+\)) (.*?) (\w+) (\d+)$')

for line in data:
    match = fieldsRe.match(line)
    if match:
        print ';'.join(match.group(1, 2, 3, 4))


I don't think you want the space after the ; that you put in your
example; Excel wants a single-character delimiter.

Kent




Re: [Tutor] Regular Expression help

2007-06-27 Thread Kent Johnson
Gardner, Dean wrote:
 Hi
 
 I have a text file that I would like to split up so that I can use it in 
 Excel to filter a certain field. However as it is a flat text file I 
 need to do some processing on it so that Excel can correctly import it.
 
 File Example:
 tag descVR  VM
 (0012,0042) Clinical Trial Subject Reading ID LO 1
 (0012,0050) Clinical Trial Time Point ID LO 1
 (0012,0051) Clinical Trial Time Point Description ST 1
 (0012,0060) Clinical Trial Coordinating Center Name LO 1
 (0018,0010) Contrast/Bolus Agent LO 1
 (0018,0012) Contrast/Bolus Agent Sequence SQ 1
 (0018,0014) Contrast/Bolus Administration Route Sequence SQ 1
 (0018,0015) Body Part Examined CS 1
 
 What I essentially want is to use python to process this file to give me
 
 
 (0012,0042); Clinical Trial Subject Reading ID; LO; 1
 (0012,0050); Clinical Trial Time Point ID; LO; 1
 (0012,0051); Clinical Trial Time Point Description; ST; 1
 (0012,0060); Clinical Trial Coordinating Center Name; LO; 1
 (0018,0010); Contrast/Bolus Agent; LO; 1
 (0018,0012); Contrast/Bolus Agent Sequence; SQ ;1
 (0018,0014); Contrast/Bolus Administration Route Sequence; SQ; 1
 (0018,0015); Body Part Examined; CS; 1
 
 so that I can import to excel using a delimiter.
 
 This file is extremely long and all I essentially want to do is to break 
 it into it 'fields'
 
 Now I suspect that regular expressions are the way to go but I have only 
 basic experience of using these and I have no idea what I should be doing.

This seems to work:

data = '''\
(0012,0042) Clinical Trial Subject Reading ID LO 1
(0012,0050) Clinical Trial Time Point ID LO 1
(0012,0051) Clinical Trial Time Point Description ST 1
(0012,0060) Clinical Trial Coordinating Center Name LO 1
(0018,0010) Contrast/Bolus Agent LO 1
(0018,0012) Contrast/Bolus Agent Sequence SQ 1
(0018,0014) Contrast/Bolus Administration Route Sequence SQ 1
(0018,0015) Body Part Examined CS 1'''.splitlines()

import re
fieldsRe = re.compile(r'^(\(\d+,\d+\)) (.*?) (\w+) (\d+)$')

for line in data:
    match = fieldsRe.match(line)
    if match:
        print(';'.join(match.group(1, 2, 3, 4)))


I don't think you want the space after the ; that you put in your 
example; Excel wants a single-character delimiter.
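
As a rough sketch of the same idea in Python 3, the csv module can take care of writing the single-character delimiter. The names FIELDS_RE, split_fields and to_delimited are illustrative only and do not come from the original post; the sample lines are copied from the example above.

```python
import csv
import io
import re

# Same pattern as above: tag, description, VR, VM.
FIELDS_RE = re.compile(r'^(\(\d+,\d+\)) (.*?) (\w+) (\d+)$')

def split_fields(lines):
    """Yield a (tag, desc, vr, vm) tuple for every line that matches."""
    for line in lines:
        match = FIELDS_RE.match(line.strip())
        if match:
            yield match.groups()

def to_delimited(lines, delimiter=';'):
    """Return the matching lines as delimiter-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator='\n')
    writer.writerows(split_fields(lines))
    return out.getvalue()

sample = [
    'tag desc VR VM',  # header line, does not match and is skipped
    '(0012,0042) Clinical Trial Subject Reading ID LO 1',
    '(0018,0010) Contrast/Bolus Agent LO 1',
]
print(to_delimited(sample))
```

For a real file, passing the open file object to to_delimited in place of the sample list should work the same way, since it only iterates over lines.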

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Regular Expression help

2007-06-27 Thread Mike Hansen
Argh... My e-mail program really messed up the threads. I didn't notice
that there was already multiple replies to this message.

Doh!

Mike



Re: [Tutor] Regular expression questions

2007-05-03 Thread Kent Johnson
Bernard Lebel wrote:
 Hello,
 
 Once again struggling with regular expressions.
 
 I have a string that look like something_shp1.
 I want to replace _shp1 by _shp. I'm never sure if it's going to
 be 1, if there's going to be a number after _shp.
 
 So I'm trying to use regular expression to perform this replacement.
 But I just can't seem to get a match! I always get a None match.
 
 I would think that this would have done the job:
 
 r = re.compile( r"(_shp\d)$" )
 
 The only way I have found to get a match, is using
 
 r = re.compile( r"(\S+_shp\d)$" )

My guess is you are calling r.match() rather than r.search(). r.match() 
only looks for matches at the start of the string; r.search() will find 
a match anywhere.

 My second question is related more to the actual string replacement.
 Using regular expressions, what would be the way to go? I have tried
 the following:
 
 newstring = r.sub( '_shp', oldstring )
 
 But the new string is always _shp instead of something_shp.

Because your re matches the whole string something_shp1, so all of it is
replaced.

I think
newstring = re.sub(r'_shp\d', '_shp', oldstring)
will do what you want.
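
A small sketch of both points, assuming Python 3. The names suffix_re and strip_shp_number are made up for illustration, and the pattern uses \d* rather than \d as a variation, so that a name with no number after _shp is left unchanged.

```python
import re

old = "something_shp1"

# Hypothetical pattern: "_shp" plus any trailing digits at the end of the string.
suffix_re = re.compile(r"_shp\d*$")

# match() only tries at the start of the string, so nothing is found.
at_start = suffix_re.match(old)

# search() scans forward and finds the suffix.
anywhere = suffix_re.search(old)

def strip_shp_number(name):
    """Replace only the matched suffix, keeping the rest of the name."""
    return suffix_re.sub("_shp", name)

print(at_start)
print(anywhere.group())
print(strip_shp_number(old))
```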

Kent


Re: [Tutor] Regular expression questions

2007-05-03 Thread Bernard Lebel
Thanks a lot Kent, that indeed solves the issues altogether.


Cheers
Bernard




On 5/3/07, Kent Johnson [EMAIL PROTECTED] wrote:
 Bernard Lebel wrote:
  Hello,
 
  Once again struggling with regular expressions.
 
  I have a string that look like something_shp1.
  I want to replace _shp1 by _shp. I'm never sure if it's going to
  be 1, if there's going to be a number after _shp.
 
  So I'm trying to use regular expression to perform this replacement.
  But I just can't seem to get a match! I always get a None match.
 
  I would think that this would have done the job:
 
  r = re.compile( r"(_shp\d)$" )
 
  The only way I have found to get a match, is using
 
  r = re.compile( r"(\S+_shp\d)$" )

 My guess is you are calling r.match() rather than r.search(). r.match()
 only looks for matches at the start of the string; r.search() will find
 a match anywhere.

  My second question is related more to the actual string replacement.
  Using regular expressions, what would be the way to go? I have tried
  the following:
 
  newstring = r.sub( '_shp', oldstring )
 
  But the new string is always _shp instead of something_shp.

 Because your re matches something_shp.

 I think
 newstring = re.sub(r'_shp\d', '_shp', oldstring)
 will do what you want.

 Kent



Re: [Tutor] regular expression

2006-08-03 Thread Kent Johnson
arbaro arbaro wrote:
 Hello,

 I'm trying to mount an usb device from python under linux.
 To do so, I read the kernel log /proc/kmsg and watch for something like:
   6 /dev/scsi/host3/bus0/target0/lun0/:7usb-storage: device scan 
 complete

 When I compile a regular expression like:
   r = re.compile('\d+\s/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
You should use raw strings for regular expressions that contain \ 
characters.
 It is found. But I don't want the \d+\s or '6 ' in front of the 
 path, so I tried:
r = re.compile('/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
 But this way the usb device path it is not found.
You are using re.match(), which just looks for a match at the start of 
the string. Try using re.search() instead.
http://docs.python.org/lib/matching-searching.html
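
A minimal sketch of the search() approach with a raw string, assuming Python 3. DEVICE_RE and find_usb_device are illustrative names, and the sample line is the one quoted from the kernel log above.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
DEVICE_RE = re.compile(r'/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')

def find_usb_device(line):
    """Return the device path found anywhere in a usb-storage line, else None."""
    if 'usb-storage' not in line:
        return None
    m = DEVICE_RE.search(line)  # search() scans the whole line
    return m.group() if m else None

sample = '6 /dev/scsi/host3/bus0/target0/lun0/:7usb-storage: device scan complete'

m = DEVICE_RE.search(sample)
print(find_usb_device(sample))
print(sample[m.start():m.end()])  # start() and end() give the match position
```

Appending /disc or /part1 to the returned path is then a plain string concatenation.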

Kent

 So what i'm trying to do is:
 - find the usb device path from the kernel log with the regular 
 expression.
 - Determine the start and end positions of the match (and add /disc or 
 /part1 to the match).
 - And use that to mount the usb stick on /mnt/usb - mount -t auto 
 match /mnt/usb

 If anyone can see what i'm doing wrong, please tell me, because I 
 don't understand it anymore.
 Thanks.

 Below is the code:

 # \d+ = 1 or more digits
 # \s  = an empty space

 import re

 def findusbdevice():
     ''' Returns path of usb device '''
     # I did a 'cat /proc/kmsg /log/kmsg' to be able to read the kernel message.
     # Somehow I can't read /proc/kmsg directly.
     kmsg = open('/log/kmsg', 'r')
     r = re.compile('/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
     #r = re.compile('\d+\s/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
     for line in kmsg:
         if 'usb-storage' in line and r.match(line):
             print 'Success', line
 



Re: [Tutor] regular expression

2006-08-03 Thread arbaro arbaro
Hello,

I'm just answering my own email, since I just found out what my error was.

From a regular expression howto: 
http://www.amk.ca/python/howto/regex/regex.html

The match() function only checks if the RE matches at the beginning of
the string while search() will scan forward through the string for a
match. It's important to keep this distinction in mind.  Remember,
match() will only report a successful match which will start at 0; if
the match wouldn't start at zero,  match() will not report it.

That was exactly my problem. Replacing r.match(line) with
r.search(line) solved it.

Sorry for having bothered you prematurely.




On 8/3/06, arbaro arbaro [EMAIL PROTECTED] wrote:

 Hello,

 I'm trying to mount an usb device from python under linux.
 To do so, I read the kernel log /proc/kmsg and watch for something like:
   6 /dev/scsi/host3/bus0/target0/lun0/:7usb-storage: device scan 
 complete

 When I compile a regular expression like:
   r = re.compile('\d+\s/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
 It is found. But I don't want the \d+\s or '6 ' in front of the path, so 
 I tried:
r = re.compile('/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
 But this way the usb device path it is not found.

 So what i'm trying to do is:
 - find the usb device path from the kernel log with the regular expression.
 - Determine the start and end positions of the match (and add /disc or /part1 
 to the match).
 - And use that to mount the usb stick on /mnt/usb - mount -t auto match 
 /mnt/usb

 If anyone can see what i'm doing wrong, please tell me, because I don't 
 understand it anymore.
 Thanks.

 Below is the code:

 # \d+ = 1 or more digits
  # \s  = an empty space

 import re

 def findusbdevice():
     ''' Returns path of usb device '''
     # I did a 'cat /proc/kmsg /log/kmsg' to be able to read the kernel message.
     # Somehow I can't read /proc/kmsg directly.
     kmsg = open('/log/kmsg', 'r')
     r = re.compile('/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
     #r = re.compile('\d+\s/dev/scsi/host\d+/bus\d+/target\d+/lun\d+')
     for line in kmsg:
         if 'usb-storage' in line and r.match(line):
             print 'Success', line



Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread John Fouhy
On 14/07/06, Steve Nelson [EMAIL PROTECTED] wrote:
 What I don't understand is how in the end the RE *does* actually match
 - which may indicate a serious misunderstanding on my part.

 >>> re.match("a[bcd]*b", "abcbd")
 <_sre.SRE_Match object at 0x186b7b10>

 I don't see how "abcbd" matches! It ends with a d and the RE states it
 should end with a b!

 What am I missing?

It doesn't have to match the _whole_ string.

[bcd]* will match, amongst other things, the empty string (ie: 0
repetitions of either a b, a c, or a d).  So "a[bcd]*b" will match
"ab", which is in the string "abcbd".

It will also match "abcb", which is the longest match, and thus
probably the one it found.

If you look at the match object returned, you should see that the match
starts at position 0 and is four characters long.

Now, if you asked for "a[bcd]*b$", that would be a different matter!

HTH :-)

-- 
John.


Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread Steve Nelson
On 7/14/06, John Fouhy [EMAIL PROTECTED] wrote:

 It doesn't have to match the _whole_ string.

Ah right - yes, so it doesn't say that it has to end with a b - as per
your comment about ending with $.

 If you look at the match object returned, you should see that the match
 starts at position 0 and is four characters long.

How does one query a match object in this way?  I am learning by
fiddling interactively.

 John.

S.


Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread Kent Johnson
Steve Nelson wrote:
 On 7/14/06, John Fouhy [EMAIL PROTECTED] wrote:

   
 It doesn't have to match the _whole_ string.
 

 Ah right - yes, so it doesn't say that it has to end with a b - as per
 your comment about ending with $.
   
The matched portion must end with b, but it doesn't have to coincide 
with the end of the string. The whole regex must be used for it to 
match; the whole string does not have to be used - the matched portion 
can be a substring.
   
 If you look at the match object returned, you should see that the match
 starts at position 0 and is four characters long.
 

 How does one query a match object in this way?  I am learning by
 fiddling interactively.
The docs for match objects are here:
http://docs.python.org/lib/match-objects.html

match.start() and match.end() will tell you where it matched.

You might like to try the regex demo that comes with Python; on Windows 
it is installed at C:\Python24\Tools\Scripts\redemo.py. It gives you an 
easy way to experiment with regexes.
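
For example, a short sketch (assuming Python 3) of querying the match object from this thread. The variable names m and anchored are arbitrary.

```python
import re

m = re.match(r"a[bcd]*b", "abcbd")

# The matched substring and where it sits in the original string.
print(m.group())           # the text that matched
print(m.start(), m.end())  # start index and one-past-the-end index
print(m.span())            # the same two numbers as a tuple

# Anchoring with $ forces the match to reach the end of the string.
anchored = re.match(r"a[bcd]*b$", "abcbd")
print(anchored)
```

Here the match covers "abcb", starting at 0 and ending at 4, while the anchored version finds nothing because the string ends in d.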

Kent



Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread John Fouhy
On 14/07/06, Steve Nelson [EMAIL PROTECTED] wrote:
 How does one query a match object in this way?  I am learning by
 fiddling interactively.

If you're fiddling interactively, try the dir() command --

ie:

 m = re.match(...)
 dir(m)

It will tell you what attributes the match object has.

Or you can read the documentation --- a combination of both approaches
usually works quite well :-)

-- 
John.


Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread Steve Nelson
On 7/14/06, John Fouhy [EMAIL PROTECTED] wrote:

  m = re.match(...)
  dir(m)

 It will tell you what attributes the match object has.

Useful - thank you.

I am now confused on this:

I have a file full of lines beginning with the letter b.  I want a
RE that will return the whole line if it begins with b.

I find if I do eg:

 >>> m = re.search("^b", "b spam spam spam")
 >>> m.group()
'b'

How do I get it to return the whole line if it begins with a b?

S.


Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread Luke Paireepinart

 I have a file full of lines beginning with the letter b.  I want a
 RE that will return the whole line if it begins with b.

 I find if I do eg:

   
 m = re.search(^b, b spam spam spam)
 m.group()
 
 'b'

 How do I get it to return the whole line if it begins with a b?

 S.
for line in file:
    if line.strip()[:1] == 'b':
        print(line)

or
print([a for a in file if a.strip()[:1] == 'b'])
if you want to use a list comprehension.
As for the RE way, I've no idea.


Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread Kent Johnson
Steve Nelson wrote:
 On 7/14/06, John Fouhy [EMAIL PROTECTED] wrote:

   
 m = re.match(...)
 dir(m)
   
 It will tell you what attributes the match object has.
 

 Useful - thank you.

 I am now confuse on this:

 I have a file full of lines beginning with the letter b.  I want a
 RE that will return the whole line if it begins with b.

 I find if I do eg:

   
 m = re.search(^b, b spam spam spam)
 m.group()
 
 'b'

 How do I get it to return the whole line if it begins with a b?
Use the match object in a test. If the search fails it will return None 
which tests false:

for line in lines:
    m = re.search(...)
    if m:
        # do something with line that matches

But for this particular application you might as well use 
line.startswith('b') instead of a regex.
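
A quick sketch of both routes, assuming Python 3. The sample lines and the name b_line are invented for illustration. Adding .* to the pattern makes m.group() return the rest of the line as well as the leading b.

```python
import re

lines = ["b spam spam spam", "a ham", "bacon and eggs"]

# "^b.*" matches a leading b and then everything up to the end of the line.
b_line = re.compile(r"^b.*")

with_regex = []
for line in lines:
    m = b_line.search(line)
    if m:  # None tests false, so non-matching lines are skipped
        with_regex.append(m.group())

# The plain string method gives the same result here.
with_startswith = [line for line in lines if line.startswith("b")]

print(with_regex)
print(with_startswith)
```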

Kent



Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread Steve Nelson
On 7/14/06, Kent Johnson [EMAIL PROTECTED] wrote:

 But for this particular application you might as well use
 line.startswith('b') instead of a regex.

Ah yes, that makes sense.

Incidentally continuing my reading of the HOWTO I have sat and puzzled
for about 30 mins on the difference the MULTILINE flag makes.  I can't
quite see the difference.  I *think* it is as follows:

Under normal circumstances, ^ matches the start of a line, only.  On a
line by line basis.

With the re.M flag, we get a match after *any* newline?

Similarly with $ - under normal circumstances, $ matches the end of
the string, or that which precedes a newline.

With the MULTILINE flag, $ matches before *any* newline?

Is this correct?

 Kent

S.


Re: [Tutor] Regular Expression Misunderstanding

2006-07-14 Thread Kent Johnson
Steve Nelson wrote:
 Incidentally continuing my reading of the HOWTO I have sat and puzzled
 for about 30 mins on the difference the MULTILINE flag makes.  I can't
 quite see the difference.  I *think* it is as follows:

 Under normal circumstances, ^ matches the start of a line, only.  On a
 line by line basis.

 With the re.M flag, we get a match after *any* newline?

 Similarly with $ - under normal circumstances, $ matches the end of
 the string, or that which precedes a newline.

 With the MULTILINE flag, $ matches before *any* newline?

 Is this correct?
I'm not sure, I think you are a little confused. MULTILINE only matters 
if the string you are matching contains newlines. Without MULTILINE, ^ 
will match only at the beginning of the string. With it, ^ will match 
after any newline. For example,
In [1]: import re

A string  containing two lines:
In [2]: s='one\ntwo'

The first line matches without MULTILINE:
In [3]: re.search('^one', s)
Out[3]: <_sre.SRE_Match object at 0x00C3E640>

The second one does not (result of the search is None so nothing prints):
In [4]: re.search('^two', s)

With MULTILINE ^two will match:
In [5]: re.search('^two', s, re.MULTILINE)
Out[5]: <_sre.SRE_Match object at 0x00E901E0>
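
The $ side of the question works the same way. As a sketch (Python 3, with a sample string chosen here to end in a newline): without MULTILINE, $ matches only at the very end of the string or just before a final newline; with it, $ also matches before every newline.

```python
import re

s = "one\ntwo\n"

# ^ without and with MULTILINE
starts_default = re.findall(r"^\w+", s)
starts_multiline = re.findall(r"^\w+", s, re.MULTILINE)

# $ without and with MULTILINE
ends_default = re.findall(r"\w+$", s)
ends_multiline = re.findall(r"\w+$", s, re.MULTILINE)

print(starts_default, starts_multiline)
print(ends_default, ends_multiline)
```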

Kent



Re: [Tutor] regular expression matching a dot?

2005-10-20 Thread Frank Bloeink
Hi [Christian|List]

This post is not regarding your special problem (which anyway has been
solved by now), but I'd like to share some general tip on working with
regular expressions.
There are some nice regex-debuggers out there that can help clarify
what went wrong when a regex doesn't match when it should or vice versa.

Kodos http://kodos.sourceforge.net/ is one of them, but there are many
others that can make your life easier; at least in terms of
regex-debugging ;)

Probably most of you (especially all regex-gurus) know about this
already, but I thought it was worth the post as a hint for all beginners.

hth Frank

On Wed, 2005-10-19 at 09:45 +0200, Christian Meesters wrote:
 Hi
 
 I've got the problem that I need to find a certain group of file names 
 within a lot of different file names. Those I want to match with a 
 regular expression are a bit peculiar since they all look like:
 ...
 ...(regex-problem)
 ...
 Any ideas what I could do else?
 TIA
 Christian
 
 PS Hope that I described the problem well enough ...





Re: [Tutor] regular expression matching a dot?

2005-10-20 Thread Hugo González Monteverde
I personally fancy Kiki, that comes with the WxPython installer... It 
has very nice coloring for grouping in Regexes.

Hugo

Kent Johnson wrote:
 Frank Bloeink wrote:
 
There are some nice regex-debuggers out there that can help clarify
what went wrong when a regex doesn't match when it should or vice versa.

Kodos http://kodos.sourceforge.net/ is one of them, but there are many
others that can make your life easier ; at least in terms of
regex-debugging ;) 
 
 
 Yes, these programs can be very helpful. There is even one that ships with 
 Python - see
 C:\Python24\Tools\Scripts\redemo.py
 
 Kent
 


Re: [Tutor] regular expression matching a dot?

2005-10-19 Thread Kent Johnson
Christian Meesters wrote:
 Hi
 
 I've got the problem that I need to find a certain group of file names 
 within a lot of different file names. Those I want to match with a 
 regular expression are a bit peculiar since they all look like:
 07SS.INF , 10SE.INF, 13SS.INF, 02BS.INF, 05SS.INF.
 Unfortunately there are similar file names that shouldn't be matched, 
 like:
 01BE.INF, 02BS.INF
 Any other extension than 'INF' should also be skipped. (There are names 
 like 07SS.E00, which I don't want to see matched.)
 So I tried the following pattern (using re):
 \d+[SS|SE]\.INF - as there should be at least one digit, the group 'SE' 
 or 'SS' followed by a dot and the extension 'INF'.

Use parentheses () for grouping. Brackets [] define a group of characters. Your 
re says to match
  \d+ one or more digits
  [SS|SE] exactly one of the characters S, |, E
  \.INF literal .INF

Since there are TWO characters S or E, nothing matches. Change the [] to () and 
it works.
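
A small sketch of the difference, assuming Python 3. The file names come from the question; the names wrong and right, and the trailing $ on the corrected pattern, are additions for illustration.

```python
import re

names = ["07SS.INF", "10SE.INF", "13SS.INF", "01BE.INF", "02BS.INF", "07SS.E00"]

# Brackets: one character out of S, | or E, so "SS" can never fit.
wrong = re.compile(r"\d+[SS|SE]\.INF")

# Parentheses: the two-character alternatives SS or SE.
right = re.compile(r"\d+(SS|SE)\.INF$")

wrong_hits = [n for n in names if wrong.match(n)]
right_hits = [n for n in names if right.match(n)]

print(wrong_hits)
print(right_hits)
```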

Kent



Re: [Tutor] regular expression matching a dot?

2005-10-19 Thread Misto .
[ Workaround ]
What about using the glob module?

http://docs.python.org/lib/module-glob.html

you can use something like
glob.glob('./[0-9][0-9]S[E|S].INF')
(Not tested)
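
One way to try the glob idea without touching the filesystem is the fnmatch module, which implements the same shell-style pattern syntax. This is only a sketch, assuming Python 3; the list of names is taken from the question. Note that in these patterns, as in regular expressions, a | inside brackets is an ordinary character, so [ES] is enough.

```python
import fnmatch

names = ["07SS.INF", "10SE.INF", "13SS.INF", "01BE.INF", "02BS.INF", "07SS.E00"]

# Two digits, an S, then S or E, then the literal extension.
pattern = "[0-9][0-9]S[ES].INF"

# fnmatchcase() compares case-sensitively on every platform.
matches = [n for n in names if fnmatch.fnmatchcase(n, pattern)]
print(matches)

# With [E|S] the bracket set also accepts a literal "|".
print(fnmatch.fnmatchcase("07S|.INF", "[0-9][0-9]S[E|S].INF"))
```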


Misto


Re: [Tutor] regular expression matching a dot?

2005-10-19 Thread Christian Meesters
Actually, your answer did help to open my eyes. The expression is 
\d+S[S|E]\.INF: Ouch!

Thanks a lot,
Christian

On 19 Oct 2005, at 12:11, Misto . wrote:

 [ Workaround ]
 What about using the glob module?

 http://docs.python.org/lib/module-glob.html

 you can use something like
 glob.glob('./[0-9][0-9]S[E|S].INF')
 (Not tested)


 Misto




Re: [Tutor] regular expression matching a dot?

2005-10-19 Thread Kent Johnson
Christian Meesters wrote:
 Actually, your answer did help to open my eyes. The expression is 
 \d+S[S|E]\.INF: Ouch!

That will work, but what you really mean is one of these:
\d+S[SE]\.INF
\d+S(S|E)\.INF

Your regex will match 0S|.INF
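
To see the stray match concretely, here is a short sketch in Python 3. The names loose and strict are illustrative.

```python
import re

loose = re.compile(r"\d+S[S|E]\.INF")   # | is just another character in the set
strict = re.compile(r"\d+S[SE]\.INF")   # only S or E allowed

for name in ["07SS.INF", "10SE.INF", "0S|.INF"]:
    print(name, bool(loose.match(name)), bool(strict.match(name)))
```

Both patterns accept the two real file names; only the loose one accepts the name containing |.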

Kent



Re: [Tutor] regular expression matching a dot?

2005-10-19 Thread Christian Meesters
Thanks, corrected. I was happy now - and then too fast ;-).

Cheers
Christian
On 19 Oct 2005, at 13:50, Kent Johnson wrote:

 Christian Meesters wrote:
 Actually, your answer did help to open my eyes. The expression is 
 \d+S[S|E]\.INF: Ouch!

 That will work, but what you really mean is one of these:
 \d+S[SE]\.INF
 \d+S(S|E)\.INF

 Your regex will match 0S|.INF

 Kent




Re: [Tutor] Regular expression error

2005-07-28 Thread Kent Johnson
Bernard Lebel wrote:
 Hello,
 
 I'm using regular expressions to check if arguments of a function are
 valid. When I run the code below, I get an error.
 
 Basically, what the regular expression should expect is either an
 integer, or an integer followed by a letter (string). I convert the
 tested argument to a string to make sure. Below is the code.
 
 
 
 # Create regular expression to match
 oPattern = re.compile( r"(\d+|\d+\D)", re.IGNORECASE )

This will match a string of digits followed by any non-digit, is that what you 
want? If you want to restrict it to digits followed by a letter you should use
r"(\d+|\d+[a-z])"

Also this will match something like 123A456B, if you want to disallow anything 
after the letter you need to match the end of the string:
r"(\d+|\d+[a-z])$"

 
 # Iterate provided arguments
 for oArg in aArgs:
   
   # Attempt to match the argument to the regular expression
   oMatch = re.match( str( oArg ), 0 )

The problem is you are calling the module (re) match, not the instance 
(oPattern) match. re.match() expects the second argument to be a string. Just 
use
  oMatch = oPattern.match( str( oArg ), 0 )

The hint in the error is "expected string or buffer". So you are not giving the 
expected argument types, which should send you to the docs to check...
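
Putting the pieces together, a minimal sketch in Python 3. ARG_RE and is_valid_arg are hypothetical names, not from the original script.

```python
import re

# Digits, optionally followed by a single letter, and nothing after that.
ARG_RE = re.compile(r"(\d+|\d+[a-z])$", re.IGNORECASE)

def is_valid_arg(arg):
    """Check one argument using the compiled pattern's own match() method."""
    return ARG_RE.match(str(arg)) is not None

for arg in [12, "12a", "12A", "123A456B", "abc"]:
    print(arg, is_valid_arg(arg))
```

Calling the method on the compiled pattern means the only argument to pass is the string being tested, which avoids the mix-up with the module-level re.match(pattern, string).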

Kent
 
 
 
 The error I get is this:
 
 #ERROR : Traceback (most recent call last):
 #  File Script Block , line 208, in BuildProjectPaths_Execute
 #aShotInfos = handleArguments( args )
 #  File Script Block , line 123, in handleArguments
 #oMatch = re.match( str( oArg ), 0 )
 #  File D:\Python24\Lib\sre.py, line 129, in match
 #return _compile(pattern, flags).match(string)
 #TypeError: expected string or buffer
 # - [line 122]



Re: [Tutor] regular expression question

2005-04-07 Thread Kent Johnson
D Elliott wrote:
I wonder if anyone can help me with an RE. I also wonder if there is an 
RE mailing list anywhere - I haven't managed to find one.

I'm trying to use this regular expression to delete particular strings 
from a file before tokenising it.

I want to delete all strings that have a full stop (period) when it is 
not at the beginning or end of a word, and also when it is not followed 
by a closing bracket. I want to delete file names (eg. fileX.doc), and 
websites (when www/http not given) but not file extensions (eg. this is 
in .jpg format). I also don't want to delete the last word of each 
sentence just because it precedes a fullstop, or if there's a fullstop 
followed by a closing bracket.

fullstopRe = re.compile (r'\S+\.[^)}]]+')
There are two problems with this:
- The ] inside the [] group must be escaped like this: [^)}\]]
- [^)}\]] matches any whitespace so it will match on the ends of words
It's not clear from your description if the closing bracket must immediately follow the full stop or 
if it can be anywhere after it. If you want it to follow immediately then use
\S+\.[^)}\]\s]\S*

If you want to allow the bracket anywhere after the stop you must force the match to go to a word 
boundary otherwise you will match foo.bar when the word is foo.bar]. I think this works:
(\S+\.[^)}\]\s]+)(\s)

but you have to include the second group in your substitution string.
BTW C:\Python23\pythonw.exe C:\Python24\Tools\Scripts\redemo.py is very helpful with questions like 
this...
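
As a rough check of the first suggestion, here is a sketch in Python 3. The sample sentence and the name FULLSTOP_RE are invented for illustration.

```python
import re

# A full stop inside a word: not followed by whitespace or a closing bracket.
FULLSTOP_RE = re.compile(r'\S+\.[^)}\]\s]\S*')

text = "See fileX.doc for details. It is in .jpg format (see notes.)"
cleaned = FULLSTOP_RE.sub('', text)
print(cleaned)
```

The file name is deleted, while the sentence-final word, the bare .jpg extension and the word before the closing bracket are all left alone. The deletion leaves a double space behind, which a tokeniser would normally ignore.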

Kent


Re: [Tutor] regular expression question

2005-03-09 Thread Mike Hall
But I only want to ignore B if A is a match. If A is not a match, 
I'd like it to advance on to B.

On Mar 9, 2005, at 12:07 PM, Marcos Mendonça wrote:
Hi
Not a regexp expert. But it seems to me that if you want to ignore
B then it should be
(A) | (^B)
Hope it helps!
On Wed, 9 Mar 2005 11:11:57 -0800, Mike Hall
[EMAIL PROTECTED] wrote:
I'm having some strange results using the or operator.  In every 
test
I do I'm matching both sides of the | metacharacter, not one or the
other as all documentation says it should be (the parser supposedly
scans left to right, using the first match it finds and ignoring the
rest). It should only go beyond the | if there was no match found
before it, no?

Correct me if I'm wrong, but your regex is saying match dog, unless
it's followed by cat. if it is followed by cat there is no match on
this side of the | at which point we advance past it and look at the
alternative expression which says to match in front of cat.
However, if I run a .sub using your regex on a string contain both dog
and cat, both will be replaced.
A simple example will show what I mean:
import re
x = re.compile(r"(A) | (B)")
s = "X R A Y B E"
r = x.sub("13", s)
print r
X R 13Y13 E
...so unless I'm understanding it wrong, B is supposed to be ignored
if A is matched, yet I get both matched.  I get the same result if I
put A and B within the same group.
On Mar 8, 2005, at 6:47 PM, Danny Yoo wrote:

Regular expressions are a little evil at times; here's what I think
you're
thinking of:
###
import re
pattern = re.compile(r"""dog(?!cat)
...| (?<=dogcat)""", re.VERBOSE)
pattern.match('dogman').start()
0
pattern.search('dogcatcher').start()

Hi Mike,
Gaaah, bad copy-and-paste.  The example with 'dogcatcher' actually 
does
come up with a result:

###
pattern.search('dogcatcher').start()
6
###
Sorry about that!


Re: [Tutor] regular expression question

2005-03-09 Thread Liam Clarke
Actually, you should get that anyway...


|
Alternation, or the ``or'' operator. If A and B are regular
expressions, A|B will match any string that matches either A or B.
| has very low precedence in order to make it work reasonably when
you're alternating multi-character strings. Crow|Servo will match
either Crow or Servo, not Cro, a w or an S, and ervo.


So, for each letter in that string, it's checking to see if any letter
matches 'A' or 'B' ...
the engine steps through one character at a time.
sorta like - 

for letter in s:
    if letter == 'A':
        pass  # do some string stuff
    elif letter == 'B':
        pass  # do some string stuff


i.e. 

k = ['A','B', 'C', 'B']

for i in range(len(k)):
    if k[i] == 'A' or k[i] == 'B':
        k[i] = 13

print k

[13, 13, 'C', 13]

You can limit substitutions using an optional argument, but yeah, it
seems you're expecting it to examine the string as a whole.
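
To make the scanning order concrete, here is a small sketch (the test strings are made up for illustration). The engine tries each starting position from left to right, and only at a given position does it try the alternatives in order, so the left side of | does not get priority over the whole string.

```python
import re

# The earliest position in the string wins, even though 'A' is listed first.
first = re.search(r"A|B", "X B A")
print(first.group(), first.start())

# findall/sub keep going after the first hit, so both sides get matched.
everything = re.findall(r"A|B", "X R A Y B E")
print(everything)

# At the same starting position, the left alternative is tried first.
same_spot = re.search(r"ab|abc", "abc")
print(same_spot.group())
```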


Check out the example here - 
http://www.amk.ca/python/howto/regex/regex.html#SECTION00032

Also

http://www.regular-expressions.info/alternation.html

Regards, 

Liam Clarke


On Thu, 10 Mar 2005 09:09:13 +1300, Liam Clarke [EMAIL PROTECTED] wrote:
 Hi Mike,
 
 Do you get the same results for a search pattern of 'A|B'?
 
 


-- 
'There is only one basic human right, and that is to do as you damn well please.
And with it comes the only basic human duty, to take the consequences.


Re: [Tutor] regular expression question

2005-03-09 Thread Mike Hall
but yeah, it
seems you're expecting it to examine the string as a whole.
I guess I was, good point.



Re: [Tutor] regular expression question

2005-03-09 Thread Kent Johnson
Mike Hall wrote:
A simple example will show what I mean:
  import re
  x = re.compile(r"(A) | (B)")
  s = "X R A Y B E"
  r = x.sub("13", s)
  print r
X R 13Y13 E
...so unless I'm understanding it wrong, B is supposed to be ignored 
if A is matched, yet I get both matched.  I get the same result if I 
put A and B within the same group.
The problem is with your use of sub(), not with |.
By default, re.sub() substitutes *all* matches. If you just want to substitute the first match, 
include  the optional count parameter:

>>> import re
>>> s = "X R A Y B E"
>>> re.sub(r"(A) | (B)", '13', s)
'X R 13Y13 E'
>>> re.sub(r"(A) | (B)", '13', s, 1)
'X R 13Y B E'
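
The same thing as a small script, for anyone who wants to run it outside the interpreter. This is a sketch in current Python; the names spaced and verbose are made up here. It also shows a side effect of the pattern as written: without re.VERBOSE the spaces around | are literal, which is why the space after A and the space before B disappear.

```python
import re

s = "X R A Y B E"

# Spaces in the pattern are literal: this matches "A " or " B".
spaced = re.compile(r"(A) | (B)")
all_replaced = spaced.sub("13", s)
first_only = spaced.sub("13", s, count=1)

# With re.VERBOSE the whitespace in the pattern is ignored,
# so this matches just "A" or "B".
verbose = re.compile(r"(A) | (B)", re.VERBOSE)
verbose_replaced = verbose.sub("13", s)

print(all_replaced)
print(first_only)
print(verbose_replaced)
```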
BTW, there is a very handy interactive regex tester that comes with Python. On Windows, it is 
installed at
C:\Python23\Tools\Scripts\redemo.py

Kent


Re: [Tutor] regular expression question

2005-03-08 Thread Mike Hall
First, thanks for the response. Using your re:
my_re = re.compile(r'(dog)(cat)?')
...I seem to simply be matching the pattern Dog.  Example:
 str1 = "The dog chased the car"
 str2 = "The dog cat parade was under way"
 x1 = re.compile(r'(dog)(cat)?')
 rep1 = x1.sub("REPLACE", str1)
 rep2 = x1.sub("REPLACE", str2)
 print rep1
The REPLACE chased the car
 print rep2
The REPLACE cat parade was under way
...what I'm looking for is a match for the position in front of Cat, 
should it exist.


On Mar 8, 2005, at 5:54 PM, Sean Perry wrote:
Mike Hall wrote:
I'd like to get a match for a position in a string preceded by a 
specified word (let's call it Dog), unless that spot in the string 
(after Dog) is directly followed by a specific word(let's say 
Cat), in which case I want my match to occur directly after Cat, 
and not Dog.
I can easily get the spot after Dog, and I can also get it to 
ignore this spot if Dog is followed by Cat. But what I'm having 
trouble with is how to match the spot after Cat if this word does 
indeed exist in the string.
>>> import re
>>> # the ? means find one or zero of these, in other words cat is optional.
>>> my_re = re.compile(r'(dog)(cat)?')
>>> m = my_re.search("This is a nice dog is it not?")
>>> dir(m)
['__copy__', '__deepcopy__', 'end', 'expand', 'group', 'groupdict',
'groups', 'span', 'start']
>>> m.span()
(15, 18)
>>> m = my_re.search("This is a nice dogcat is it not?")
>>> m.span()
(15, 21)

If m is None then no match was found. span returns the locations in
the string where the match occurred. So in the dogcat sentence the match
ends at index 21.

>>> "This is a nice dogcat is it not?"[21:]
' is it not?'
Hope that helps.


Re: [Tutor] regular expression question

2005-03-08 Thread Sean Perry
Mike Hall wrote:
First, thanks for the response. Using your re:
my_re = re.compile(r'(dog)(cat)?')

...I seem to simply be matching the pattern Dog.  Example:
  str1 = "The dog chased the car"
  str2 = "The dog cat parade was under way"
  x1 = re.compile(r'(dog)(cat)?')
  rep1 = x1.sub("REPLACE", str1)
  rep2 = x1.sub("REPLACE", str2)
  print rep1
The REPLACE chased the car
  print rep2
The REPLACE cat parade was under way
...what I'm looking for is a match for the position in front of Cat, 
should it exist.

Because my regex says 'look for the word dog and remember where you 
found it. If you also find the word cat, remember that too'. Nowhere 
does it say watch out for whitespace.

r'(dog)\s*(cat)?' says match 'dog' followed by zero or more whitespace 
(spaces, tabs, etc.) and maybe 'cat'.

There is a wonderful O'Reilly book called Mastering Regular 
Expressions or as Danny points out the AMK howto is good.


Re: [Tutor] regular expression question

2005-03-08 Thread Danny Yoo


On Tue, 8 Mar 2005, Mike Hall wrote:

 Yes, my existing regex is using a look behind assertion:

 (?<=dog)

 ...it's also checking the existence of Cat:

 (?!Cat)

 ...what I'm stuck on is how to essentially use a lookbehind on Cat,
 but only if it exists.

Hi Mike,



[Note: Please do a reply-to-all next time, so that everyone can help you.]

Regular expressions are a little evil at times; here's what I think you're
thinking of:

###
>>> import re
>>> pattern = re.compile(r"""dog(?!cat)
...                    | (?<=dogcat)""", re.VERBOSE)
>>> pattern.match('dogman').start()
0
>>> pattern.search('dogcatcher').start()
>>> pattern.search('dogman').start()
0
>>> pattern.search('catwoman')

###

but I can't be sure without seeing some of the examples you'd like the
regular expression to match against.
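
If the goal is a zero-width match at the spot itself (after dog, or after dogcat when cat follows directly), one possible variant is sketched below. It is a guess at the requirement, not a tested answer to the original data; the names after_dog_or_cat and spot are made up.

```python
import re

after_dog_or_cat = re.compile(r"""
    (?<=dog)(?!cat)    # just after dog, unless cat comes next
  | (?<=dogcat)        # otherwise just after dogcat
""", re.VERBOSE)

def spot(text):
    """Return the index of the first matching position, or None."""
    m = after_dog_or_cat.search(text)
    return None if m is None else m.start()

print(spot('dogman'))
print(spot('dogcatcher'))
print(spot('catwoman'))

# Because the match is zero-width, sub() inserts text without removing any.
marked = after_dog_or_cat.sub("|", "dogman dogcatcher")
print(marked)
```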


Best of wishes to you!



Re: [Tutor] regular expression question

2005-03-08 Thread Mike Hall
This will match the position in front of dog:
(?<=dog)
This will match the position in front of cat:
(?<=cat)
This will not match in front of dog if dog is followed by cat:
(?<=dog)\b (?!cat)
Now my question is how to get this:
(?<=cat)
...but ONLY if cat is following dog. If dog does not have cat
following it, then I simply want this:

(?<=dog)

...if that makes sense :) thanks.

On Mar 8, 2005, at 6:05 PM, Danny Yoo wrote:

On Tue, 8 Mar 2005, Mike Hall wrote:
I'd like to get a match for a position in a string preceded by a
specified word (let's call it Dog), unless that spot in the string
(after Dog) is directly followed by a specific word(let's say  
Cat),
in which case I want my match to occur directly after Cat, and not
Dog.
Hi Mike,
You may want to look at lookahead assertions.  These are patterns of the
form '(?=...)' or '(?!...)'.  The documentation mentions them here:

   http://www.python.org/doc/lib/re-syntax.html
and AMK's excellent Regular Expression HOWTO covers how one might use
them:
http://www.amk.ca/python/howto/regex/regex.html#SECTION00054



Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread Danny Yoo


On Thu, 13 Jan 2005, kumar s wrote:

 My list looks like this: List name = probe_pairs
 Name=AFFX-BioB-5_at
 Cell1=96  369 N   control AFFX-BioB-5_at
 Cell2=96  370 N   control AFFX-BioB-5_at
 Cell3=441 3   N   control AFFX-BioB-5_at
 Cell4=441 4   N   control AFFX-BioB-5_at
 Name=223473_at
 Cell1=307 87  N   control 223473_at
 Cell2=307 88  N   control 223473_at
 Cell3=367 84  N   control 223473_at

 My Script:
  name1 = '[N][a][m][e][=]'


Hi Kumar,

The regular expression above can be simplified to:

'Name='

The character-class operator that you're using, with the brackets '[]', is
useful when we want to allow different kind of characters.  Since the code
appears to be looking at a particular string, the regex can be greatly
simplified by not using character classes.
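
A short check that the two spellings behave the same, using two lines modelled on the sample data (the tab separators are an assumption about the file):

```python
import re

lines = ["Name=AFFX-BioB-5_at",
         "Cell1=96\t369\tN\tcontrol\tAFFX-BioB-5_at"]

bracketed = re.compile('[N][a][m][e][=]')  # one-character classes
plain = re.compile('Name=')                # the same thing, written plainly

bracketed_hits = [bool(bracketed.match(line)) for line in lines]
plain_hits = [bool(plain.match(line)) for line in lines]
print(bracketed_hits, plain_hits)
```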



  for i in range(len(probe_pairs)):
   key = re.match(name1,probe_pairs[i])
   key


 <_sre.SRE_Match object at 0x00E37A68>
 <_sre.SRE_Match object at 0x00E37AD8>
 <_sre.SRE_Match object at 0x00E37A68>
 <_sre.SRE_Match object at 0x00E37AD8>
 <_sre.SRE_Match object at 0x00E37A68>
 . (cont. 10K
 lines)

 Here it prints a bunch of reg.match objects. However when I say group()
 it prints only one object why?


Is it possible that the edited code may have done something like this?

###
for i in range(len(probe_pairs)):
    key = re.match(name1, probe_pairs[i])
    print key
###

Without seeing what the literal code looks like, we're doomed to use our
imaginations and make up a reasonable story.  *grin*




  for i in range(len(probe_pairs)):
   key = re.match(name1,probe_pairs[i])
   key.group()


Ok, I think I see what you're trying to do.  You're using the interactive
interpreter, which tries to be nice when we use it as a calculator.  The
interactive interpreter has a special feature that prints out the result
of expressions, even though we have not explicitly put in a print
statement.


When we use a loop, like:

###
>>> for i in range(10):
...     i, i*2, i*3
...
(0, 0, 0)
(1, 2, 3)
(2, 4, 6)
(3, 6, 9)
(4, 8, 12)
(5, 10, 15)
(6, 12, 18)
(7, 14, 21)
(8, 16, 24)
(9, 18, 27)
###

If the body of the loop contains a single expression, then Python's
interactive interpreter will try to be nice and print that expression
through each iteration.


The automatic expression-printing feature of the interactive interpreter
is only for our convenience.  If we're not running in interactive mode,
Python will not automatically print out the values of expressions!


So in a real program, it is much better to explicitly write out the
'print' statement to show the expression on screen, if that's what you want:

###
>>> for i in range(10):
...     print (i, i*2, i*3)
...
(0, 0, 0)
(1, 2, 3)
(2, 4, 6)
(3, 6, 9)
(4, 8, 12)
(5, 10, 15)
(6, 12, 18)
(7, 14, 21)
(8, 16, 24)
(9, 18, 27)
###




 After I get the reg.match object, I tried to remove
 that match object like this:
  for i in range(len(probe_pairs)):
   key = re.match(name1,probe_pairs[i])
   del key
   print probe_pairs[i]


The match object has a separate existence from the string
'probe_pairs[i]'.  Your code does drop the 'match' object, but this does
nothing to change the string in probe_pairs[i].

The code above, removing those two lines that play with the 'key', reduces
down back to:

###
for i in range(len(probe_pairs)):
    print probe_pairs[i]
###

which is why you're not seeing any particular change in the output.

I'm not exactly sure you really need to do regular expression stuff here.
Would the following work for you?

###
for probe_pair in probe_pairs:
    if not probe_pair.startswith('Name='):
        print probe_pair
###






 Name=AFFX-BioB-5_at
 Cell1=96  369 N   control AFFX-BioB-5_at
 Cell2=96  370 N   control AFFX-BioB-5_at
 Cell3=441 3   N   control AFFX-BioB-5_at

 Result shows that that Name** line has not been deleted.


What do you want to see?  Do you want to see:

###
AFFX-BioB-5_at
Cell1=96369 N   control AFFX-BioB-5_at
Cell2=96370 N   control AFFX-BioB-5_at
Cell3=441   3   N   control AFFX-BioB-5_at
###


or do you want to see this instead?

###
Cell1=96369 N   control AFFX-BioB-5_at
Cell2=96370 N   control AFFX-BioB-5_at
Cell3=441   3   N   control AFFX-BioB-5_at
###


Good luck to you!



Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread Liam Clarke
...as do I.

openFile=file("probe_pairs.txt","r")
probe_pairs=openFile.readlines()

openFile.close()

indexesToRemove=[]

for lineIndex in range(len(probe_pairs)):

    if probe_pairs[lineIndex].startswith("Name="):
        indexesToRemove.append(lineIndex)

for index in indexesToRemove:
    probe_pairs[index]=''

Could just be

openFile=file("probe_pairs.txt","r")
probe_pairs=openFile.readlines()

openFile.close()

for lineIndex in range(len(probe_pairs)):

    if probe_pairs[lineIndex].startswith("Name="):
        probe_pairs[lineIndex]=''





On Fri, 14 Jan 2005 09:38:17 +1300, Liam Clarke [EMAIL PROTECTED] wrote:
   name1 = '[N][a][m][e][=]'
   for i in range(len(probe_pairs)):
  key = re.match(name1,probe_pairs[i])
  key
 
  <_sre.SRE_Match object at 0x00E37A68>
  <_sre.SRE_Match object at 0x00E37AD8>
  <_sre.SRE_Match object at 0x00E37A68>
  <_sre.SRE_Match object at 0x00E37AD8>
  <_sre.SRE_Match object at 0x00E37A68>
 
 
 You are overwriting key each time you iterate. key.group() gives the
 matched characters in that object, not a group of objects!!!
 
 You want
   name1 = '[N][a][m][e][=]'
   keys=[]
   for i in range(len(probe_pairs)):
       key = re.match(name1,probe_pairs[i])
       keys.append(key)

   print keys
 
  'Name='
 
  1. My aim:
  To remove those Name= lines from my probe_pairs
  list
 
 Why are you deleting the object key?
 
   for i in range(len(probe_pairs)):
  key = re.match(name1,probe_pairs[i])
  del key
  print probe_pairs[i]
 
 Here's the easy way. Assuming that probe_pairs is stored in a file called
 probe_pairs.txt
 
 openFile=file("probe_pairs.txt","r")
 probe_pairs=openFile.readlines()
 
 openFile.close()
 
 indexesToRemove=[]
 
 for lineIndex in range(len(probe_pairs)):
 
     if probe_pairs[lineIndex].startswith("Name="):
         indexesToRemove.append(lineIndex)
 
 for index in indexesToRemove:
     probe_pairs[index]=''
 
 Try that.
 
 Argh, my head. You do some strange things to Python.
 
 Liam Clarke
 
 On Thu, 13 Jan 2005 10:56:00 -0800 (PST), kumar s [EMAIL PROTECTED] wrote:
  Dear group:
 
  My list looks like this: List name = probe_pairs
  Name=AFFX-BioB-5_at
  Cell1=96369 N   control AFFX-BioB-5_at
  Cell2=96370 N   control AFFX-BioB-5_at
  Cell3=441   3   N   control AFFX-BioB-5_at
  Cell4=441   4   N   control AFFX-BioB-5_at
  Name=223473_at
  Cell1=307   87  N   control 223473_at
  Cell2=307   88  N   control 223473_at
  Cell3=367   84  N   control 223473_at
 
  My Script:
   name1 = '[N][a][m][e][=]'
   for i in range(len(probe_pairs)):
  key = re.match(name1,probe_pairs[i])
  key
 
  <_sre.SRE_Match object at 0x00E37A68>
  <_sre.SRE_Match object at 0x00E37AD8>
  <_sre.SRE_Match object at 0x00E37A68>
  <_sre.SRE_Match object at 0x00E37AD8>
  <_sre.SRE_Match object at 0x00E37A68>
  . (cont. 10K
  lines)
 
  Here it prints a bunch of reg.match objects. However
  when I say group() it prints only one object why?
 
  Alternatively:
   for i in range(len(probe_pairs)):
  key = re.match(name1,probe_pairs[i])
  key.group()
 
  'Name='
 
  1. My aim:
  To remove those Name= lines from my probe_pairs
  list
 
  with name1 as the pattern, I asked using re.match()
  method to identify the lines and then remove by using
  re.sub(pat,'',string) method.  I want to substitute
  Name=*** line by an empty string.
 
  After I get the reg.match object, I tried to remove
  that match object like this:
   for i in range(len(probe_pairs)):
  key = re.match(name1,probe_pairs[i])
  del key
  print probe_pairs[i]
 
  Name=AFFX-BioB-5_at
  Cell1=96369 N   control AFFX-BioB-5_at
  Cell2=96370 N   control AFFX-BioB-5_at
  Cell3=441   3   N   control AFFX-BioB-5_at
 
  Result shows that that Name** line has not been
  deleted.
 
  Is the way I am doing a good one. Could you please
  suggest a good simple method.
 
  Thanks in advance
  K
 
 
 
 
 
 --
 'There is only one basic human right, and that is to do as you damn well 
 please.
 And with it comes the only basic human duty, to take the consequences.
 


-- 
'There is only one basic human right, and that is to do as you damn well please.
And with it comes the only basic human duty, to take the consequences.


Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread Jeff Shannon
Liam Clarke wrote:
openFile=file("probe_pairs.txt","r")
probe_pairs=openFile.readlines()
openFile.close()
indexesToRemove=[]
for lineIndex in range(len(probe_pairs)):
    if probe_pairs[lineIndex].startswith("Name="):
        probe_pairs[lineIndex]=''
If the intent is simply to remove all lines that begin with Name=, 
and setting those lines to an empty string is just shorthand for that, 
it'd make more sense to do this with a filtering list comprehension:

openfile = open("probe_pairs.txt","r")
probe_pairs = openfile.readlines()
openfile.close()
probe_pairs = [line for line in probe_pairs \
  if not line.startswith('Name=')]
(The '\' line continuation isn't strictly necessary, because the open 
list-comp will do the same thing, but I'm including it for 
readability's sake.)

If one wants to avoid list comprehensions, you could instead do:
openfile = open("probe_pairs.txt","r")
probe_pairs = []
for line in openfile.readlines():
    if not line.startswith('Name='):
        probe_pairs.append(line)
openfile.close()
Either way, lines that start with 'Name=' get thrown away, and all 
other lines get kept.

Jeff Shannon
Technician/Programmer
Credit International


Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread jfouhy
Quoting Jeff Shannon [EMAIL PROTECTED]:

 If the intent is simply to remove all lines that begin with Name=, 
 and setting those lines to an empty string is just shorthand for that, 
 it'd make more sense to do this with a filtering list comprehension:
[...]
 If one wants to avoid list comprehensions, you could instead do:
[...]

Or, since we're filtering, we could use the filter() function!

probe_pairs = filter(lambda x: not x.startswith('Name='), probe_pairs)

(hmm, I wonder which is the faster option...)
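
One way to find out is the standard timeit module. The sketch below is written for current Python (where filter() returns an iterator, hence the list() call) and uses made-up sample data; the timings depend on the machine and the Python version, so no numbers are claimed here.

```python
import timeit

# Made-up sample data standing in for the real probe_pairs list.
probe_pairs = ["Name=AFFX-BioB-5_at", "Cell1=96\t369", "Cell2=96\t370"] * 1000

def with_comprehension():
    return [line for line in probe_pairs if not line.startswith('Name=')]

def with_filter():
    return list(filter(lambda x: not x.startswith('Name='), probe_pairs))

if __name__ == "__main__":
    for func in (with_comprehension, with_filter):
        seconds = timeit.timeit(func, number=100)
        print(func.__name__, seconds)
```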

-- 
John.


Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread Jörg Wölke
 Quoting Jeff Shannon [EMAIL PROTECTED]:

[ snip ]

 (hmm, I wonder which is the faster option...)

Probably grep -v ^Name= filename
 
 -- 
 John.

0.2 EUR, Jo!


-- 
We're an empire now, and we create our own reality.
We are history's actors, and you, all of you, will be left
to just study the reality we have created.
-- Karl Rove to Ron Suskind (NYT)



Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread kumar s
Hello group:
thank you for the suggestions. It worked for me using 

if not line.startswith('Name='): expression. 


I have been practising regular expression problems. I
always stumble over one simple thing. After obtaining
either a search object or a match object, I am unable
to apply certain methods on these objects to get
stuff. 

I have looked into many books including my favs
(Learning Python and Alan Gauld's Learn to Program Using
Python). I did not find an answer to the basic question: how can I
get what I want out of the returned regex match
object (search(), match())?

For example:

I have a simple list like the following:

 seq
['probe:HG-U133B:20_s_at:164:623;
Interrogation_Position=6649; Antisense;',
'TCATGGCTGACAACCCATCTTGGGA']


Now I intend to extract particular pattern and write
to another list say: desired[]

What I want to extract:
I want to extract 164:623:
Which always comes after _at: and ends with ;
2. The second pattern/number I want to extract is
6649:
This always comes after position=.

How I want to put to desired[]:

 desired
['164:623|6649', 'TCATGGCTGACAACCCATCTTGGGA']

I write a pattern:


pat = '[0-9]*[:][0-9]*'
pat1 = '[_Position][=][0-9]*'

>>> for line in seq:
...     pat = '[0-9]*[:][0-9]*'
...     pat1 = '[_Position][=][0-9]*'
...     print (re.search(pat,line) and re.search(pat1,line))
...
<_sre.SRE_Match object at 0x163CAF00>
None


Now I know that I have a hit in the seq list, evident
from <_sre.SRE_Match object at 0x163CAF00>.


Here is the black box:

What kind of operations can I do on this to get those
two matches: 
164:623 and 6649. 


I read 
http://www.python.org/doc/2.2.3/lib/re-objects.html


This did not help me to progress further. May I
request tutors to give a small note explaining things?
In Alan Gauld's book, most of the explanation stopped
at the level of
<_sre.SRE_Match object at 0x163CAF00>.
After that there is no example where he did some
operations on these objects.  If I am wrong, I might
have skipped or missed it. Apologies for that. 

Thank you very much in advance. 

K










Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread Alan Gauld
 My list looks like this: List name = probe_pairs
 Name=AFFX-BioB-5_at
 Cell1=96 369 N control AFFX-BioB-5_at
 Cell2=96 370 N control AFFX-BioB-5_at
 Cell3=441 3 N control AFFX-BioB-5_at
 Cell4=441 4 N control AFFX-BioB-5_at
 ...
 My Script:
  name1 = '[N][a][m][e][=]'

Why not just: 'Name=' - the result is the same.

  for i in range(len(probe_pairs)):

and why not just
   for line in probe_pairs:
  key = re.match(name1,line)

Although I suspect it would be easier still to use

line.startswith('Name=')

especially combined with the fileinput module.
It is really quite good for line by line
matching/processing of files, and I assume this
data comes from a file originally?.

 key = re.match(name1,probe_pairs[i])
 key
 <_sre.SRE_Match object at 0x00E37A68>

One per line that matches.

 when I say group() it prints only one object why?

Because the group is the string you are looking for.
But I may be missing something: since there is no
indentation in the post, it's hard to tell what's
inside and what's outside the loop.

 1. My aim:
 To remove those Name= lines from my probe_pairs
 list

  for i in range(len(probe_pairs)):
 key = re.match(name1,probe_pairs[i])
 del key

That will remove the match object, not the line from
the list!
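
A small sketch of that point, using a shortened, made-up version of the list: del on the name key only unbinds that name, and the list keeps every element.

```python
import re

# Shortened, hypothetical version of the probe_pairs list.
probe_pairs = ["Name=AFFX-BioB-5_at", "Cell1=96 369 N control", "Name=223473_at"]

for line in probe_pairs:
    key = re.match("Name=", line)
    del key  # unbinds the name 'key'; probe_pairs is untouched

print(len(probe_pairs))  # still 3
```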

To filter the list I'd have thought you'd be better using
a list comprehension:

filtered = [line for line in probe_pairs if not
line.startswith('Name=')]

HTH,

Alan G.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread Alan Gauld
 I have looked into many books including my favs(
 Learning Python and Alan Gauld's Learn to program using

Yes this is pushing regex a bit further than I show in my book.

 What I want to extract:
 I want to extract 164:623:
 Which always comes after _at: and ends with ;

You should be able to use the group() method to extract 
the matching string out of the match object.

 2. The second pattern/number I want to extract is
 6649:
 This always comes after position=.
 
 How I want to put to desired[]:
 
  desired
 ['164:623|6649', 'TCATGGCTGACAACCCATCTTGGGA']
 
 I write a pattern:
 
 
 pat = '[0-9]*[:][0-9]*'
 pat1 = '[_Position][=][0-9]*'
 
  for line in seq:
 pat = '[0-9]*[:][0-9]*'
 pat1 = '[_Position][=][0-9]*'

pat1 = [_Position] will match any *one* of the characters 
in _Position, is that really what you want?

I suspect it's:

'_Position=[0-9]*'

Which is the fixed string followed by any number (including zero)
of digits.
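
To see the difference between the two patterns, here is a minimal sketch. The sample line is an assumption, since the original data line is not shown in this message.

```python
import re

# Hypothetical sample text; the real line is not shown in the thread.
line = "Interrogation_Position=6649;"

# Character class: matches ONE character out of _, P, o, s, i, t, n
# and then '=' and any digits.
char_class = re.search(r"[_Position][=][0-9]*", line)

# Literal string: matches the whole word '_Position' and then '=' and digits.
literal = re.search(r"_Position=[0-9]*", line)

print(char_class.group())  # n=6649
print(literal.group())     # _Position=6649
```

The character-class version happens to find the digits here, but only because the last letter before the = sign is in the set.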

 print (re.search(pat,line) and re.search(pat1,line))

This is asking print to print the value of your "and" expression,
which if the first search fails will be that failure (None) and if
it succeeds will be the result of the second search. Check
the section on Functional Programming in my tutor to see why.
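
A quick sketch of how "and" behaves with match objects. The sample line and the patterns are assumptions made up for illustration.

```python
import re

# Hypothetical sample line; the real data is not shown in this message.
line = "1007_s_at:164:623; Interrogation_Position=6649;"

first = re.search(r"[0-9]+:[0-9]+", line)
second = re.search(r"_Position=[0-9]+", line)
missing = re.search(r"no such text", line)  # None, nothing matches

# 'and' returns one of its operands, not True or False.
both = first and second        # first is truthy, so this is 'second'
neither = missing and second   # missing is None, so this is None

print(both is second)   # True
print(neither)          # None
```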

 <_sre.SRE_Match object at 0x163CAF00>
 None

Looks like your searches worked but of course you don't have 
the match objects stored so you can't use them for anything.
But I'm suspicious of where the None is coming from...
Again, without any indentation showing, it's not totally clear
what your code looks like.

 What kind of operations can I do on this to get those
 two matches: 
 164:623 and 6649. 

I think that if you keep the match objects you can use group() 
to extract the thing that matched which should be close to 
what you want.
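
Something along these lines might do it. The sample line is invented to fit the description (the first value follows "_at:" and ends with ";", the second follows "Position="), and the parentheses are capturing groups, so group(1) returns just the part of interest.

```python
import re

# Hypothetical sample line built from the description in the question.
line = "probe:1007_s_at:164:623; Interrogation_Position=6649; Antisense;"

# Keep the match objects so they can be queried afterwards.
coords = re.search(r"_at:([0-9]+:[0-9]+);", line)
position = re.search(r"_Position=([0-9]+)", line)

combined = None
if coords and position:
    # group(1) is the text captured by the first pair of parentheses.
    combined = coords.group(1) + "|" + position.group(1)

print(combined)  # 164:623|6649
```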

 In Alan Gauld's book, most of the explanation stopped
 at the <_sre.SRE_Match object at 0x163CAF00> level.

Yep, personally I use regex to find the line then extract 
the data I need from the line manually. Messing around with 
match objects is something I try to avoid and so left it 
out of the tutor.

:-)

Alan G.


Re: [Tutor] Regular expression re.search() object . Please help

2005-01-13 Thread Jacob S.
I assume that both you and Liam are using older versions of Python?
Files are now iterators over their lines, so you can do this.

openFile = open('probe_pairs.txt', 'r')
indexesToRemove = []
for line in openFile:
    if line.startswith('Name='):
        line = ''   ## Ooops, this won't work because it just changes what
                    ## line references or points to or whatever, it doesn't
                    ## actually change the object
openFile.close()

I think this is a great flaw with using "for element in list" instead of
"for index in range(len(list))".
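
A short sketch of the problem, with made-up data. Assigning to the loop variable only rebinds the name. To change the list while still looping over its elements, enumerate() supplies the index alongside each item.

```python
# Hypothetical sample lines standing in for the file contents.
lines = ["Name=AFFX-BioB-5_at\n", "Cell1=96\t369\n", "Name=223473_at\n"]

for line in lines:
    if line.startswith("Name="):
        line = ""  # rebinds the local name only; the list is unchanged

unchanged = list(lines)  # snapshot: still holds the 'Name=' lines

# enumerate() gives the index too, so the list element can be replaced.
for index, line in enumerate(lines):
    if line.startswith("Name="):
        lines[index] = ""

print(unchanged)
print(lines)
```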

Of course you later put IMHO better solutions (list comprehensions,etc.) -- 
I just wanted to point out that files have been
iterators for a few versions now and that is being used more and more. It
also saves the memory problem of using big files
and reading them all at once with readlines.
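
To illustrate the memory point, here is a sketch that filters line by line with a generator. The helper name without_name_lines is made up, and io.StringIO stands in for a real open file so the example runs on its own.

```python
import io

def without_name_lines(handle):
    """Yield lines from a file-like object, skipping 'Name=' lines."""
    for line in handle:  # the file is read one line at a time
        if not line.startswith("Name="):
            yield line

# io.StringIO stands in for open('probe_pairs.txt') in this sketch.
source = io.StringIO(
    "Name=AFFX-BioB-5_at\n"
    "Cell1=96\t369\tN\tcontrol\n"
    "Name=223473_at\n"
    "Cell1=307\t87\tN\tcontrol\n"
)

kept = list(without_name_lines(source))
print(kept)
```

Only the lines that are kept end up in memory, whereas readlines() loads the whole file first.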

Jacob

 Liam Clarke wrote:

  openFile=open('probe_pairs.txt','r')
  probe_pairs=openFile.readlines()
  openFile.close()

  indexesToRemove=[]

  for lineIndex in range(len(probe_pairs)):
      if probe_pairs[lineIndex].startswith('Name='):
          probe_pairs[lineIndex]=''

 If the intent is simply to remove all lines that begin with Name=,
 and setting those lines to an empty string is just shorthand for that,
 it'd make more sense to do this with a filtering list comprehension:

  openfile = open('probe_pairs.txt','r')
  probe_pairs = openfile.readlines()
  openfile.close()

  probe_pairs = [line for line in probe_pairs \
if not line.startswith('Name=')]


 (The '\' line continuation isn't strictly necessary, because the open
 bracket of the list comprehension already lets the expression continue
 onto the next line, but I'm including it for readability's sake.)

 If one wants to avoid list comprehensions, you could instead do:

  openfile = open('probe_pairs.txt','r')
  probe_pairs = []

  for line in openfile.readlines():
      if not line.startswith('Name='):
          probe_pairs.append(line)

  openfile.close()

 Either way, lines that start with 'Name=' get thrown away, and all
 other lines get kept.

 Jeff Shannon
 Technician/Programmer
 Credit International



