Re: Regular expression to structure HTML
504cr...@gmail.com wrote:
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.

I think the reason why people are giving funny comments here is that you failed to provide a reason for the above requirement. That makes it sound like a typical "How can I use X to do Y?" question.

http://www.catb.org/~esr/faqs/smart-questions.html#id383188

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
Re: Regular expression to structure HTML
On Thu, 01 Oct 2009 22:10:55 -0700, 504cr...@gmail.com wrote:
> I'm kind of new to regular expressions

The most important thing to learn about regular expressions is to learn what they can do, what they can't do, and what they can do in theory but can't do in practice (usually because of exponential or combinatorial growth).

One thing they can't do is to match any kind of construct which has arbitrary nesting. E.g. you can't match any class of HTML element which can self-nest or whose children can self-nest. In practice, this means you can only match a handful of elements which are either empty (e.g. ) or which can only contain CDATA (e.g.
Re: Regular expression to structure HTML
On Oct 2, 11:14 pm, greg wrote:
> Brian D wrote:
> > This isn't merely a question of knowing when to use the right
> > tool. It's a question about how to become a better developer using
> > regular expressions.
>
> It could be said that if you want to learn how to use a
> hammer, it's better to practise on nails rather than
> screws.
>
> --
> Greg

It could be said that the bandwidth in technical forums should be reserved for on-topic exchanges, not flaming intelligent people who might have something to contribute to the forum.

The truth is, I found a solution where others were ostensibly either too lazy to attempt one, or too eager to grandstand their superiority to assist.

Who knows -- maybe I'll provide an alternative to BeautifulSoup one day.
Re: Regular expression to structure HTML
Brian D wrote:
> This isn't merely a question of knowing when to use the right
> tool. It's a question about how to become a better developer using
> regular expressions.

It could be said that if you want to learn how to use a hammer, it's better to practise on nails rather than screws.

--
Greg
Re: Regular expression to structure HTML
Screw:

>>> html = """ 14313 Python Hammer Institute #2 Jefferson 70114 8583 New Screwdriver Technical Academy, Inc #4 Jefferson 70114 9371 Career RegEx Center Jefferson 70113 """

Hammer:

First remove line returns. Then remove extra spaces. Then insert a line return to restore logical rows on each combination. For more information, see: http://www.qc4blog.com/?p=55

>>> s = re.sub(r'\n','', html)
>>> s = re.sub(r'\s{2,}', '', s)
>>> s = re.sub('()()', r'\1\n\2', s)
>>> print s
14313Python Hammer Institute #2Jefferson70114
8583New Screwdriver Technical Academy, Inc #4Jefferson70114
9371Career RegEx CenterJefferson70113
>>> p = re.compile(r"()(?P\d+)()(>> valign=top>)(>> href=lic_details\.asp)(\?lic_number=\d+)(>)(?P[\s\S\WA-Za-z0-9]*?)()()(?:>> valign=top>)(?P[\s\WA-Za-z]+)()(>> valign=top>)(?P\d+)()()$", re.M)
>>> n = p.sub(r'LICENSE:\g|NAME:\g|PARISH:\g|ZIP:\g', s)
>>> print n
LICENSE:14313|NAME:Python Hammer Institute #2|PARISH:Jefferson|ZIP:70114
LICENSE:8583|NAME:New Screwdriver Technical Academy, Inc #4|PARISH:Jefferson|ZIP:70114
LICENSE:9371|NAME:Career RegEx Center|PARISH:Jefferson|ZIP:70113
>>>

The solution was to escape the period in the ".asp" string, e.g., "\.asp". I also had to limit the pattern in the grouping by using a "?" qualifier to limit the "greediness" of the "*" pattern metacharacter.

Now, who would like to turn that re.compile pattern into a MULTILINE expression, combining the re.M and re.X flags? Documentation says that one should be able to use the bitwise OR operator (e.g., re.M | re.X), but I sure couldn't get it to work.

Sometimes a hammer actually is the right tool if you hit the screw long and hard enough. I think I'll try to hit some more screws with my new hammer.

Good day.

On Oct 2, 12:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
>
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.
>
> Note that the output is referenced using named groups.
>
> My challenge is successfully matching the HTML tags in between the
> first table row, and the second table row.
>
> I'd appreciate any suggestions to improve the approach.
>
> rText = "8583 href=lic_details.asp?lic_number=8583>New Horizon Technical Academy,
> Inc #4Jefferson70114 tr>9371 lic_number=9371>Career Learning Center valign=top>Jefferson70113"
>
> rText = re.compile(r'()(?P\d+)()( valign=top>)()(?P[A-
> Za-z0-9#\s\S\W]+)().+$').sub(r'LICENSE:\g|NAME:
> \g\n', rText)
>
> print rText
>
> LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4 valign=top>Jefferson70114 valign=top>9371 lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113
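On the closing question about re.M | re.X: the bitwise OR itself is valid. What usually goes wrong is that under re.X any unescaped whitespace in the pattern is ignored and an unescaped # starts a comment, so a literal space (as in "td valign=top") has to be escaped or placed in a character class. Below is a minimal sketch in Python 3. The row markup, the make_row helper, and the group names license, name, parish and zip are all assumptions made for illustration, since the real markup did not survive in this archive.

```python
import re

def make_row(lic, name, parish, zipcode):
    """Build one hypothetical table row (assumed markup, not the real page)."""
    return ("<tr><td valign=top>%s</td>"
            "<td valign=top><a href=lic_details.asp?lic_number=%s>%s</a></td>"
            "<td valign=top>%s</td><td valign=top>%s</td></tr>"
            % (lic, lic, name, parish, zipcode))

# One row per line, as after the cleanup steps described in the message.
rows = "\n".join([
    make_row("14313", "Python Hammer Institute #2", "Jefferson", "70114"),
    make_row("8583", "New Screwdriver Technical Academy, Inc #4", "Jefferson", "70114"),
    make_row("9371", "Career RegEx Center", "Jefferson", "70113"),
])

# re.X ignores layout whitespace, so literal spaces are written as [ ].
# re.M makes ^ and $ match at the start and end of every line.
ROW = re.compile(r"""
    ^<tr>
    <td[ ]valign=top>(?P<license>\d+)</td>           # license number cell
    <td[ ]valign=top>
        <a[ ]href=lic_details\.asp\?lic_number=\d+>   # link to the details page
        (?P<name>.*?)                                 # lazy: stop at the first </a>
        </a>
    </td>
    <td[ ]valign=top>(?P<parish>[^<]+)</td>
    <td[ ]valign=top>(?P<zip>\d+)</td>
    </tr>$
""", re.M | re.X)

result = ROW.sub(
    r"LICENSE:\g<license>|NAME:\g<name>|PARISH:\g<parish>|ZIP:\g<zip>", rows)
print(result)
```

The "#" characters in the names are safe because they are in the data, not in the pattern; a literal "#" inside a verbose pattern would need to be written as \# or [#].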
Re: Regular expression to structure HTML
The other thought I had was that I may not be properly trapping the end of the first row, and the beginning of the next row.

On Oct 2, 8:38 am, John wrote:
> Some suggestions to start off with:
>
> * triple-quote your multiline strings
> * consider using the re.X, re.M, and re.S options for re.compile()
> * save your re object after you compile it
> * note that re.sub() returns a new string
>
> Also, it sounds like you want to replace the first 2 elements for
> each element with their content separated by a pipe (throwing
> away the tags themselves), correct?
>
> ---John
Re: Regular expression to structure HTML
Yes, John, that's correct. I'm trying to trap and discard the row elements, re-formatting with pipes so that I can more readily import the data into a database. The tags are, of course, initially useful for pattern discovery. But there are other approaches -- I could just replace the tags and capture the data as an array.

I'm well aware of the problems using regular expressions for html parsing. This isn't merely a question of knowing when to use the right tool. It's a question about how to become a better developer using regular expressions.

I'm trying to figure out where the regular expression fails. The structure of the page I'm scraping is uniform in the production of tags -- it's an old ASP page that pulls data from a database.

What's different in the first row is the appearance of a comma, a # pound symbol, and a number (", Inc #4"). I'm making the assumption that's what's throwing off the remainder of the regular expression -- because (despite the snark by others above) the expression is working for every other data row. But I could be wrong. Of course, if I could identify the problem, I wouldn't be asking. That's why I posted the question for other eyes to review.

I discovered that I may actually be properly parsing the data from the tags when I tried this test in a Python interpreter:

>>> s = "New Horizon Technical Academy, Inc #4"
>>> p = re.compile(r'([\s\S\WA-Za-z0-9]*)()')
>>> m = p.match(s)
>>> m.group(0)
"New Horizon Technical Academy, Inc #4"
>>> m.group(1)
"New Horizon Technical Academy, Inc #4"
>>> m.group(2)
''

I found it curious that I was capturing the groups as sequences, but I didn't understand how to use this knowledge in named groups -- or maybe I am merely mis-identifying the source of the regular expression problem.

It's a puzzle. I'm hoping someone will want to share the wisdom of their experience, not criticize for the attempt to learn.

Maybe one shouldn't learn how to use a hammer on a screw, but I wouldn't say that I have never hammered a screw into a piece of wood just because I only had a hammer.

Thanks,
Brian

On Oct 2, 8:38 am, John wrote:
> Some suggestions to start off with:
>
> * triple-quote your multiline strings
> * consider using the re.X, re.M, and re.S options for re.compile()
> * save your re object after you compile it
> * note that re.sub() returns a new string
>
> Also, it sounds like you want to replace the first 2 elements for
> each element with their content separated by a pipe (throwing
> away the tags themselves), correct?
>
> ---John
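The interpreter test in this message hints at the likely cause: a class such as [\s\S\WA-Za-z0-9] matches every character, so a greedy * or + runs to the end of the input and gives back only as much as it must, which swallows the cells that follow. Here is a minimal sketch of the difference in Python 3, using made-up <td> markup as a stand-in for the real page:

```python
import re

# Hypothetical two-cell fragment; the real page's markup is an assumption here.
text = "<td>New Horizon Technical Academy, Inc #4</td><td>Jefferson</td>"

greedy = re.match(r"<td>([\s\S]+)</td>", text)    # + takes as much as possible
lazy = re.match(r"<td>([\s\S]+?)</td>", text)     # +? takes as little as possible
negated = re.match(r"<td>([^<]+)</td>", text)     # [^<] cannot cross a tag boundary

print(greedy.group(1))   # runs on into the second cell
print(lazy.group(1))     # stops at the first </td>
print(negated.group(1))  # same text as the lazy version
```

The comma and the "#" in the name are not special here; either the lazy quantifier or the negated class keeps the capture inside one cell.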
Re: Regular expression to structure HTML
On Oct 2, 1:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
>
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.
>
> Note that the output is referenced using named groups.
>
> My challenge is successfully matching the HTML tags in between the
> first table row, and the second table row.
>
> I'd appreciate any suggestions to improve the approach.
>
> rText = "8583 href=lic_details.asp?lic_number=8583>New Horizon Technical Academy,
> Inc #4Jefferson70114 tr>9371 lic_number=9371>Career Learning Center valign=top>Jefferson70113"
>
> rText = re.compile(r'()(?P\d+)()( valign=top>)()(?P[A-
> Za-z0-9#\s\S\W]+)().+$').sub(r'LICENSE:\g|NAME:
> \g\n', rText)
>
> print rText
>
> LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4 valign=top>Jefferson70114 valign=top>9371 lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113

Some suggestions to start off with:

* triple-quote your multiline strings
* consider using the re.X, re.M, and re.S options for re.compile()
* save your re object after you compile it
* note that re.sub() returns a new string

Also, it sounds like you want to replace the first 2 elements for each element with their content separated by a pipe (throwing away the tags themselves), correct?

---John
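The four suggestions above can be sketched together roughly as follows, in Python 3. The simplified markup, the variable names, and the group names license and name are invented for illustration and are not taken from the original page.

```python
import re

# Triple-quoted so the sample can span several lines (hypothetical markup).
page = """
<tr>
  <td>8583</td>
  <td>New Horizon Technical Academy, Inc #4</td>
</tr>
<tr>
  <td>9371</td>
  <td>Career Learning Center</td>
</tr>
"""

# Compile once and keep the pattern object so it can be reused.
# re.S lets "." match newlines; re.X allows the pattern to be laid out
# over several lines.
row_re = re.compile(r"""
    <tr>\s*
    <td>(?P<license>\d+)</td>\s*
    <td>(?P<name>.*?)</td>\s*
    </tr>
""", re.S | re.X)

# sub() returns a new string; page itself is left unchanged.
flat = row_re.sub(r"LICENSE:\g<license>|NAME:\g<name>", page)
print(flat)
```

re.M is left out of this sketch because the pattern does not use ^ or $; it only matters once the pattern is anchored to line boundaries.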
Re: Regular expression to structure HTML
Paul McGuire wrote:
> On Oct 2, 12:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
>> I'm kind of new to regular expressions, and I've spent hours trying to
>> finesse a regular expression to build a substitution.
>>
>> What I'd like to do is extract data elements from HTML and structure
>> them so that they can more readily be imported into a database.
>
> Oy! If I had a nickel for every misguided coder who tried to scrape
> HTML with regexes...
>
> Some reasons why RE's are no good at parsing HTML:
> - tags can be mixed case
> - tags can have whitespace in many unexpected places
> - tags with no body can combine opening and closing tag with a '/'
>   before the closing '>', as in ""
> - tags can have attributes that you did not expect (like " CLEAR=ALL>")
> - attributes can occur in any order within the tag
> - attribute names can also be in unexpected upper/lower case
> - attribute values can be enclosed in double quotes, single quotes, or
>   even (surprise!) NO quotes

BTW, BeautifulSoup's parser also uses regexes, so if the OP used it, he/she could claim to have solved the problem "with regular expressions" without even lying.

Stefan
Re: Regular expression to structure HTML
On Oct 2, 12:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.

Oy! If I had a nickel for every misguided coder who tried to scrape HTML with regexes...

Some reasons why RE's are no good at parsing HTML:
- tags can be mixed case
- tags can have whitespace in many unexpected places
- tags with no body can combine opening and closing tag with a '/' before the closing '>', as in ""
- tags can have attributes that you did not expect (like "")
- attributes can occur in any order within the tag
- attribute names can also be in unexpected upper/lower case
- attribute values can be enclosed in double quotes, single quotes, or even (surprise!) NO quotes

For HTML that is machine-generated, you *may* be able to make some page-specific assumptions. But if edited by human hands, or if you are trying to make a generic page scraper, RE's will never cut it.

-- Paul
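Several of the variations listed above can be seen side by side with the standard library's html.parser, which lower-cases tag and attribute names and accepts quoted or unquoted values. This is a small sketch in Python 3; the two tag spellings, the naive pattern, and the HrefCollector and extract_hrefs names are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Two spellings of the same link (made-up samples): different case,
# extra whitespace, an unexpected attribute, and quoted vs unquoted values.
variants = [
    "<a href=lic_details.asp?lic_number=8583>Academy</a>",
    "<A  CLASS='x'  HREF=\"lic_details.asp?lic_number=8583\" >Academy</A>",
]

# A naive pattern written against the first spelling only.
naive = re.compile(r"<a href=([^\s>]+)>")

class HrefCollector(HTMLParser):
    """Collect href values from anchor start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased, values without quotes
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")

def extract_hrefs(html):
    """Parse one fragment and return the href values found in it."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs

for html in variants:
    print(bool(naive.search(html)), extract_hrefs(html))
```

The pattern only finds the spelling it was written for, while the parser reports the same href for both, which is the page-specific assumption Paul describes.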
Re: Regular expression to structure HTML
504cr...@gmail.com wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
>
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.

I'm kind of new to hammers, and I've spent hours trying to find out how to drive a screw with a hammer. No -- sorry -- I don't want to use a screwdriver.