Re: HTML Tags and multiline regular expressions.

David Bovill Wed, 09 Aug 2006 11:49:54 -0700

OK - here is what I have got so far.

First I gave up on the multiline thing... for now I just replace all
lineFeeds with empty. I would still like to know how to do this properly
longer term. This is my function:


function html_ExtractTagContents tagName, someHtml

    -- get the first one only
    -- using white space char "\s*" all over the place

    local tagContents -- not sure if it is still required
    local someReg

    -- (.*?) is minimized, so the match stops at the first closing tag
    put "<\s*" & tagName & "\s+name=[^>]*>(.*?)<\s*/\s*" & tagName & "\s*>" into someReg
    -- put "(?m)" before someReg -- does not seem to have an effect
    replace lineFeed with empty in someHtml -- seems necessary

    if matchText(someHtml, someReg, tagContents) is false then
        return empty
    else
        return tagContents
    end if
end html_ExtractTagContents


Any improvements - especially how to do the multiline thing properly?



For reference, the following extracts were taken from the PCRE man text at:
http://www.pcre.org/man.txt

Some RegExp Info

There are two different sets of metacharacters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized in square brackets. Outside square brackets,
the metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
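The "quantifier minimizer" entry bears on the "get the first one only" aim of the function. A plain .* takes as much as it can, while .*? takes as little as it can. Here is a minimal sketch, assuming matchText hands the pattern to PCRE unchanged; the handler name demoMinimizer and the sample string are made up for illustration:

```
on demoMinimizer
    local tHtml, tGreedy, tLazy
    put "<b name=x>one</b> and <b name=y>two</b>" into tHtml

    -- greedy: .* runs on to the LAST closing tag
    get matchText(tHtml, "<b\s+name=[^>]*>(.*)</b>", tGreedy)
    -- tGreedy should hold: one</b> and <b name=y>two

    -- minimized: .*? stops at the FIRST closing tag
    get matchText(tHtml, "<b\s+name=[^>]*>(.*?)</b>", tLazy)
    -- tLazy should hold: one

    put tGreedy & return & tLazy
end demoMinimizer
```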

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
         syntax)
  ]      terminates the character class
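The [^>]* in the function relies on the first of these rules: a leading ^ negates the class, so it matches everything up to the next ">". A small sketch of both the negation and the range rule, assuming matchText follows PCRE here; the handler name and sample strings are hypothetical:

```
on demoCharacterClass
    local tAttributes, tHex

    -- ^ first in the class negates it: capture up to the next ">"
    get matchText("<div name=main id=top>text</div>", "<div([^>]*)>", tAttributes)
    -- tAttributes should hold: " name=main id=top" (with the leading space)

    -- "-" between two characters denotes a range
    get matchText("colour #ff8800;", "#([0-9a-f]+)", tHex)
    -- tHex should hold: ff8800

    put tAttributes & return & tHex
end demoCharacterClass
```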



Generic character types

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character

Non-printing characters

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..

The backslashed assertions are:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject
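As a quick illustration of two of these assertions, here is a sketch that checks \b and \A with matchText. It assumes matchText passes the assertions through to PCRE; the handler name and test strings are invented for the example:

```
on demoAssertions
    local tWhole, tInside, tAtStart

    -- \b asserts a word boundary, so "cat" must stand alone
    put matchText("the cat sat", "\bcat\b") into tWhole      -- true
    put matchText("concatenate", "\bcat\b") into tInside     -- false

    -- \A anchors at the very start of the subject
    put matchText("  <html>", "\A<html>") into tAtStart      -- false: leading spaces

    put tWhole & comma & tInside & comma & tAtStart
end demoAssertions
```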

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options can be changed from within the pattern by a
sequence of Perl option letters enclosed between "(?" and ")". The
option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.
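
Reading the option letters against the metacharacter list suggests why "(?m)" had no visible effect: PCRE_MULTILINE only changes where ^ and $ match, whereas PCRE_DOTALL is the option that lets "." match a newline. Below is a sketch of the extraction function with "(?s)" in front, so the lineFeeds need not be stripped. It assumes matchText honours inline options, and the name html_ExtractTagContentsDotAll is made up here:

```
function html_ExtractTagContentsDotAll tagName, someHtml
    local tagContents, someReg

    -- (?s) sets PCRE_DOTALL, so "." also matches line feeds
    -- (.*?) is minimized, so only the first element is captured
    put "(?s)<\s*" & tagName & "\s+name=[^>]*>(.*?)<\s*/\s*" & tagName & "\s*>" into someReg

    if matchText(someHtml, someReg, tagContents) then
        return tagContents
    else
        return empty
    end if
end html_ExtractTagContentsDotAll
```

For example, html_ExtractTagContentsDotAll("div", "<div name=a>line 1" & lineFeed & "line 2</div>") should return both lines with the lineFeed between them intact.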