Giampaolo Tomassoni wrote:
> > From: LuKreme [mailto:krem...@kreme.com]
> >
> > On 25-Mar-2009, at 11:24, Giampaolo Tomassoni wrote:
> > > rawbody LARGETABLE
> > > m'<tr\W(?:[^<]|<(?!t[dr]\W))*(?:<td\W(?:[^<]|<(?!t[rd]\W))*){30,}</tr'is
> >
> > Just to be sure my parsing is working correctly, that is flagging if
> > there are 30 or more TDs in a single TR?
>
> Right.
>
> > If so, couldn't that be
> > written a lot more compactly?
>
> Probably yes. The problem is that a simple way like
> '<tr\W.*(?:<td\W.*){30,}</tr' would easily fail because the '*'
> operator would work "greedily" here, consuming <td>s and <tr>s which
> should instead be counted.

Then why not use the non-greedy version?

<tr\W.*?(?:<td\W.*?){30,}</tr

On the other hand, '.*' of any kind is usually a bad idea in a
SpamAssassin rule. It should always be limited to avoid excessive
backtracking.

<tr\W.{0,20}?(?:<td\W.{0,20}?){30,}</tr

I pulled the 20 character limit out of thin air. Change it to whatever
makes sense for this rule.

-- 
Bowie
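
For anyone who wants to try the patterns from this thread before putting
one in a rule, here is a rough Python sketch. It assumes Python's re module
is a close enough stand-in for Perl's regex engine for these particular
patterns. The row() helper and the sample inputs are made up for
illustration and are not taken from real mail.

```python
import re

# re.I | re.S mirrors the 'is' modifiers on the original rule.
FLAGS = re.I | re.S

# The rule as originally posted: the filler may not step over <td or <tr.
ORIGINAL = re.compile(
    r"<tr\W(?:[^<]|<(?!t[dr]\W))*(?:<td\W(?:[^<]|<(?!t[rd]\W))*){30,}</tr", FLAGS
)
# Non-greedy version with unbounded '.*?'.
LAZY = re.compile(r"<tr\W.*?(?:<td\W.*?){30,}</tr", FLAGS)
# Non-greedy version with an explicit limit of 20 characters per gap.
BOUNDED = re.compile(r"<tr\W.{0,20}?(?:<td\W.{0,20}?){30,}</tr", FLAGS)


def row(cells, text="cell"):
    """Hypothetical helper: one <tr> holding `cells` identical <td> cells."""
    return "<tr>" + f"<td>{text}</td>" * cells + "</tr>"


wide = row(30)                     # 30 short cells in a single row
split = row(15) + row(15)          # 30 cells spread over two rows
narrow = row(29)                   # one cell short of the threshold
verbose = row(30, text="x" * 40)   # 30 cells, each longer than the limit

checks = [
    ("ORIGINAL", ORIGINAL, "wide", wide),
    ("LAZY", LAZY, "wide", wide),
    ("BOUNDED", BOUNDED, "wide", wide),
    ("ORIGINAL", ORIGINAL, "split", split),
    ("LAZY", LAZY, "split", split),
    ("BOUNDED", BOUNDED, "narrow", narrow),
    ("ORIGINAL", ORIGINAL, "verbose", verbose),
    ("BOUNDED", BOUNDED, "verbose", verbose),
]
# LAZY is deliberately not run against `narrow`: a failing match with
# unbounded '.*?' inside a repeated group can backtrack for a very long time.

for name, pattern, label, text in checks:
    print(f"{name:8} on {label:7}: {'hit' if pattern.search(text) else 'no hit'}")
```

With these inputs, all three patterns should hit the 30-cell row. The
original rule should reject the two 15-cell rows, while the lazy pattern
matches across the row boundary, so laziness alone does not keep the count
inside a single TR. The bounded pattern should miss the row whose cells are
longer than the limit, which is the trade-off to keep in mind when picking
that number.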