On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
> Okay, was tinkering with the code below but the zero-width lookahead is
> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
> is bogus (you can run this and see what I mean).
>
> What am I doing wrong?

You are using an overly complex and fugly test case. ;)  Seriously, a
stripped-down test string does not need more than about 4 instances of
plain chars and HTML entities. Much easier on the eye.

> my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;

That RE is a single, straightforward alternation with two alternatives.

The first one translates to a single char in a given, specific range.
Basically, any 7-bit ASCII char but the ampersand.

The second alternative is an ampersand that is not followed by #xDDDD;.
The (?!pattern) is a zero-width negative look-ahead assertion. Zero-width
means it does not consume what it matches. Thus, the second alternative
ultimately will match a single ampersand only.

The /g global matching then continues where it left off after the last
matching attempt. In the case of that ampersand followed by #xDDDD;, the
attempt fails at the ampersand itself, so the next attempt starts right
after the ampersand.

  line:     Th&#x0435; R
  matches:  T,h,#,x,0,4,3,5,;, ,R

The offending ampersand part of the HTML entity encoding correctly is not
matched. The following chars do match the "anything but an ampersand"
first alternative.

I am unsure what you are trying to achieve. If you want to compare the
number of HTML entities with the number of regular chars, wouldn't it be
easier to simply drop them flat?

  $data =~ s/&#x[0-9A-F]{4};//g;

Or just plain match and count?

  @matches = $data =~ /&#x[0-9A-F]{4};/g;

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
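
For anyone who wants to reproduce the behaviour described above in isolation, here is a minimal sketch. The test string is reconstructed from the matches listed above, and the variable names ($data, @chars, @entities, $plain) are illustrative assumptions, not taken from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stripped-down test string: two plain chars, one HTML entity,
# a space and one more plain char.
my $data = 'Th&#x0435; R';

# The RE from the original post: any char in the given range (no '&'),
# or an '&' that is NOT followed by #xHHHH; (zero-width look-ahead).
my @chars = $data =~ m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;

# Plain match and count of the entities themselves.
my @entities = $data =~ m/&#x[0-9A-F]{4};/g;

# Drop the entities on a copy and keep what is left.
(my $plain = $data) =~ s/&#x[0-9A-F]{4};//g;

print join(',', @chars), "\n";    # T,h,#,x,0,4,3,5,;, ,R
print scalar(@entities), "\n";    # 1
print "$plain\n";                 # Th R
```

Only the ampersand is skipped by the first RE. The rest of the entity is made up of ordinary chars and falls through to the first alternative, which is why counting or stripping the whole entity with a dedicated RE is the simpler route.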