Re: More text/plain questions

Karsten Bräckelmann Wed, 02 Jul 2014 16:17:30 -0700

On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
> Okay, was tinkering with the code below but the zero-width lookahead is
> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
> is bogus (you can run this and see what I mean).
> 
> What am I doing wrong?


You are using an overly complex and fugly test case. ;)  Seriously, a
stripped down test string does not require more than about 4 instances
of plain chars and HTML entities. Much easier on the eye.


>     my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;

That RE is a single, straightforward alternation with two alternatives.

The first one matches a single char in a specific range: any 7-bit ASCII
char except NUL and the ampersand (\046). The second alternative is an
ampersand that is not followed by #xDDDD; (four upper-case hex digits
and a semicolon).

The (?!pattern) is a zero-width negative look-ahead assertion. Zero-width
means it does not consume what it matches. Thus, the second alternative
will only ever match a single ampersand. The /g global matching then
continues where it left off after the last matching attempt. In the case
of an ampersand followed by #xDDDD; the attempt at the ampersand fails,
and matching resumes right after the ampersand.

  line: Th&#x0435; R
  matches: T,h,#,x,0,4,3,5,;, ,R

The offending ampersand of the HTML entity is correctly not matched. The
chars following it, however, each match the "anything but an ampersand"
first alternative.
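
To see this for yourself, here is a minimal sketch that runs your RE
against the short test line above and against a line with a bare
ampersand. The sub name match_chars is made up for this example; the RE
is yours, unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every match of the RE under discussion.
# In list context, m//g without capture groups returns all matched strings.
sub match_chars {
    my ($line) = @_;
    return $line =~ m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
}

# The ampersand starting an entity is skipped, the rest of the entity
# is matched char by char via the first alternative.
print join(',', match_chars("Th&#x0435; R")), "\n";   # T,h,#,x,0,4,3,5,;, ,R

# A bare ampersand passes the negative look-ahead and is matched.
print join(',', match_chars("A & B")), "\n";          # A, ,&, ,B
```

The look-ahead only decides whether the ampersand itself matches. It
does nothing to keep the chars after it from matching on the next
attempts.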


I am unsure what you are trying to achieve. If you want to compare the
number of HTML entities with the number of regular chars, wouldn't it be
easier to simply drop them flat?

  $data =~ s/&#x[0-9A-F]{4};//g;

Or just plain match and count?

  @matches = $data =~ /&#x[0-9A-F]{4};/g;
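
Putting both together, assuming all you want is the two counts, a rough
sketch could look like the following. The sub names count_entities and
strip_entities and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: number of &#xDDDD; entities in the string.
sub count_entities {
    my ($data) = @_;
    # The empty list assignment forces list context, then yields a count.
    my $count = () = $data =~ /&#x[0-9A-F]{4};/g;
    return $count;
}

# Hypothetical helper: copy of the string with all entities dropped.
sub strip_entities {
    my ($data) = @_;
    (my $plain = $data) =~ s/&#x[0-9A-F]{4};//g;
    return $plain;
}

my $data = "Th&#x0435; R&#x0435;d";
printf "%d entities, %d plain chars\n",
    count_entities($data), length(strip_entities($data));
# 2 entities, 5 plain chars
```

Note that this only covers 4-digit upper-case hex entities, just like
the RE you started with.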


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
