On Wed, 2014-07-02 at 19:10 -0600, Philip Prindeville wrote:
> On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann <guent...@rudersport.de> 
> wrote:

> > That RE is a single, straight-forward alternation with two alternatives.
> > 
> > The first one translates to a single char in a given, specific range.
> > Basically, anything but the ampersand. The second alternative is an
> > ampersand, that is not followed by #xDDDD.
> > 
> > The (?!pattern) is a zero-width negative look-ahead assertion. A zero
> > width means, it does not consume what it matches. Thus, the second
> > alternative ultimately will match a single ampersand only. The /g global
> > matching then continues where it left off after the last matching
> > attempt. In the case of that ampersand followed by #xDDDD, that still is
> > right after the ampersand.
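
To illustrate the quoted explanation, here is a minimal sketch. The
pattern is an assumption reconstructed from the description above (the
original RE is not shown here, and [^&] stands in for its "specific
range"), and the sample string is made up. The ampersand that starts an
escape is not matched, but matching resumes right after it, so the rest
of the escape is still matched one char at a time.

```perl
use strict;
use warnings;

# Assumed pattern: any single char but '&', or an '&' that is not
# followed by #xDDDD; (zero-width negative look-ahead).
my $re = qr/[^&]|&(?!#x[0-9A-F]{4};)/;

# Made-up sample: one lone ampersand, one entity escape.
my $data = 'a & b&#x0065;c';

# In list context, /g returns every match.
my @chars = $data =~ /$re/g;

# Only the '&' that starts the escape is left out. The following
# '#', 'x', '0', '0', '6', '5', ';' each match the first alternative.
print scalar(@chars), " matches: ", join('|', @chars), "\n";
```

So the look-ahead alone does not skip the escape as a whole, only its
leading ampersand.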

> Okay, so what I was trying to do is skip any ampersand followed by
> #xDDDD; as part of the matched text (but include ampersands not
> followed by #xDDDD; as part of the match).

That is the result of the plain s/&#x[0-9A-F]{4};//g global substitution
I posted.

You should define what you ultimately want to achieve, not what you
currently think is a stepping stone toward the solution.


> So that if I had the text:
> 
> This that & thos&#x0065;.
> 
> The first @match would be counted as $chars:
> 
> T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.
> 
> and the 2nd @match would be:
> 
> &#x0065;
> 
> counting as $uchars.
> 
> So in the first case, the &#x0065; would be skipped over as part of the 
> capture.

Skipped over, since it is part of the capture. That kind of contradicts
itself...

Do you want all of those (HTML entity string) matches? The raw matches
themselves? Or is that just an attempt at debug visualization? Do you
actually want only their number?

This has quite an impact on the Perl code and logic / math involved.


Number of HTML entity escapes, length (in chars) of the remainder:

  # s///g returns the number of substitutions, or false if there were none.
  my $number = ($data =~ s/&#x[0-9A-F]{4};//g) || 0;

  print "number:  ", $number, "\n";
  print "other:   ", length($data), " = '", $data, "'\n";


If you do need the complete HTML entity escapes, a quick hack to compute
the remainder:

  my @matches = $data =~ /&#x[0-9A-F]{4};/g;

  print "matches: ", scalar @matches, " = ", join(',', @matches), "\n";
  # Each &#xDDDD; escape is exactly 8 chars long.
  print "other:   ", length($data) - 8 * scalar(@matches), "\n";
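
As a rough, self-contained sketch, both numbers can also be computed
without modifying $data, using the sample text from your mail. The
variable name $rest is made up for illustration, and whether these two
counts are what you are after is an assumption on my part.

```perl
use strict;
use warnings;

# Sample text quoted earlier in this thread.
my $data = 'This that & thos&#x0065;.';

# All complete entity escapes, in list context.
my @matches = $data =~ /&#x[0-9A-F]{4};/g;

# Work on a copy, so $data itself stays untouched.
(my $rest = $data) =~ s/&#x[0-9A-F]{4};//g;

print "uchars: ", scalar(@matches), " = ", join(',', @matches), "\n";
print "chars:  ", length($rest), " = '", $rest, "'\n";
```

With that sample, the lone ampersand stays in $rest and only the escape
is removed.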


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
