At 01:56 -0500 2001.08.21, Stuart Johnston wrote:
>Does anyone have a simple filter for URL encoding that I can use?
Not simple, no. :)
This is what I use in Slash, though. YMMV. It uses HTML::Entities and URI.
The important part for you is probably just the one regex with $URI::uric
and %URI::Escape::escapes. We have other needs too: stripping out "script:"
stuff, stripping out the "authority" (which has been a problem in Slashdot
comments), removing certain characters, etc. HTH.
use HTML::Entities;   # decode_entities()
use URI;              # URI objects and $URI::uric
use URI::Escape;      # %URI::Escape::escapes

sub fixurl {
    my($url) = @_;

    # Remove quotes and whitespace (we will expect some at beginning and
    # end, probably)
    $url =~ s/["\s]//g;
    # any < or > char after the first char truncates the URL right there
    # (we will expect a trailing ">" probably)
    $url =~ s/^[<>]+//;
    $url =~ s/[<>].*//;
    # strip surrounding ' if exists
    $url =~ s/^'(.+?)'$/$1/g;
    # add '#' to allowed characters; escape anything not allowed.
    $url =~ s/([^$URI::uric#])/$URI::Escape::escapes{$1}/oge;

    if (1) {
        # Strip the authority, if any.
        # This prevents annoying browser-display-exploits
        # like "http:[EMAIL PROTECTED]".
        # In future we may set up a package global or a field like
        # getCurrentUser()->{state}{fixurlauth} that will allow
        # this behavior to be turned off -- it's wrapped in
        # "if (1)" to remind us of this...
        my $uri = URI->new($url);
        if ($uri && $uri->can('host') && $uri->can('authority')) {
            # don't need to print the port if we
            # already have the correct port
            my $host = $uri->can('host_port') &&
                $uri->port != $uri->default_port
                ? $uri->host_port
                : $uri->host;
            $host =~ tr/A-Za-z0-9.-//cd; # per RFC 1035
            $uri->authority($host);
            $url = $uri->canonical->as_string;
        }
    }

    # we don't like SCRIPT at the beginning of a URL
    my $decoded_url = decode_entities($url);
    return $decoded_url =~ s|^\s*\w+script\b.*$||i ? undef : $url;
}
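
If all you need is the escaping step, it can probably be pulled out on its
own. Here is a minimal sketch, assuming only that URI and URI::Escape are
installed. The name escape_url and the example.com URL are made up for
illustration and are not part of Slash; the regex is the one from fixurl
above, minus the /o.

```perl
use strict;
use warnings;
use URI;          # sets $URI::uric, the characters allowed in a URI
use URI::Escape;  # sets %URI::Escape::escapes (character => "%XX")

# escape_url is a hypothetical helper name, not something Slash provides.
sub escape_url {
    my($url) = @_;
    # Percent-encode every character that is neither in $URI::uric nor '#'.
    $url =~ s/([^$URI::uric#])/$URI::Escape::escapes{$1}/ge;
    return $url;
}

print escape_url('http://example.com/a b#frag'), "\n";
```

Note that this does none of the other cleanup (quotes, angle brackets,
authority, script: URLs), so it is not a replacement for fixurl on
untrusted input.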
--
Chris Nandor [EMAIL PROTECTED] http://pudge.net/
Open Source Development Network [EMAIL PROTECTED] http://osdn.com/