Re: url parsing in URI / HTTP::Request

2004-03-10 Thread Gisle Aas
Ulrich Wisser [EMAIL PROTECTED] writes:

 today I got an error code 400 (bad request) from my url checker. When
 I tested the url in my browser it worked fine. The url is
 
  http://www.leomajken.se?source=digdev
 
 I realize that there is a / missing after the domain name. I don't
 know if the problem is in URI or HTTP::Request. URI seems to accept
 the URL, but when I try to make an request I get the error code 400.
 
 Shouldn't that work?

It should.  This is a bug in LWP.  Here is a fix:

Index: lib/LWP/Protocol/http.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/Protocol/http.pm,v
retrieving revision 1.66
diff -u -p -r1.66 http.pm
--- lib/LWP/Protocol/http.pm    23 Oct 2003 19:11:33 -  1.66
+++ lib/LWP/Protocol/http.pm    10 Mar 2004 20:09:36 -
@@ -147,7 +147,7 @@ sub request
	$host = $url->host;
	$port = $url->port;
	$fullpath = $url->path_query;
-	$fullpath = "/" unless length $fullpath;
+	$fullpath = "/$fullpath" unless $fullpath =~ m,^/,;
 }
 
 # connect to remote site


Re: robot/ua-get..........FAILED tests 1-3, 5, 7

2004-03-11 Thread Gisle Aas
ALexander N. Treyner [EMAIL PROTECTED] writes:

 Could somebody help me to figure out what's wrong?

It means that your machine can't talk to itself, probably because the
hostname of your machine does not resolve to itself.  If you are on
a Unix system, then ping `hostname` needs to work.

--Gisle


Re: [rfc] HTTP::Multipart

2004-04-03 Thread Gisle Aas
Joshua Hoblitt [EMAIL PROTECTED] writes:

 I've been kicking around the idea for this module for a few days now
 and I'd like to commit it to code.  The module I'm proposing would
 be called HTTP::Multipart.  It would accept an HTTP::Response
 object and determine if it indeed does contain a multipart HTTP
 message.  If it does then the passed object would be cloned once for
 every part in the message and the 'Content-Length', 'Content-Type',
 and 'Content-Range' headers would be adjusted along with the
 content value to reflect one of the parts.  Then a list of
 non-multipart HTTP::Response objects would be returned.  I
 believe this would simplify handling multipart responses.
 
 1) Is this a good idea?

The use case for this seems a bit unclear to me.  How would you use
this module?  I don't understand what handling of multipart
responses requires.

 2) Is HTTP::Multipart a good name?

I think all that would be needed for this is a method on
HTTP::Message.  It could for instance be called 'parts'.  If the
method is not too long and generally useful then it should just go
into that module.

 3) Is it appropriate to require HTTP::Response objects?  Would
 just requiring objects to be ISA HTTP::Message or
 HTTP::Headers be better?

It is best not to require any specific class at all.  Just depend on
a certain interface, i.e. a set of methods to be implemented.

 4) Should there be an HTTP::Multipart object that contains a list
 of modified HTTP::Response objects, or would a class method be
 sufficient?

What you described above appears to be a simple function that takes one
HTTP::Response (or HTTP::Message) and breaks it into (possibly) many
smaller ones.  I don't see any need for an extra object or class here.

 5) if a class method is sufficient, what should its name be?
 (i.e., 'parase'?)

What does 'parase' mean?

Regards,
Gisle


Re: [rfc] HTTP::Multipart

2004-04-05 Thread Gisle Aas
Thinking some more.  This is what I think I would like to see.

We introduce the methods 'parent', 'parts' and 'add_part' to
HTTP::Message.

  $msg2 = $msg->parent

This attribute points back to the parent message.  If defined
it makes this message a message part belonging to the parent
message.  This attribute is set by the other methods described
below.

We might consider automatic delegation to the parent, but
I'm not sure how useful that would be.

  @parts = $msg->parts

This will return a list of HTTP::Message objects.  If the
content-type of $msg is not multipart/* or message/* then this
will return the empty list.  The returned message part objects
are read only (so that future versions can make it possible to
modify the parent by modifying the parts).

If the content-type of $msg is message/* then there will only
be one part.

If the content-type is message/http, then this will return either
an HTTP::Request or an HTTP::Response object.

  $msg->parts( @parts )
  $msg->parts( \@parts )

This will set the content of the message to be the provided list
of parts.  If the old content-type is not multipart/* or message/*
then it is set to multipart/mixed and other content-* headers are
cleared as well.  The part objects now belong to $msg and can not
be set to be parts of other messages, but clones can be made part
of other messages.  This method will croak if the provided parts
are not independent.

This method will croak if the content type is message/* and more
than one part is provided.

The array ref form is provided so that an empty list can be
provided without any special cases.

  $msg->add_part( $part )

This will add a part to a message.  If the old content-type is not
multipart/* then the old content (together with all content-*
headers) will be made part #1 and the content-type made
multipart/mixed before the new part is added.

  $part->clone

Will return an independent part object (i.e. the parent
attribute will always be cleared).  This ensures that
this works:

    $msg2->parts([map $_->clone, $msg1->parts]);

When the parts are updated via the parts() or add_part() method, then
a suitable boundary will be automatically created so that it is unique
(like HTTP::Request::Common currently does).  If the boundary is set
explicitly then it is kept and the user is responsible for ensuring
that the string --$boundary does not occur in the content of any
part.
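
Assuming the methods end up on HTTP::Message as sketched above, usage
could look like this (the message construction details are
illustrative, not part of the proposal; headers are given as plain
array references):

```perl
use HTTP::Message;
use HTTP::Response;

# Two independent single-part messages.
my $part1 = HTTP::Message->new([ "Content-Type" => "text/plain" ], "Hello");
my $part2 = HTTP::Message->new([ "Content-Type" => "text/html" ], "<p>Hello</p>");

my $msg = HTTP::Response->new(200, "OK");
$msg->parts($part1, $part2);    # content-type becomes multipart/mixed

for my $part ($msg->parts) {    # read-only part objects
    print $part->headers->content_type, "\n";
}
```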

The current HTTP::Message object also provides the 'protocol()' method,
which does not make sense for all parts.  This method should be moved
out or replicated in both HTTP::Request and HTTP::Response.

Regards,
Gisle


Re: [rfc] HTTP::Multipart

2004-04-05 Thread Gisle Aas
Paul Marquess [EMAIL PROTECTED] writes:

 Does this interface allow you manipulate nested multi-part messages?

Yes.  The parts() method on HTTP::Message returns HTTP::Message objects,
so there should not be any problem nesting this as you see fit.

The MIME::Entity class provides a method called parts_DFS that returns
all parts in depth-first-search order.  I don't see a need for it
in HTTP::Message, and it can easily be constructed from the parts()
method.

Regards,
Gisle


Re: Cookies Redirection

2004-04-05 Thread Gisle Aas
Paul Marquess [EMAIL PROTECTED] writes:

 This is from UserAgent::request (LWP 5.76) where it is dealing with a
 redirect response
 
   # These headers should never be forwarded
   $referral->remove_header('Host', 'Cookie');
   
 I've found that while writing a script to automate logging on to Yahoo Web
 mail, I've needed to change this behaviour in a private copy of
 UserAgent::request to retain the Cookies.

The reason the Cookie headers are removed is that they will be added
automatically again if the redirect goes to a place that requires
cookies.  This happens even if the redirect goes to the same place as
the original request.

 FYI, logging onto Yahoo involves
 dealing with a series of 302 responses. The first of these responses (from
 http://login.yahoo.com), is a 302 that redirects back to itself - this
 response has a Set-Cookie header that is needed to be applied to the
 redirection request to continue with the login.

That should just work.  If it does not, it is a bug.

 Apart from the fact that this behaviour is being used in the wild, my
 reading of RFC 2109 is that this use of a Set-Cookie is ok because the
 domain attribute in the Cookie still refers to .yahoo.com.

Can you provide a trace of the sequence of requests/responses that are
exchanged, and the content of the cookie_jar as this happens?

Regards,
Gisle


Re: Error when running LWP

2004-04-06 Thread Gisle Aas
Octavian Rasnita [EMAIL PROTECTED] writes:

 Hi all,
 
 I have received the following error when I tried to run a simple script that
 only downloads and prints a page.  The script runs fine under Windows, but it
 gives this error when running under Linux.
 Do you know what could be the cause of this error?  Please tell me how I can
 solve it.
 
 The error is:
 
 Can't locate auto/Compress/Zlib/autosplit.ix in @INC (@INC contains:
 /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0
 /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
 /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl
 /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
 /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl .) at
 /usr/lib/perl5/5.8.0/AutoLoader.pm line 158.
  at
 /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Compress/Zlib.pm
 line 16

It looks like Compress::Zlib is not properly installed on the system.
LWP will try to load it if it is available.  I bet you get a similar
error with:

   perl -MCompress::Zlib -e1

To fix this situation either remove 
/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Compress/Zlib.pm
or reinstall Compress::Zlib.

--Gisle


Re: Bug submitting large HTTP requests

2004-04-06 Thread Gisle Aas
Jamie Lokier [EMAIL PROTECTED] writes:

 The subroutine Net::HTTP::Methods::write_request calls print, but
 doesn't check the return value.
 
 It's a non-blocking socket, so it's quite normal for the print to do a
 short write if the string is very large -- larger than the socket
 transmit buffer.

I would believe that print should be responsible for handling short
writes itself.  On what system are you running, and what perl version
are you using?

What could make sense is to rewrite Net::HTTP so that it uses syswrite
all over the place instead.  With that we could easily handle short
writes ourselves.

Regards,
Gisle


Re: [PATCH] LWP::RobotUA case-sensitive check for Disallow

2004-04-06 Thread Gisle Aas
Liam Quinn [EMAIL PROTECTED] writes:

 LWP::RobotUA won't parse a robots.txt file if the file does not contain
 Disallow.  The check for Disallow is case sensitive, but according to
 the robot exclusion standard, field names are case insensitive.  This
 causes LWP::RobotUA to ignore some robots.txt files that it should parse.
 
 Attached is a patch that makes the check for Disallow case insensitive.  
 The patch is against libwww-perl 5.76 (RobotUA.pm 1.23).

Thanks! Applied as:

Index: lib/LWP/RobotUA.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/RobotUA.pm,v
retrieving revision 1.23
retrieving revision 1.24
diff -u -p -r1.23 -r1.24
--- lib/LWP/RobotUA.pm  24 Oct 2003 11:13:03 -  1.23
+++ lib/LWP/RobotUA.pm  6 Apr 2004 11:02:50 -   1.24
@@ -1,10 +1,10 @@
 package LWP::RobotUA;
 
-# $Id: RobotUA.pm,v 1.23 2003/10/24 11:13:03 gisle Exp $
+# $Id: RobotUA.pm,v 1.24 2004/04/06 11:02:50 gisle Exp $
 
 require LWP::UserAgent;
 @ISA = qw(LWP::UserAgent);
-$VERSION = sprintf("%d.%02d", q$Revision: 1.23 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf("%d.%02d", q$Revision: 1.24 $ =~ /(\d+)\.(\d+)/);
 
 require WWW::RobotRules;
 require HTTP::Request;
@@ -126,7 +126,7 @@ sub simple_request
	my $fresh_until = $robot_res->fresh_until;
	if ($robot_res->is_success) {
	    my $c = $robot_res->content;
-	    if ($robot_res->content_type =~ m,^text/, && $c =~ /Disallow/) {
+	    if ($robot_res->content_type =~ m,^text/, && $c =~ /^Disallow\s*:/mi) {
		LWP::Debug::debug("Parsing robot rules");
		$self->{'rules'}->parse($robot_url, $c, $fresh_until);
	    }

 
 -- 
 Liam Quinn
 
 
 
 --- LWP/RobotUA.pm.orig   2003-10-24 07:13:03.0 -0400
 +++ LWP/RobotUA.pm    2004-04-03 17:59:04.0 -0500
 @@ -126,7 +126,7 @@
 	my $fresh_until = $robot_res->fresh_until;
 	if ($robot_res->is_success) {
 	    my $c = $robot_res->content;
 -	    if ($robot_res->content_type =~ m,^text/, && $c =~ /Disallow/) {
 +	    if ($robot_res->content_type =~ m,^text/, && $c =~ /Disallow/i) {
 		LWP::Debug::debug("Parsing robot rules");
 		$self->{'rules'}->parse($robot_url, $c, $fresh_until);
 	    }


Re: [PATCH] WWW::RobotRules user-agent matching

2004-04-06 Thread Gisle Aas
Liam Quinn [EMAIL PROTECTED] writes:

 WWW::RobotRules attempts to trim the robot's User-Agent before comparing 
 it with the User-agent field of a robots.txt file:
 
 # Strip it so that it's just the short name.
 # I.e., "FooBot"                                      => "FooBot"
 #       "FooBot/1.2"                                  => "FooBot"
 #       "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]" => "FooBot"
 
 delete $self->{'loc'};   # all old info is now stale
 $name = $1 if $name =~ m/(\S+)/; # get first word
 $name =~ s!/?\s*\d+.\d+\s*$!!;  # loose version
 
 My robot's name is WDG_SiteValidator/1.5.6.  The above code changes the
 name to WDG_SiteValidator/1., which causes it not to match a robots.txt
 User-agent field of WDG_SiteValidator.
 
 I've attached a patch against libwww-perl 5.76 (WWW::RobotRules 1.26) that
 replaces the last line above with
 
 $name =~ s!/.*!!;  # lose version
 
 which seems to cover the various cases correctly.

Agree.  Patch applied.  Thanks!

Regards,
Gisle


 --- WWW/RobotRules.pm.orig    2003-10-23 15:11:33.0 -0400
 +++ WWW/RobotRules.pm 2004-04-03 18:06:01.0 -0500
 @@ -187,7 +187,7 @@
  
 	delete $self->{'loc'};   # all old info is now stale
 	$name = $1 if $name =~ m/(\S+)/; # get first word
 -	$name =~ s!/?\s*\d+.\d+\s*$!!;  # loose version
 +	$name =~ s!/.*!!;  # lose version
 	$self->{'ua'} = $name;
  }
  $old;


Re: Suggest change to WWW::RobotRules

2004-04-06 Thread Gisle Aas
Craig Macdonald [EMAIL PROTECTED] writes:

 Hi, just a short note to suggest a 1-line change to WWW::RobotRules.
 
 When loading, http://www.maths.gla.ac.uk/robots.txt I noticed
 WWW::RobotRules giving me warnings:
 
 RobotRules: Unexpected line:  User-agent: *
 RobotRules: Unexpected line:  Disallow: /error/
 RobotRules: Unexpected line:  Disallow: /tla_review/
 
 etc.
 
 The problem is that WWW::RobotRules doesn't support leading space on a
 robots.txt line. As such, I would suggest adding
 s/^\s*//;
 at line 51 of RobotRules.pm.
 
 I'm not sure how frequent a problem this might be, but it seems
 important to make WWW::RobotRules as robust at parsing robots.txt files
 as possible, in order to prevent parts of sites being crawled that
 shouldn't be.

The spec at http://www.robotstxt.org/wc/norobots.html states that
leading space is not allowed, but I agree that LWP should be a bit
more liberal when parsing.  I've now applied the following patch.

Regards,
Gisle


Index: lib/LWP/RobotUA.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/RobotUA.pm,v
retrieving revision 1.24
diff -u -p -r1.24 RobotUA.pm
--- lib/LWP/RobotUA.pm  6 Apr 2004 11:02:50 -   1.24
+++ lib/LWP/RobotUA.pm  6 Apr 2004 11:36:10 -
@@ -126,7 +126,7 @@ sub simple_request
	my $fresh_until = $robot_res->fresh_until;
	if ($robot_res->is_success) {
	    my $c = $robot_res->content;
-	    if ($robot_res->content_type =~ m,^text/, && $c =~ /^Disallow\s*:/mi) {
+	    if ($robot_res->content_type =~ m,^text/, && $c =~ /^\s*Disallow\s*:/mi) {
		LWP::Debug::debug("Parsing robot rules");
		$self->{'rules'}->parse($robot_url, $c, $fresh_until);
	    }
Index: lib/WWW/RobotRules.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.28
diff -u -p -r1.28 RobotRules.pm
--- lib/WWW/RobotRules.pm   6 Apr 2004 11:10:49 -   1.28
+++ lib/WWW/RobotRules.pm   6 Apr 2004 11:36:11 -
@@ -54,7 +54,7 @@ sub parse {
	    last if $is_me; # That was our record. No need to read the rest.
	    $is_anon = 0;
	}
-	elsif (/^User-Agent:\s*(.*)/i) {
+	elsif (/^\s*User-Agent\s*:\s*(.*)/i) {
	    $ua = $1;
	    $ua =~ s/\s+$//;
	    if ($is_me) {
@@ -68,7 +68,7 @@ sub parse {
		$is_me = 1;
	    }
	}
-	elsif (/^Disallow\s*:\s*(.*)/i) {
+	elsif (/^\s*Disallow\s*:\s*(.*)/i) {
	    unless (defined $ua) {
		warn "RobotRules: Disallow without preceding User-agent\n";
		$is_anon = 1;  # assume that User-agent: * was intended


Re: [PATCH] redirection in LWP::Simple

2004-04-06 Thread Gisle Aas
Ward Vandewege [EMAIL PROTECTED] writes:

 I had some trouble using LWP::Simple (v1.36 from Debian's libwww-perl package
 version 5.69-4) with this url:
 
   http://www.tvgids.nl/
 
 It turns out that site does an immediate redirect when hitting that url. The
 webserver seems to be broken because it writes 'location:' instead of
 'Location:' in the HTTP headers.
 
 The latest LWP::Simple version (v1.38 from libwww-perl 5.76) does not
 understand 'location' with lower-case first letter either.
 
 The patch below (against v1.38) fixes LWP::Simple to accept a lowercase
 'location' header. 
 
 In the mindset of 'Be liberal in what you receive, and conservative in what
 you send', is this worth adding to libwww-perl?

It sure is.  Now applied.  Thanks!

Regards,
Gisle

 
 Thanks,
 Ward Vandewege. 
 
 --- Simple.pm   2003-12-31 14:15:59.0 -0500
 +++ Simple.pm   2003-12-31 14:16:24.0 -0500
 @@ -180,7 +180,7 @@
     if ($buf =~ m,^HTTP/\d+\.\d+\s+(\d+)[^\012]*\012,) {
 	my $code = $1;
 	#print "CODE=$code\n$buf\n";
 -	if ($code =~ /^30[1237]/ && $buf =~ /\012Location:\s*(\S+)/) {
 +	if ($code =~ /^30[1237]/ && $buf =~ /\012Location:\s*(\S+)/i) {
 	    # redirect
 	    my $url = $1;
 	    return undef if $loop_check{$url}++;
 


Re: Bug submitting large HTTP requests

2004-04-06 Thread Gisle Aas
Jamie Lokier [EMAIL PROTECTED] writes:

 Gisle Aas wrote:
   The subroutine Net::HTTP::Methods::write_request calls print, but
   doesn't check the return value.
   
   It's a non-blocking socket, so it's quite normal for the print to do a
   short write if the string is very large -- larger than the socket
   transmit buffer.
  
  I would believe that print should be responsible for handling short
  writes itself.  On what system are you running, and what perl version
  are you using?
 
 Red Hat 9, perl-5.8.0-88.3.
 
 print normally does handle short writes and keeps writing until it's
 done the whole string.  However, it will stop when it gets an error
 code, and it does: EAGAIN, because the socket transmit buffer is full
 and it's non-blocking.

Yes.  That's a problem, but it might be argued that the user of
$http->write_request() is responsible for checking for the error.  The
method will return FALSE on error and set $! like print :)

  What could make sense is to rewrite Net::HTTP so that it uses syswrite
  all over the place instead.  With that we could easily handle short
  writes ourselves.
 
 It's not the short writes as such, it's the EAGAINs.

This problem is exactly why LWP::Protocol::http never uses
write_request() itself, but calls format_request() and then uses
syswrite() to get the bytes out on the wire.
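
A sketch of the kind of syswrite loop that handles short writes and
EAGAIN explicitly (the in-memory handle stands in for a socket; a real
non-blocking socket version would also wait for writability, e.g. with
select, before retrying):

```perl
use strict;
use warnings;
use Errno qw(EAGAIN EWOULDBLOCK);

my $data = "x" x 10_000;
open(my $fh, ">", \my $written) or die $!;   # stand-in for a socket

my $off = 0;
while ($off < length $data) {
    my $n = syswrite($fh, $data, length($data) - $off, $off);
    unless (defined $n) {
        # On a non-blocking socket: wait for writability, then retry.
        next if $! == EAGAIN || $! == EWOULDBLOCK;
        die "write failed: $!";
    }
    $off += $n;   # advance past the bytes actually written
}
close $fh;
```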

Regards,
Gisle


libwww-perl-5.77

2004-04-06 Thread Gisle Aas
I've been going through the backlog in my LWP folder today and managed
to apply some of the patches found there.  I now have to return to my
real work, but I still have lots of email I did not find time to look
into.  The result so far has just been uploaded to CPAN as
libwww-perl-5.77.  Feel free to remind me of important patches
missing, especially if the patch also comes with updates to the test
suite and documentation.

These are the changes since version 5.76:

LWP::Simple did not handle redirects properly when the Location
header used uncommon letter casing.
Patch by Ward Vandewege [EMAIL PROTECTED].

LWP::UserAgent passed the wrong request to redirect_ok().
Patch by Ville Skyttä [EMAIL PROTECTED].
https://rt.cpan.org/Ticket/Display.html?id=5828

LWP did not handle URLs like http://www.example.com?foo=bar
properly.

The LWP::RobotUA constructor now accepts key/value arguments in the
same way as LWP::UserAgent.
Based on patch by Andy Lester [EMAIL PROTECTED].

LWP::RobotUA did not parse robots.txt files that contained
Disallow: using uncommon letter casing.
Patch by Liam Quinn [EMAIL PROTECTED].

WWW::RobotRules now allows leading space when parsing robots.txt
files, as suggested by Craig Macdonald [EMAIL PROTECTED].
We now also allow space before the colon.

WWW::RobotRules did not handle User-Agent names that use complex
version numbers.  Patch by Liam Quinn [EMAIL PROTECTED].

Case insensitive handling of hosts and domain names
in HTTP::Cookies.
https://rt.cpan.org/Ticket/Display.html?id=4530

The bundled media.types file now matches video/quicktime
with the .mov extension, as suggested by Michel Koppelaar
[EMAIL PROTECTED].

Experimental support for composite messages, currently
implemented by the HTTP::MessageParts module.  Based on
ideas from Joshua Hoblitt [EMAIL PROTECTED].

Fixed libscan in Makefile.PL.
Patch by Andy Lester [EMAIL PROTECTED].

The HTTP::Message constructor now accepts a plain array reference
as its $headers argument.

The return value of the HTTP::Message as_string() method now
conforms better to the HTTP wire layout.  No additional \n
is appended to the as_string value for HTTP::Request and
HTTP::Response.  The HTTP::Request as_string now replaces a missing
method or URI with - instead of [NO METHOD] and [NO URI].
We don't want values with spaces in them, because that makes them
harder to parse.

Enjoy!

Regards,
Gisle


Re: Latest LWP fails tests

2004-04-08 Thread Gisle Aas
Scott R. Godin [EMAIL PROTECTED] writes:

 It seems to require Data::Dump which I do not have installed.

libwww-perl-5.78 has been uploaded.  It fixes this problem.

Regards,
Gisle


Re: libwww-perl-5.77

2004-04-09 Thread Gisle Aas
Gisle Aas [EMAIL PROTECTED] writes:

 [EMAIL PROTECTED] (François Pons) writes:
 
  Gisle Aas [EMAIL PROTECTED] writes:
  
   I've been going through the backlog in my LWP folder today and managed
   to apply some of the patches found there.  I now have to return to my
   real work, but I still have lots of email I did not find time to look
   into.  The result so far has just been uploaded to CPAN as
   libwww-perl-5.77.  Feel free to remind me of important patches
   missing, especially if the patch also comes with updates to the test
   suite and documentation.
  
  I wonder about the code in HTML::Form that handles forms with inputs in the
  disabled state, which are enabled back using JavaScript code.  This is a
  simple modification but is not handled; I will agree there is no RFC allowing
  this, which makes it more of a hack than anything else, but nothing allows us
  (as I see it) to get the disabled input back live.
  
  Are there any reasons not to use this very simple patch?
 
 I think the patch is wrong.  The data from a disabled input should not
 be sent back unless it is enabled.  Your patch effectively always
 enables them.  To get this right we would have to track
 enabledness and then provide a way of tweaking this attribute.

Here is a patch implementing this.  It exposes the 'readonly' and
'disabled' attributes for form inputs.  The patch has been applied :)

Regards,
Gisle


Index: lib/HTML/Form.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTML/Form.pm,v
retrieving revision 1.38
diff -u -p -r1.38 Form.pm
--- lib/HTML/Form.pm    23 Oct 2003 19:11:32 -  1.38
+++ lib/HTML/Form.pm    9 Apr 2004 14:14:30 -
@@ -188,10 +188,11 @@ sub push_input
	Carp::carp("Unknown input type '$type'") if $^W;
	$class = "TextInput";
     }
-    $class = "IgnoreInput" if exists $attr->{disabled};
     $class = "HTML::Form::$class";
+    my @extra;
+    push(@extra, readonly => 1) if $type eq "hidden";
 
-    my $input = $class->new(type => $type, %$attr);
+    my $input = $class->new(type => $type, %$attr, @extra);
     $input->add_to_form($self);
 }
 
@@ -769,6 +770,41 @@ sub value_names {
     return
 }
 
+=item $bool = $input->readonly
+
+=item $input->readonly( $bool )
+
+This method is used to get/set the value of the readonly attribute.
+You are allowed to modify the value of readonly inputs, but setting
+the value will generate some noise when warnings are enabled.  Hidden
+fields always start out readonly.
+
+=cut
+
+sub readonly {
+    my $self = shift;
+    my $old = $self->{readonly};
+    $self->{readonly} = shift if @_;
+    $old;
+}
+
+=item $bool = $input->disabled
+
+=item $input->disabled( $bool )
+
+This method is used to get/set the value of the disabled attribute.
+Disabled inputs do not contribute any key/value pairs for the form
+value.
+
+=cut
+
+sub disabled {
+    my $self = shift;
+    my $old = $self->{disabled};
+    $self->{disabled} = shift if @_;
+    $old;
+}
+
 =item $input->form_name_value
 
 Returns a (possibly empty) list of key/value pairs that should be
@@ -781,6 +817,7 @@ sub form_name_value
     my $self = shift;
     my $name = $self->{'name'};
     return unless defined $name;
+    return if $self->{disabled};
     my $value = $self->value;
     return unless defined $value;
     return ($name => $value);
@@ -833,9 +870,8 @@ sub value
     my $old = $self->{value};
     $old = "" unless defined $old;
     if (@_) {
-	if (exists($self->{readonly}) || $self->{type} eq "hidden") {
-	    Carp::carp("Input '$self->{name}' is readonly") if $^W;
-	}
+        Carp::carp("Input '$self->{name}' is readonly")
+	    if $^W && $self->{readonly};
	$self->{value} = shift;
     }
     $old;
@@ -1068,6 +1104,7 @@ sub form_name_value
     return unless $clicked;
     my $name = $self->{name};
     return unless defined $name;
+    return if $self->{disabled};
     return ($name.".x" => $clicked->[0],
	    $name.".y" => $clicked->[1]
	   );
@@ -1154,6 +1191,7 @@ sub form_name_value {
 
     my $name = $self->name;
     return unless defined $name;
+    return if $self->{disabled};
 
     my $file = $self->file;
     my $filename = $self->filename;
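
With the patch applied, simulating the JavaScript re-enabling from the
earlier question could look like this sketch (the form HTML and field
name are made up):

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical form whose input the page's JavaScript would re-enable.
my $html = <<'EOT';
<form action="/search" method="GET">
<input name="q" value="perl" disabled>
<input type="submit" name="go" value="Go">
</form>
EOT

my ($form) = HTML::Form->parse($html, "http://www.example.com/");
my $input  = $form->find_input("q");

$input->disabled(0);        # simulate the JavaScript enabling the field
my $req = $form->click;     # the request query now carries q=perl
print $req->uri, "\n";
```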


Using Content-Location as base

2004-04-09 Thread Gisle Aas
---BeginMessage---
Apologies if this is not an appropriate place to report issues
with libwww  - if which case if you could let me know a better
address I'd be very grateful.

I've noticed at least one case where $response->base does not
match what would be set by a normal web browser.

For the url http://www.stateline.org/stateline/ the HTTP headers
returned are:

HTTP/1.1 200 OK
Date: Tue, 20 Jan 2004 16:28:28 GMT
Server: Orion/1.5.2
Content-Location: http://www.stateline.org:9090/jsp/staticSite/index2.jsp
Set-Cookie: JSESSIONID=KPDJDBGMOFOL; Domain=.stateline.org; Path=/
Cache-Control: private
Connection: Close
Content-Type: text/html
Transfer-Encoding: chunked

From this $response->base is set to
http://www.stateline.org:9090/jsp/staticSite/index2.jsp
which means any relative URIs start with
http://www.stateline.org:9090/

Unfortunately the server is not listening on 9090 (or more likely
firewalled), so attempts to download any links fail.

Normal web browsers do not set port 9090 in the base so can
access links and content without problem.

Trivial testlink script, run with
   testlink http://www.stateline.org/stateline/

Thanks

#!/usr/pkg/bin/perl -wT

use strict;
use LWP;

my $browser = LWP::UserAgent->new(agent => 'Mozilla/5.0');
my $response = $browser->get($ARGV[0]);

if ($response->is_success && $response->content_type eq 'text/html')
{
    my $base = $response->base;
    my $data = $response->content;
    print "Base: $base\n";

    while ($data =~ s/.*?\b(src|link\b[^>]*\s+href)\s*=\s*"([^"]+)"//is)
    {
        my $link = URI->new_abs($2, $base);
        print "Link: $link\n";
    }
}



-- 
   David Brownlee -- [EMAIL PROTECTED]


---End Message---


Re: multi part form posts

2004-04-10 Thread Gisle Aas
petersm [EMAIL PROTECTED] writes:

 I am new to LWP and WWW::Mech and have used them on a couple of
 projects. I was wondering what is the best way to do a multipart
 post (file upload) using LWP.

From LWP you just do a post with something like:

  $ua->post('http://www.example.com',
      content_type => 'form-data',
      content => [
          foo  => 1,
          file => ["foo.txt"],
      ]);

More details available by reading the HTTP::Request::Common manpage.

Regards,
Gisle


libwww-perl-5.79

2004-04-13 Thread Gisle Aas
Another release has been uploaded to CPAN with quite a few
enhancements from Ville, and then some HTTP::Headers hacks by me.
These are the changes since 5.78:

HTML::Form now exposes the 'readonly' and 'disabled'
attributes for inputs.  This allows your program to simulate
JavaScript code that modifies these attributes.

RFC 2616 says that an http: referer should not be sent with
https: requests.  The lwp-rget program, the $req->referer method
and the redirect handling code now try to enforce this.
Patch by Ville Skyttä [EMAIL PROTECTED].

WWW::RobotRules now looks for the string found in
robots.txt as a case-insensitive substring of its own
User-Agent string, not the other way around.
Patch by Ville Skyttä [EMAIL PROTECTED].

HTTP::Headers: New method 'header_field_names' that
returns a list of names, as suggested by its name.

HTTP::Headers: $h->remove_content_headers will now
also remove the headers Allow, Expires and
Last-Modified.  These are also part of the set
that RFC 2616 denotes as Entity Header Fields.

HTTP::Headers: $h->content_type is now more careful
about removing embedded space in the returned value.
It also now returns all the parameters as the second
return value, as documented.

HTTP::Headers: $h->header() now croaks.  It used to
silently do nothing.

HTTP::Headers: Documentation tweaks.  Documented a
few bugs discovered during testing.

Typo fixes to the documentation all over the place
by Ville Skyttä [EMAIL PROTECTED].

Updated tests.

and since 5.78 was not really announced these are the changes applied
to 5.77 to make it 5.78:

Removed stray Data::Dump reference from test suite.

Added the parse(), clear(), parts() and add_part() methods to
HTTP::Message.  The HTTP::MessageParts module of 5.77 is no more.

Added clear() and remove_content_headers() methods to
HTTP::Headers.

The as_string() method of HTTP::Message now appends a newline
if called without arguments and the non-empty content does
not end with a newline.  This ensures better compatibility with
5.76 and older versions of libwww-perl.

Use case-insensitive lookup of hostname in $ua->credentials.
Patch by Andrew Pimlott [EMAIL PROTECTED].

Enjoy!

Regards,
Gisle


Re: getting webpage from different server than the url points to?

2004-05-10 Thread Gisle Aas
hubert depesz lubaczewski [EMAIL PROTECTED] writes:

 Charles C. Fu wrote:
  If 10.2.1.7 complies even even minimally with HTTP/1.1, then you
  can force requests to be sent to it by setting 10.2.1.7 to be your
  proxy server.  If limiting yourself to LWP::Simple, then the proxy
  server is set through environment variables (e.g., set http_proxy
  to http://10.2.1.7/).  See the LWP::UserAgent man page for more
  details.
 
 i'm not limiting myself to anything.  right now i did it using plain
 sockets.  in fact i was not thinking about using the webserver as a proxy,
 and for some reason i find this idea rather unpleasant.  i would
 just like to be able to send the request someplace else - without all
 these proxy things.

There is basically nothing more to the proxy concept than the fact
that you send the request someplace else.

Another way of doing it is to plug in an alternative
LWP::Protocol::http module that, for instance, picks up the IP address
from a request header.

Or you can try this:

  local @LWP::Protocol::http::EXTRA_SOCK_OPTS =
      (PeerAddr => "10.2.1.7");
  print $ua->get("http://www.example.com/foo");

Regards,
Gisle


Re: HTML::TreeBuilder and lwp-request

2004-04-27 Thread Gisle Aas
Jacinta Richardson [EMAIL PROTECTED] writes:

 I've noticed that HTML::TreeBuilder is a subclass of HTML::Parser and that
 HTML::Parser is required by LWP although it doesn't appear that
 HTML::TreeBuilder is.
 
 I've recently noticed that the /usr/bin tools POST, GET, HEAD and
 lwp-request provided by LWP are dependent on the deprecated HTML::Parse
 module from the HTML::Tree package.  I presume that at some point LWP
 moved away from HTML::Parse but these tools were forgotten.  These tools
 fail to work in certain situations without this module being installed
 (with "HTML::Parse isn't in @INC" errors) but do not mention this
 dependency in their documentation.
 
 I bring this up because I had a question today which mentioned that this
 person's bash script worked perfectly up until when he tried to pass the
 -o switch to lwp-request.  He didn't understand Perl and didn't understand
 the @INC error message.  I don't think he should have had to just to use
 this tool.
 
 I have a few questions:
 
 Is HTML::TreeBuilder only required for these tools or does it appear in
 other parts of the distribution?  

This is the only place.

 Is there any specific design decision to leave HTML::TreeBuilder out of
 the list of required modules?  

Just because we want to limit the number of dependencies.  It is
pretty obscure that additional modules are required when you use the -o
option of lwp-request.  If you know Perl, it should be pretty obvious
what is wrong if you fail to have the module installed.

Note that extra HTML::Format* modules might also be needed by -o.

 Is there someone actively maintaining these tools who I should consult
 before patching them to not use HTML::Parse and to test (reporting failure
 reasonably) that modules exist before requiring them?

Send suggested patches to this mailing list.

 Is there a reason why all four of these files appear to be identical but
 they're not installed as hard links? 

I have not been able to convince MakeMaker to do this.  At some time
we tried to install the GET, HEAD, POST aliases as symlinks, but it
never worked properly.

Regards,
Gisle


Re: HTML-Parser

2004-05-03 Thread Gisle Aas
matthew zip [EMAIL PROTECTED] writes:

 Having problems getting this module to work with my new Perl 5.8.4 on
 Linux.
 
 I followed the instructions but when I attempt to use HTML/LinkExtor I
 get:
 
 HTML::Parser object version 3.36 does not match bootstrap parameter 3.26
 at /usr/lib/perl5/5.8.4/i686-linux/DynaLoader.pm line 253.
 
 
 Is this package compatible with Perl 5.8?

It sure is.  Your installation seems to be mixing incompatible
versions of the HTML/Parser.so and HTML/Parser.pm file.  This should
not happen if you let 'make install' install the module.  I would try
to reinstall HTML-Parser.

Regards,
Gisle


Re: :mechanize issues/mechanize.pm dies!!

2004-06-02 Thread Gisle Aas
Darrell Gammill [EMAIL PROTECTED] writes:

 Look back at the output of 'print $b->current_form()->dump();'  Do you
 see where the option for 'Anthropology' appears by itself?  This is
 because the HTML is not being parsed right.  The following line seems to
 be the offender:
 
   <option value="ANT"
 Name="Anthropology">Anthropology</option>
 
 The 'Name' attribute seems to be confusing the form parser, so
 Anthropology is not one of the available options.

I don't believe that this can confuse HTML::Form.  It does not care
about the Name attribute at all.  Care to explain better what you
think happens here?

Regards,
Gisle Aas


Re: :mechanize issues/mechanize.pm dies!!

2004-06-03 Thread Gisle Aas
Darrell Gammill [EMAIL PROTECTED] writes:

 The 'Anthropology' option is being interpreted as its own separate input
 rather then part of the 'u_input' input.  To test this, I used the
 section of code below with the results right after it.

Thanks for the test case.  This is a bug in HTML::Form.  The 'name'
from the option tag overrides the 'name' from the select tag when it
should not.  We also get in trouble with (illegal) option attributes
like 'disabled', 'multiple', 'type' etc.

The following patch fixes these problems.  It will be in the next
libwww-perl.

Regards,
Gisle

Index: Form.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTML/Form.pm,v
retrieving revision 1.39
diff -u -p -r1.39 Form.pm
--- Form.pm 9 Apr 2004 14:17:32 -0000   1.39
+++ Form.pm 3 Jun 2004 09:13:44 -0000
@@ -136,15 +136,26 @@ sub parse
            $f->push_input("textarea", $attr);
        }
        elsif ($tag eq "select") {
-           $attr->{select_value} = $attr->{value}
-               if exists $attr->{value};
+           # rename attributes reserved to come from the option tag
+           for ("value", "value_name") {
+               $attr->{"select_$_"} = delete $attr->{$_}
+                   if exists $attr->{$_};
+           }
            while ($t = $p->get_tag) {
                my $tag = shift @$t;
                last if $tag eq "/select";
                next if $tag =~ m,/?optgroup,;
                next if $tag eq "/option";
                if ($tag eq "option") {
-                   my %a = (%$attr, %{$t->[0]});
+                   my %a = %{$t->[0]};
+                   # rename keys so they don't clash with %attr
+                   for (keys %a) {
+                       next if $_ eq "value";
+                       $a{"option_$_"} = delete $a{$_};
+                   }
+                   while (my($k,$v) = each %$attr) {
+                       $a{$k} = $v;
+                   }
                    $a{value_name} = $p->get_trimmed_text;
                    $a{value} = delete $a{value_name}
                        unless defined $a{value};
@@ -192,6 +203,7 @@ sub push_input
     my @extra;
     push(@extra, readonly => 1) if $type eq "hidden";
 
+    delete $attr->{type};  # don't confuse the "type" argument
     my $input = $class->new(type => $type, %$attr, @extra);
     $input->add_to_form($self);
 }
@@ -913,9 +925,9 @@ sub new
     }
     else {
        $self->{menu} = [$value];
-       my $checked = exists $self->{checked} || exists $self->{selected};
+       my $checked = exists $self->{checked} || exists $self->{option_selected};
        delete $self->{checked};
-       delete $self->{selected};
+       delete $self->{option_selected};
        if (exists $self->{multiple}) {
            unshift(@{$self->{menu}}, undef);
            $self->{value_names} = ["off", $value_name];


Re: [PATCH] Make URI::sip honor the new_abs(), abs(), rel() contract

2004-06-03 Thread Gisle Aas
Ville Skyttä [EMAIL PROTECTED] writes:

 URI::sip(s) does not honor the URI API contract of returning the
 original URI if it cannot be made absolute in new_abs() or abs(), or
 relative in rel().  Fix along with a couple of test cases attached.

Applied. Thanks!

Regards,
Gisle


Re: libwww-perl: Patch to support not sending Content-Length...

2004-06-03 Thread Gisle Aas
Matt Christian [EMAIL PROTECTED] writes:

  This patch kills the Content-Length both for the request itself and
  for the multipart/* parts.  I think only the latter is what you really
  want.
 
  I think the better fix is to simply remove the content-length for the
  parts.  There is probably nothing that really requires them, even
  though they ought to be harmless.
 
 Yes, that was on purpose.  The broken web server I need to interact with
 doesn't understand Content-Length for the request or multipart/* parts.
 If I send *any* Content-Length headers, it dies with a 5xx error.

But the Content-Length header for the request itself will be added by
the protocol handler if the request does not have any.  It means that
not adding the Content-Length to the request itself should make no
difference for the server.

  I don't want really want to introduce yet another ugly global.
 
 I don't like the ugly global either so I'm open to suggestions on how to
 better handle it.  Maybe add another option to
 LWP::UserAgent-new(%options) ?  Would that be preferred?

No.  POST() can be called without using LWP::UserAgent at all.

 What are the chances of a (possibly modified) version of my patch making
 it into libwww-perl proper?  I'm open to suggestions...

I'm willing to apply the following patch if you can confirm that it
fixes your problem.

Index: lib/HTTP/Request/Common.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Request/Common.pm,v
retrieving revision 1.22
diff -u -p -r1.22 Common.pm
--- lib/HTTP/Request/Common.pm  23 Oct 2003 19:11:32 -0000  1.22
+++ lib/HTTP/Request/Common.pm  3 Jun 2004 13:31:05 -0000
@@ -152,7 +152,6 @@ sub form_data   # RFC1867
        local($/) = undef; # slurp files
        $content = <$fh>;
        close($fh);
-       $h->header("Content-Length" => length($content));
    }
    unless ($ct) {
        require LWP::MediaTypes;


Regards,
Gisle


Re: [patch] HTTP::Message-is_multipart

2004-06-08 Thread Gisle Aas
Joshua Hoblitt [EMAIL PROTECTED] writes:

 After writing this bit of ugly code...
 
 if ( $res->can( 'parts' ) ) {
     die "multipart messages are not supported" unless scalar @{[ $res->parts ]} <= 1;
 }
 
 I decided that an is_multipart method might be handy.  Would anyone
 else find this functionality useful?

I don't really like it.  I would have expected a method like
$res->is_multipart to actually test for $res->content_type =~
m,^multipart/,.

Seems like I should have made 'parts' return the number of parts in
scalar context instead of the first one.  That would be more useful
here.  To stay compatible it seems like the best route is to add a
method called 'num_parts', but it is not clear to me why you want to
handle multipart messages with one part but not those with more.  If
the need for testing the number of parts is not a common use case I
think it is better to leave this method out.

Another approach for you is to simply put this sub into your app:

sub HTTP::Message::have_many_parts {
    my $self = shift;
    return 0 unless $self->can('parts');
    return @{[ $self->parts ]} > 1;
}

and then you can write:

die "multipart messages are not supported" if $res->have_many_parts;

Regards,
Gisle


Re: [patch] HTTP::Message-is_multipart

2004-06-08 Thread Gisle Aas
Joshua Hoblitt [EMAIL PROTECTED] writes:

  Seems like I should have made 'parts' return the number of parts in
  scalar context instead of the first one.  That would be more useful
  here.  To stay compatible it seems like the best route is to add a
  method called 'num_parts', but it is not clear to me why you want to
  handle multipart messages with one part but not those with more.  If
  the need for testing the number of parts is not a common use case I
  think it is better to leave this method out.
 
 I had some discussion about this on freenode/#perl before submitting
 the patch.  Everyone asked why the parts count wasn't returned in
 scalar context. :) I had wanted to maintain backwards compatibility
 but, now that I think about it, I doubt many are using that method
 in scalar context.  Why don't you just fix the behavior now?

Because I don't know if anybody is using that method in scalar context
and I don't want to break published APIs.  It might be unlikely that
any code actually breaks, but the benefit of doing this change is
also very small.

Why isn't parts count returned in scalar context?  Because I felt that
it would be more useful to not have to force array context when you
want to extract the single part of a 'message/*' message

   if ($res->content_type =~ m,^message/,) {
       if (my $part = $res->parts) {
           # do something with the part
           ...
       }
   }

and if there was a strong demand for getting the number of parts we
could always add a method for that purpose.  If I redid this now I
think I would make 'parts' return the number and then add a 'part'
method (without the 's') that always returns the first part regardless
of context.

Regards,
Gisle


Re: HTTP::Message, setting content with a ref

2004-06-09 Thread Gisle Aas
Joshua Hoblitt [EMAIL PROTECTED] writes:

 I would like the ability to set the content of an HTTP::Message
 object by passing in a ref to scalar.  This would be a 1 x content
 savings of memory, which can be significant for large messages.
 
 This would require some re-plumbing so that $mess->{_content} becomes
 a ref to a scalar (instead of a scalar) and the addition of a mutator,
 eg. $mess->set_content_ref.  It's too bad that lvalues are still
 problematic.
 
 Comments?

The LWP API does not use set_ methods or lvalues.  The value of the
content_ref attribute would be updated if you pass an argument to the
method.

I think this is a good idea since we already have the content_ref
method.  I tried to implement it too since I thought it would be
trivial.  The change got a lot bigger than trivial before I was happy
with how this interacted with the 'parts*' methods.

This is the patch I ended up with.  It is likely to be part of the
next LWP release.

Regards,
Gisle


Index: lib/HTTP/Message.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Message.pm,v
retrieving revision 1.42
diff -u -p -r1.42 Message.pm
--- lib/HTTP/Message.pm 9 Apr 2004 15:07:04 -0000   1.42
+++ lib/HTTP/Message.pm 9 Jun 2004 10:53:50 -0000
@@ -75,7 +75,7 @@ sub clone
 sub clear {
     my $self = shift;
     $self->{_headers}->clear;
-    $self->{_content} = "";
+    $self->content("");
     delete $self->{_parts};
     return;
 }
@@ -84,16 +84,33 @@ sub clear {
 sub protocol { shift->_elem('_protocol',  @_); }
 
 sub content  {
-    my $self = shift;
-    if (defined(wantarray) && !exists $self->{_content}) {
-       $self->_content;
+
+    my $self = $_[0];
+    if (defined(wantarray)) {
+       $self->_content unless exists $self->{_content};
+       my $old = $self->{_content};
+       &_set_content if @_ > 1;
+       $old = $$old if ref($old) eq "SCALAR";
+       return $old;
     }
-    my $old = $self->{_content};
-    if (@_) {
-       $self->{_content} = shift;
-       delete $self->{_parts};
+
+    if (@_ > 1) {
+       &_set_content;
+    }
+    else {
+       Carp::carp("Useless content call in void context") if $^W;
     }
-    $old;
+}
+
+sub _set_content {
+    my $self = $_[0];
+    if (ref($self->{_content}) eq "SCALAR") {
+       ${$self->{_content}} = $_[1];
+    }
+    else {
+       $self->{_content} = $_[1];
+    }
+    delete $self->{_parts} unless $_[2];
 }
 
 
@@ -101,11 +118,18 @@ sub add_content
 {
     my $self = shift;
     $self->_content unless exists $self->{_content};
-    if (ref($_[0])) {
-       $self->{'_content'} .= ${$_[0]};  # for backwards compatability
+    my $chunkref = \$_[0];
+    $chunkref = $$chunkref if ref($$chunkref);  # legacy
+
+    my $ref = ref($self->{_content});
+    if (!$ref) {
+       $self->{_content} .= $$chunkref;
+    }
+    elsif ($ref eq "SCALAR") {
+       ${$self->{_content}} .= $$chunkref;
     }
     else {
-       $self->{'_content'} .= $_[0];
+       Carp::croak("Can't append to $ref content");
     }
     delete $self->{_parts};
 }
@@ -116,7 +140,14 @@ sub content_ref
     my $self = shift;
     $self->_content unless exists $self->{_content};
     delete $self->{_parts};
-    \$self->{'_content'};
+    my $old = \$self->{_content};
+    $old = $$old if ref($$old);
+    if (@_) {
+       my $new = shift;
+       Carp::croak("Setting content_ref to a non-ref") unless ref($new);
+       $self->{_content} = $new;
+    }
+    return $old;
 }
 
 
@@ -144,7 +175,7 @@ sub headers_as_string  { shift->{'_heade
 
 sub parts {
     my $self = shift;
-    if (defined(wantarray) && !exists $self->{_parts}) {
+    if (defined(wantarray) && (!exists $self->{_parts} || ref($self->{_content}) eq "SCALAR")) {
        $self->_parts;
     }
     my $old = $self->{_parts};
@@ -160,7 +191,7 @@ sub parts {
            $self->content_type("multipart/mixed");
        }
        $self->{_parts} = [@_];
-       delete $self->{_content};
+       _stale_content($self);
    }
    return @$old if wantarray;
    return $old->[0];
@@ -174,15 +205,27 @@ sub add_part {
        $self->content_type("multipart/mixed");
        $self->{_parts} = [$p];
    }
-   elsif (!exists $self->{_parts}) {
+   elsif (!exists $self->{_parts} || ref($self->{_content}) eq "SCALAR") {
       $self->_parts;
    }
 
    push(@{$self->{_parts}}, @_);
-   delete $self->{_content};
+   _stale_content($self);
    return;
 }
 
+sub _stale_content {
+    my $self = shift;
+    if (ref($self->{_content}) eq "SCALAR") {
+       # must recalculate now
+       $self->_content;
+    }
+    else {
+       # just invalidate cache
+       delete $self->{_content};
+    }
+}
+
 
 # delegate all other method calls to the _headers object.
 sub AUTOLOAD
@@ -219,7 +262,7 @@ sub _parts {
       die "Assert" unless @h;
       my %h = @{$h[0]};
       if (defined(my $b = $h{boundary})) {
-          my $str = $self->{_content};
+          my $str = $self->content;
          $str =~ s/\r?\n--\Q$b\E--\r?\n.*//s;
          if ($str =~ s

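A quick illustration of what the new setter enables (a hedged sketch,
assuming the patch above is applied; the URL and size are made up):

```perl
use HTTP::Request;

# Build a request whose body aliases an existing scalar, so the
# (potentially large) content is not copied into the object.
my $body = "x" x (10 * 1024 * 1024);
my $req  = HTTP::Request->new(PUT => "http://www.example.com/upload");
$req->content_ref(\$body);
print length(${ $req->content_ref }), "\n";
```
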
Re: Patch to support --full-time in File::Listing

2004-06-16 Thread Gisle Aas
Christopher J. Madsen [EMAIL PROTECTED] writes:

 Attached is a patch against LWP 5.79 to allow File::Listing to
 interpret the output of GNU ls's --full-time option.  This allows you
 to get timestamps accurate to the second, instead of the minute-based
 ones you get with a normal ls -l.

The patch did not apply here.  Are you patching from a pristine 5.79?

[EMAIL PROTECTED] lwp5]$ patch -p0 < full-time.patch
patching file lib/File/Listing.pm
Hunk #2 FAILED at 372.
Hunk #3 FAILED at 1.
Hunk #4 FAILED at 84.
3 out of 4 hunks FAILED -- saving rejects to file lib/File/Listing.pm.rej

Anyway, this is how --full-time comes out here (Redhat 9).  It does
not appear to be the same format you try to parse.

[EMAIL PROTECTED] lwp5]$ ls -l --full-time
total 368
-rw-rw-r--    1 gisle    gisle        3800 2004-04-07 12:44:47.000000000 +0200 AUTHORS
drwxrwxr-x    3 gisle    gisle        4096 2004-06-14 14:59:56.000000000 +0200 bin
drwxrwxr-x    7 gisle    gisle        4096 2004-06-14 14:59:58.000000000 +0200 blib
-rw-rw-r--    1 gisle    gisle       83867 2004-06-14 19:30:48.000000000 +0200 Changes

[EMAIL PROTECTED] lwp5]$ ls --version
ls (coreutils) 4.5.3
Written by Richard Stallman and David MacKenzie.

Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Regards,
Gisle


 I believe it also handles BSD ls's -T option, but I don't have a BSD
 system to test.  I'm just working off the OpenBSD manpage.
 
 The new time formats are recognized automatically; you just call
 parse_dir like you normally would.


Re: lwp-request patch to display response body on error

2004-06-16 Thread Gisle Aas
Lucas Gonze [EMAIL PROTECTED] writes:

 The lwp-request, GET, HEAD, POST - Simple WWW user agent  utilities
 never display the response body if the response code is an error.  For
 RESTful web services this suppresses potential debug information.

You don't state what version of LWP you are using, but
libwww-perl-5.71 (2003-10-14) had this fix:

lwp-request now prints unsuccessful responses in the same way
as successful ones.  The status will still indicate failures.
Based on a patch by Steve Hay [EMAIL PROTECTED].

Didn't that address this concern?

Regards,
Gisle


 Background: I am writing an API for my web app; documentation (out of date
 but enough to get the gist) on what I am doing is at
 http://webjay.org/help/api.  The client is expected to be a program, not a
 browser, so I use response status codes to communicate specifics about
 errors and the response body to communicate useful debugging hints.  A
 typical error response is:
 
 ...
 HTTP/1.1 409 Conflict
 Content-Type: text/plain
 
 There is already a playlist with this title.
 ...
 
 However, requests made using lwp-request never display the response body
 if there is an error.  lwp-request does this:
 if ($response->is_success) {
   ...
 } else {
   print STDERR $response->error_as_HTML unless $options{'d'};
 }
 
 And that turns into the boilerplate HTML in HTTP/Response.pm:
 sub error_as_HTML
 {
 my $self = shift;
 my $title = 'An Error Occurred';
 my $body  = $self->status_line;
 return <<EOM;
 <HTML>
 <HEAD><TITLE>$title</TITLE></HEAD>
 <BODY>
 <H1>$title</H1>
 $body
 </BODY>
 </HTML>
 EOM
 }
 
 I am expecting clients to be shell scripts using the lwp-request
 utilities, so it's important for the debug messages to be displayed.
 
 The fix: in GET, I have added a -D flag to display the response body even
 if there is an error.  This seemed like a good cognate next to -d, which
 always suppresses the response body.  Here is the patch, diff'd against my
 local copy of GET, which may not be the most recent:
 
 bash-2.05a$ diff /usr/bin/GET GET
 282a283
 >     'D', # LG patch -- display response body even on error
 477a479,482
 >     # LG patch to support my added -D flag
 >     if( $options{'D'} ){
 >         print STDERR $response->content unless $options{'d'};
 >     } else {
 479a485
 >     }
 
 - Lucas Gonze


libwww-perl-5.800

2004-06-17 Thread Gisle Aas
A brand new libwww-perl release should be out on CPAN now.  In fear of
running out of version numbers less than 5.9 I've added one more
digit.  I want to reserve 5.9 for betas for 6.0 if that should ever
happen.  The next release will be 5.801, so this scheme should keep us
going for a while.

The changes since 5.79 are:

HTML::Form will allow individual menu entries to be disabled.
This was needed to support <input type=radio disabled value=foo>
and <select><option disabled>foo</select>.

HTML::Form now avoids name clashes between the select and
option attributes.

HTML::Form now implicitly closes <select> elements when it sees
another <input> or </form>.  This is closer to the MSIE behaviour.

HTML::Form will now support <keygen> inputs.  It will not
calculate a key by itself.  The user will have to set its
value for it to be returned by the form.

HTTP::Headers now special-cases field names that start with a
':'.   This is used as an escape mechanism when you need the
header names to not go through canonicalization.  It means
that you can force LWP to use a specific casing and even
underscores in header names.  The ugly $TRANSLATE_UNDERSCORE
global has been undocumented as a result of this.

HTTP::Message will now allow an external 'content_ref'
to be set.  This can for instance be used to let HTTP::Request
objects pick up content data from some scalar variable without
having to copy it.

HTTP::Request::Common.  The individual parts will no longer
have a Content-Length header for file uploads.  This improves
compatibility with normal browsers.

LWP::Simple doc patch for getprint.
Contributed by Yitzchak Scott-Thoennes [EMAIL PROTECTED].

LWP::UserAgent: New methods default_header() and
default_headers().  These can be used to set up headers that
are automatically added to requests as they are sent.  This
can for instance be used to initialize various Accept headers.

Various typo fixes by Ville Skyttä [EMAIL PROTECTED].

Fixed test failure under perl-5.005.

LWP::Protocol::loopback:  This is a new protocol handler that
works like the HTTP TRACE method, it will return the request
provided to it.  This is sometimes useful for testing.  It can
for instance be invoked by setting the 'http_proxy' environment
variable to 'loopback:'.
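
Two of the items above, default headers and the loopback handler, can
be combined in a small sketch (the header value is illustrative):

```perl
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# every request sent by $ua now carries this header
$ua->default_header("Accept-Language" => "en, *;q=0.1");
# route http traffic through the echoing loopback handler
$ua->proxy(http => "loopback:");
# the response content is the request itself, so the default
# header should be visible in the printed output
print $ua->get("http://www.example.com/")->content;
```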

Enjoy!

Regards,
Gisle


Re: libwww-perl-5.800

2004-06-18 Thread Gisle Aas
\(William\) Wenjie Wang [EMAIL PROTECTED] writes:

 Failed Test   Stat Wstat Total Fail  Failed  List of Failed
 
 live/activestate.t 255 65280 23 150.00%  1-2
 live/jigsaw-auth-b.t 33 100.00%  1-3
 live/jigsaw-auth-d.t 11 100.00%  1
 live/jigsaw-chunk.t  9  2304 58 160.00%  1-5
 live/jigsaw-md5-get.t22 100.00%  1-2
 live/jigsaw-md5.t22 100.00%  1-2
 live/jigsaw-neg-get.t11 100.00%  1
 live/jigsaw-neg.t11 100.00%  1
 live/validator.t 2   512 24 200.00%  1-2
 Failed 9/41 test scripts, 78.05% okay. 19/761 subtests failed, 97.50% okay.
 NMAKE : fatal error U1077: 'C:\Perl\bin\perl.exe' : return code '0x2'
 Stop.

I think this must be a local problem at your site.  Is the machine
you're testing from properly connected to the Internet?  Do you have
to go through some proxy?

Regards,
Gisle


Re: libwww-perl-5.800

2004-06-18 Thread Gisle Aas
I would be grateful if you had the time to figure out why the tests
fail and perhaps even propose patches to work around the issue.  There
might be simple tweaks that can be done to them to make them work in
your environment.

You might run tests individually like this:

cd libwww-perl-5.800
perl -Ilib t/live/jigsaw-md5.t

--Gisle


Re: a suggestion for URI or URI::Heuristic

2004-06-28 Thread Gisle Aas
[EMAIL PROTECTED] writes:

 How about this for the next version of URI:
 
 URI->new("%68ttp://www.example.com/")->canonical eq "http://www.example.com/";

Why?  This appears just wrong.  RFC 2396 does not allow escapes in the
scheme part.  Is this used out in the wild?

Regards,
Gisle


Re: support for multiple outgoing IPs

2004-07-13 Thread Gisle Aas
Jeff 'japhy' Pinyan [EMAIL PROTECTED] writes:

 I'm going to release these subclasses, but I'd like to know if the libwww
 suite can perhaps be rewritten in the future to allow for this type of
 thing...

It is already sort of supported.  You can set the outgoing address by
tweaking the @LWP::Protocol::http::EXTRA_SOCK_OPTS.  What is your suggested
change to support this?

Regards,
Gisle


Re: support for multiple outgoing IPs

2004-07-13 Thread Gisle Aas
Jeff 'japhy' Pinyan [EMAIL PROTECTED] writes:

 On Jul 13, Gisle Aas said:
 
 Jeff 'japhy' Pinyan [EMAIL PROTECTED] writes:
 
  I'm going to release these subclasses, but I'd like to know if the libwww
  suite can perhaps be rewritten in the future to allow for this type of
  thing...
 
 It is already sort of supported.  You can set the outgoing address by
 tweaking the @LWP::Protocol::http::EXTRA_SOCK_OPTS.  What is your suggested
 change to support this?
 
 (That's not in the FTP protocol module, by the way...)

I know :(

 I see that, and I'm using it in my subclass, but it's a matter of getting
 the stuff *to* EXTRA_SOCK_OPTS.  The data (the array of IPs to use)
 shouldn't necessarily belong to the LWP::Protocol::http subclass; I'd
 expect it to belong to the LWP::UserAgent object, or in this case, the
 HTTP::Proxy object.

Either that or we could attach it to the request object.  Attaching it
to the request gives more flexibility and it could potentially be
defaulted from the $ua->default_header settings.

 And I haven't found a way to create my own LWP::Protocol::http subclass
 that is used instead of the original one.  That's why I had to subclass
 LWP::UserAgent and LWP::Protocol as well.

You should be able to override protocol handlers with:

   LWP::Protocol::implementor(http => "MyProtocol")

 The server I did my work on is currently down, but I'll provide my code
 tomorrow.

Ok, I'll take a look then.

Regards,
Gisle
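
The implementor() override mentioned above can be sketched like this
(MyProtocol is a hypothetical subclass name, and the method to
override is only a suggestion):

```perl
use LWP::Protocol;
use LWP::Protocol::http;

{
    package MyProtocol;
    our @ISA = ("LWP::Protocol::http");
    # override behaviour here, e.g. to choose an outgoing IP
    # per request
}

# from now on, LWP uses MyProtocol for all http:// requests
LWP::Protocol::implementor(http => "MyProtocol");
```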


Re: Patch to Form.pm to recognize button type=submit

2004-07-19 Thread Gisle Aas
Michael Alan Dorman [EMAIL PROTECTED] writes:

 Because the <input type="button"> tag doesn't allow text other than
 the value to be displayed in the button, I've had to start using the
 <button> tag on some of my pages.  Imagine my dismay when this caused
 WWW::Mechanize to no longer recognize that my form had buttons!

Nobody complained before so I guess they are not used much.

 After poking around WWW::Mechanize for a bit, I was led to HTML::Form,
 which doesn't currently recognize button tags.  This patch certainly
 fixes the issue I was having, and I think it represents a generally
 applicable enhancement.  I'd love to see it included in the next drop
 of libwww-perl.

Looks good.  Would be even better if you also updated the test suite.
I'll get it included in the next release.  Currently I have problems
accessing SourceForge so I'm not able to get it checked in.

  --- Form.pm.orig  2004-06-16 06:41:23.000000000 -0400
  +++ Form.pm   2004-07-19 11:03:31.000000000 -0400
  @@ -96,7 +96,7 @@
    my $p = HTML::TokeParser->new(ref($html) ? $html->content_ref : \$html);
    eval {
     # optimization
  -  $p->report_tags(qw(form input textarea select optgroup option keygen));
  +  $p->report_tags(qw(button form input textarea select optgroup option keygen));
    };
  
    unless (defined $base_uri) {
  @@ -130,6 +130,11 @@
     $attr->{value_name} = $p->get_phrase;
     $f->push_input($type, $attr);
    }
  + elsif ($tag eq "button") {
  +     my $type = delete $attr->{type} || "submit";
  +     $attr->{value_name} = $p->get_phrase;
  +     $f->push_input($type, $attr);

I don't think we should support <button type="checkbox"> and similar,
so I suggest we only push the input if the $type is "submit" at this
point.

  + }
    elsif ($tag eq "textarea") {
        $attr->{textarea_value} = $attr->{value}
            if exists $attr->{value};

Regards,
Gisle


Re: Problem uploading large files with PUT

2004-07-28 Thread Gisle Aas
Rodrigo Ruiz [EMAIL PROTECTED] writes:

 Yesterday I updated my LWP module from version 5.75 to the current 5.8
 version. From this update, one of my scripts has stopped working.

Oops!  Sorry.

 The script creates a PUT request, specifying a subroutine as the
 content, for dynamic content retrieval.  The original code does:
 
   my $req = HTTP::Request->new("PUT", $url, $header, $readFunc);
 
 But now it dies with a "Not a SCALAR reference" error.

I tried to reproduce this error but it did not happen for me.  Are you
able to provide a complete little program that demonstrates this
failure?

 I have been debugging the LWP code, and I have found the following
 workaround:
 
   my $req = HTTP::Request->new("PUT", $url, $header, \$readFunc);
 
 That is, pass the function reference, by reference.
 
 Unfortunately, this change makes my script fail with older LWP versions.
 
 My questions are:
 Is there a more elegant workaround that do not break compatibility
 with older LWP versions?

You could always do;

    $LWP::VERSION < 5.800 ? $readFunc : \$readFunc

but I would rather fix this problem in 5.801.  A test case that
reproduces this would be very helpful.

 If not, and I put these two lines in an if-else sentence, comparing
 the $LWP::VERSION value with a threshold, which exact version
 should I compare to?

This change went into 5.800. I'm quite sure it must be the culprit:

|HTTP::Message will now allow an external 'content_ref'
|to be set.  This can for instance be used to let HTTP::Request
|objects pick up content data from some scalar variable without
|having to copy it.

Regards,
Gisle


Re: LWP

2004-08-05 Thread Gisle Aas
DePriest, Mitch [EMAIL PROTECTED] writes:

 Will LWP::Simple run an activestate

Not really sure what you are asking about here, but LWP is part of the
standard ActivePerl distribution from ActiveState.  This includes the
LWP::Simple module.  A system with ActivePerl will always have LWP::Simple.

Regards,
Gisle Aas,
ActiveState


Re: Simulate HTTP transactions

2004-08-20 Thread Gisle Aas
William McKee [EMAIL PROTECTED] writes:

 On Thu, Aug 19, 2004 at 02:07:27PM -0700, Jaime Rodriguez wrote:
  Somebody tell me that in Perl, this is easily handled with the
  lib-www-perl module.  I had never used Perl and I wonder if you now
  what  I'm talking about and even if you can help me with some guidance
  of how to do it.
 
 You could do this with LWP but there are at least a couple helper
 modules that will make your life easier:
 
   HTTP::WebTest
   WWW::Mechanize

But if he has never used Perl then it might be a good idea to learn
that first.  Reading books could be a way to do that.

--Gisle


Re: <script><script> bug in HTML::TokeParser?

2004-08-25 Thread Gisle Aas
ashley [EMAIL PROTECTED] writes:

 Hey everyone. I think this is the right list to bring this up, please
 forgive me if I'm wrong.

This list should be right.

 While writing a simple HTML validator / forbidden tag stripper, I came
 across what might be a problem, though it might be expected and
 appropriate behavior, so I thought I'd better bring it up.
 
 A <script> following a <script> is interpreted as text.

This is expected behaviour.  After <script> no tags are recognized
until </script> is seen.  Everything in between is reported as text
and should be passed to whatever is able to parse the script, if
you're interested in it.

The same behaviour occurs for <style>, <textarea> and <xmp>.

  I realize that
 the actual script is text but maybe it should be loaded into PI (or D
 or C) instead of T? If not, plain stripping of the HTML leaves a
 potentially problematic situation.

How?  If you strip all script text there should not be a problem.

Regards,
Gisle

 Demo of the problem is below.
 
 Thank you for looking!
 -Ashley
 
 
 use HTML::TokeParser;
 use Data::Dumper;
 $Data::Dumper::Terse = 1;
 
 my $text = join '', <DATA>;
 my $p = HTML::TokeParser-new( \$text );
 
 while ( my $token = $p-get_token )
 {
  print Dumper $token;
 }
 
 __DATA__
 This is my spurious or malicious
 html<script><script>alert('boo!')</script>
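
A stripping pass along the lines Gisle describes could look like this
sketch (hedged: the skip-and-consume logic is illustrative, not the
validator's actual code):

```perl
use HTML::TokeParser;

my $html = "keep<script>alert('boo!')</script> this";
my $p    = HTML::TokeParser->new(\$html);
my $out  = "";
while (my $t = $p->get_token) {
    my $type = $t->[0];
    if ($type eq "S") {
        if ($t->[1] eq "script") {
            $p->get_text("/script");  # script body is reported as text
            $p->get_token;            # consume the </script> end tag
            next;
        }
        $out .= $t->[4];              # raw text of the start tag
    }
    elsif ($type eq "E") { $out .= $t->[2]; }  # raw end tag
    elsif ($type eq "T") { $out .= $t->[1]; }  # plain text
}
print $out, "\n";   # "keep this"
```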


Re: is_success() returning tru even though server was down

2004-09-01 Thread Gisle Aas
James Cloos [EMAIL PROTECTED] writes:

 I have some code that does:
 
 my $req = HTTP::Request->new(GET => "http://$foo/bar");
 my $res = $ua->request($req);
 push @good, $foo if ($res->is_success);
 
 in a loop.
 
 I tested that is_success did the right thing if the file bar was not
 in the server's $SERVER_ROOT, and I presumed it would return false
 if $foo was not up.
 
 But in fact, $res->_rc is 200 when the remote box is down, just like
 when the file bar exists.
 
 Tested on gentoo w/ latest ebuilds of perl and libwww, and freebsd
 5.2.1 w/ their ports.
 
 Why does _rc == 200 when there was no reply from the server?

I've never seen that happen.  Can you provide me with the full
$res->as_string output in this case?  It might also be instructive to
strace the client as it runs to see what happens at the syscall level.
If you get a 200 response it must mean that the connection to the
server succeeded.

 I presume part of the problem is that it appears to be sending a HTTP
 0.9 GET rather than a 1.0 GET.  I don't see anything in the docs
 about forcing the latter.  How is that done?

LWP always sends HTTP/1.1 GETs.

 Or should I do a HEAD instead of a GET, given that I'm only testing
 for the existence of the file and the network connection between the
 two boxen?

The HEAD might be cheaper, but not all servers implement it for all
resources.
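
That HEAD-with-GET-fallback can be sketched as follows (hypothetical
URL; falling back on a 405 status is one reasonable heuristic, not the
only one):

```perl
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = "http://www.example.com/bar";
my $res = $ua->head($url);
# fall back to GET when the server rejects HEAD for this resource
$res = $ua->get($url) if $res->code == 405;
print $res->is_success ? "up\n" : "down: " . $res->status_line . "\n";
```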

Regards,
Gisle


Re: URI::file not RFC 1738 compliant?

2004-09-06 Thread Gisle Aas
Ville Skyttä [EMAIL PROTECTED] writes:

 As far as I can tell, RFC 1738, section 3.10, as well as the BNF in
 section 5, explicitly says that a file: URI must have two forward slashes
 before the optional hostname, followed by another forward slash, and
 then the path.

RFC 1738 is becoming a bit stale.  I do believe that the intent is for
'file' URIs to also follow the RFC 2396 syntax for hierarchical
namespaces, which clearly states that the 'authority' is optional.

  absoluteURI   = scheme ":" ( hier_part | opaque_part )
  hier_part     = ( net_path | abs_path ) [ "?" query ]
  net_path      = "//" authority [ abs_path ]
  abs_path      = "/"  path_segments

 However:
 
   $ perl -MURI::file -e 'print URI::file->new_abs("/foo"), "\n"'
   file:/foo
 
 I would have expected file:///foo.

I just find 'file:///foo' very ugly so I try to avoid using the
triple slash whenever I can.  There is also a slight semantic
difference between these two forms.  If the authority is missing it
means that it is unknown, while an authority of "" is documented to be
a synonym for localhost.  Perhaps this can be used to argue that
'file:///foo' is more correct.

  These one-slash file: URIs cause
 various interoperability problems here and there with applications or
 other libraries that require strict RFC compliance.  For example,
 XML::LibXML::SAX seems to treat file:/foo as a literal relative path
 from the current directory (ie. $PWD/file:/foo), whereas file:///foo
 works with it as expected.

Do you have other examples?

 Would it be possible to have this fixed in URI?

Sure.  Especially if I'm told about more apps that can't interoperate
with authority-less file URIs.  I might want to make it an option.

Regards,
Gisle


Re: URI::file not RFC 1738 compliant?

2004-09-07 Thread Gisle Aas
Bjoern Hoehrmann [EMAIL PROTECTED] writes:

   Unhandled Exception: System.UriFormatException: Invalid URI:
   The format of the URI could not be determined.

Ok. URI-1.32 has just been uploaded and it revises how we map filenames
to file URIs.  Some examples with the new module:

$ perl -MURI::file -le 'print URI::file->new("foo", "unix")'
foo
$ perl -MURI::file -le 'print URI::file->new("/foo", "unix")'
file:///foo
$ perl -MURI::file -le 'print URI::file->new("foo", "win32")'
foo
$ perl -MURI::file -le 'print URI::file->new("/foo", "win32")'
file:///foo
$ perl -MURI::file -le 'print URI::file->new("//h/foo", "win32")'
file:h/foo
$ perl -MURI::file -le 'print URI::file->new("c:foo", "win32")'
file:///c:foo
$ perl -MURI::file -le 'print URI::file->new("c:\\foo", "win32")'
file:///c:/foo


Re: Download manager - problem solved

2004-09-17 Thread Gisle Aas
Octavian Rasnita [EMAIL PROTECTED] writes:

 I have found how to insert HTTP headers in the line:
 
 $ua->get($url, ":content_file" => $file);
 
 ...But I still can't find how to get and send cookies in other way than
 trying to manually add the headers for Cookies.

Just enable the cookie jar, with $ua->cookie_jar({}), and LWP will
manage the cookies for you.
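For example (a minimal sketch; the jar file name below is just an
example):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;

# In-memory jar; cookies are kept for the lifetime of this process:
$ua->cookie_jar({});

# Or a jar that is loaded from and saved back to a file, so cookies
# survive between runs ("cookies.txt" is an example name):
$ua->cookie_jar(HTTP::Cookies->new(
    file     => "cookies.txt",
    autosave => 1,
));

# From here on, Set-Cookie response headers are collected and the
# matching Cookie headers are added to later requests automatically.
```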

Regards,
Gisle


Re: Breaking a keep alive connection

2004-10-01 Thread Gisle Aas
Bill Moseley [EMAIL PROTECTED] writes:

 I'm using keep alives and the form of $ua->get() that uses a callback
 function to read the data as it arrives.
 
 If the callback function dies will the connection always be broken?

Yes, unless it dies after the last part of the response has actually
been read.

 That is, will the next request to that server be a new connection, not
 an existing open connection from a previous keep alive request?
 
 I assume if I've only read the first chunk out of a very large
 response then the connection will be broken.  But, I'm not clear what
 happens if the fetched document is very small (like the first chunk is
 the entire document).
 
 Does size matter?  Or would LWP drop the connection regardless?

If LWP has provided you with the complete content when your callback
dies, then the connection is kept up.

 Also, is there a way to ask LWP if a request would be to an open
 connection before making the actual connection?

You could roam around in the $ua->conn_cache object, but it is not
really documented how the connections are indexed, or what the
connection objects actually are.

If you want to make sure LWP has no more connections open, you can
call $ua->conn_cache->drop;

Regards,
Gisle


Re: Byte Order Mark mucks up headers

2004-10-07 Thread Gisle Aas
Phil Archer [EMAIL PROTECTED] writes:

 I've read Sean Burke's book, I've looked through the archives of this
 list and done other searches but can't find an answer to a problem I
 have found with LWP. If the character coding for a website has a byte
 order mark (things like utf-16, all that big endian/little endian
 stuff) then LWP can't interpret HTML headers in the usual way. Does
 anyone know a way around this?

HTML::HeadParser needs to be fixed.  It will assume that there is no
head section when it sees text before anything else.  The part of
the code responsible for this currently allows whitespace, but needs
to be taught that a BOM is harmless too.  Look at the 'text' method.

Do you want to try to provide a patch?
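For reference, this is roughly how HTML::HeadParser behaves when the
head section is intact; with a BOM before the first tag the parser
currently bails out of head parsing, and these headers come back
empty:

```perl
use strict;
use warnings;
use HTML::HeadParser;

my $p = HTML::HeadParser->new;
$p->parse(<<'HTML');
<head>
<title>Example page</title>
<meta name="keywords" content="bom,utf-16">
</head>
HTML

# Head fields are exposed as pseudo HTTP headers:
print $p->header("Title"), "\n";            # Example page
print $p->header("X-Meta-Keywords"), "\n";  # bom,utf-16
```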

Regards,
Gisle


Re: mirror.al

2004-10-26 Thread Gisle Aas
Bill Moseley [EMAIL PROTECTED] writes:

 I'm trying to understand this error:
 
 $ perl -MLWP::UserAgent -le 'LWP::UserAgent->new->mirror("http://perl.org", "perl.org")'
 Can't locate auto/LWP/UserAgent/mirror.al in @INC (@INC contains: 
 /usr/local/lib/perl5/5.8.0/sun4-solaris /usr/local/lib/perl5/5.8.0 
 /usr/local/lib/perl5/site_perl/5.8.0/sun4-solaris 
 /usr/local/lib/perl5/site_perl/5.8.0 /usr/local/lib/perl5/site_perl .) at -e line 1

Looks like LWP was not installed with the normal 'perl Makefile.PL &&
make install' drill.  This creates the .al files.

 There's a few versions of Perl installed on this machine -- so I'm
 wondering if there's some kind of conflict.  Or if it's just a broken
 Sun package.
 
 $ perl -MLWP -le 'print $LWP::VERSION'
 5.10

Newer versions of LWP do not use the autoloader and would not run into
this problem.  This version of LWP is more than 7 years old.  I
recommend something newer.
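With a current LWP installed, mirror() can then be called as a method
(the URL and file name below are examples):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Saves the body to the file, sending If-Modified-Since when the
# local copy already exists:
my $res = $ua->mirror("http://perl.org/", "perl.org.html");

# 200 when (re)fetched, 304 when the local copy was still current
print $res->status_line, "\n";
```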

Regards,
Gisle


Re: LWP installation failed make test: base/date

2004-11-05 Thread Gisle Aas
Craig Cummings [EMAIL PROTECTED] writes:

 I'm trying to install Bundle::LWP on my Debian Linux system.

Interesting.  What do these commands print on your system:

   perl -le 'print scalar gmtime(0)'
   perl -le 'print scalar gmtime(760233600)'
   perl -le 'print scalar gmtime(3915993600)'



Re: Promoting Mechanize

2004-11-08 Thread Gisle Aas
Andy Lester [EMAIL PROTECTED] writes:

 Gisle, can we put some kind of mention of WWW::Mechanize in
 LWP::UserAgent?  Plenty of people know about LWP, but want to do the
 rest of the stuff that Mech does.  See this as an example:
 
 http://www.perlmonks.org/?node_id=405988

The LWP::UserAgent manpage[1] already says:

| See WWW::Mechanize and WWW::Search for examples of more specialized
| user agents based on LWP::UserAgent.

Do you want some other reference as well?

[1] http://search.cpan.org/dist/libwww-perl/lib/LWP/UserAgent.pm#SEE_ALSO


Re: Segfault using HTML::Parser and URI::URL

2004-11-09 Thread Gisle Aas
Thibaut Britz [EMAIL PROTECTED] writes:

 the following produces a segfault using the latest version of libwww. 

I see segfaults with ActivePerl 810 but not with our latest builds.
What version of perl are you using?  The segfault appears to be a bug
in perl; I would like to find out if the problem has really been fixed.

 As it seems, HTML::Parser is marking non UTF8 strings as UTF8 strings.

Did you enable the Unicode support when you installed HTML-Parser?  It
seems like this would be the only time this happens, but I want to be
sure.

 or to see it:
 
 #!/usr/bin/perl
 use warnings;
 use strict;
 use Devel::Peek;
 use HTML::Parser;
 my $html = qq{<img title="&rsquo;\260">};
 my $p = HTML::Parser->new(api_version => 3, start_h => [sub { Dump(shift->{title}) }, "attr"]);
 $p->parse($html);

What output do you get?


Re: Segfault using HTML::Parser and URI::URL

2004-11-10 Thread Gisle Aas
The following patch should make sure that HTML::Parser does not
produce badly encoded SVs.  That avoids the problem demonstrated, but I
still need to track down why perl itself segfaulted because of this.

Regards,
Gisle

Index: util.c
===
RCS file: /cvsroot/libwww-perl/html-parser/util.c,v
retrieving revision 2.20
retrieving revision 2.21
diff -u -p -r2.20 -r2.21
--- util.c  8 Nov 2004 14:14:35 -   2.20
+++ util.c  10 Nov 2004 13:32:56 -  2.21
@@ -209,23 +209,21 @@ decode_entities(pTHX_ SV* sv, HV* entity
 }
 
 if (!SvUTF8(sv) && repl_utf8) {
-   STRLEN len = t - SvPVX(sv);
-   if (len) {
-   /* need to upgrade the part that we have looked though */
-   STRLEN old_len = len;
-   char *ustr = bytes_to_utf8(SvPVX(sv), &len);
-   STRLEN grow = len - old_len;
-   if (grow) {
-   /* XXX It might already be enough gap, so we don't need this,
-  but it should not hurt either.
-   */
-   grow_gap(aTHX_ sv, grow, &t, &s, &end);
-   Copy(ustr, SvPVX(sv), len, char);
-   t = SvPVX(sv) + len;
-   }
-   Safefree(ustr);
-   }
+   /* need to upgrade sv before we continue */
+   STRLEN before_gap_len = t - SvPVX(sv);
+   char *before_gap = bytes_to_utf8(SvPVX(sv), &before_gap_len);
+   STRLEN after_gap_len = end - s;
+   char *after_gap = bytes_to_utf8(s, &after_gap_len);
+
+   sv_setpvn(sv, before_gap, before_gap_len);
+   sv_catpvn(sv, after_gap, after_gap_len);
 SvUTF8_on(sv);
+
+   Safefree(before_gap);
+   Safefree(after_gap);
+
+   s = t = SvPVX(sv) + before_gap_len;
+   end = SvPVX(sv) + before_gap_len + after_gap_len;
 }
 else if (SvUTF8(sv) && !repl_utf8) {
 repl = bytes_to_utf8(repl, &repl_len);
Index: t/uentities.t
===
RCS file: /cvsroot/libwww-perl/html-parser/t/uentities.t,v
retrieving revision 1.8
retrieving revision 1.9
diff -u -p -r1.8 -r1.9
--- t/uentities.t   8 Nov 2004 14:14:42 -   1.8
+++ t/uentities.t   10 Nov 2004 13:33:03 -  1.9
@@ -14,7 +14,7 @@ unless (HTML::Entities::UNICODE_SUPPORT
 exit;
 }
 
-print "1..13\n";
+print "1..14\n";
 
 print "not " unless decode_entities("&euro;") eq "\x{20AC}";
 print "ok 1\n";
@@ -90,3 +90,6 @@ print "ok 12\n";
 
 print "not " unless decode_entities("&#56256;") eq chr(0xFFFD);
 print "ok 13\n";
+
+print "not " unless decode_entities("\260&rsquo;\260") eq "\x{b0}\x{2019}\x{b0}";
+print "ok 14\n";


Re: Segfault using HTML::Parser and URI::URL

2004-11-10 Thread Gisle Aas
Gisle Aas [EMAIL PROTECTED] writes:

 The following patch should make sure that HTML::Parser does not
 produce badly encoded SVs.  That avoids the problem demonstrated, but I
 still need to track down why perl itself segfaulted because of this.

Perl crashed because the regexp engine did not deal properly with bad
UTF8.  This will be fixed in perl-5.8.6 by this patch:

   http://public.activestate.com/cgi-bin/perlbrowse?patch=23261

Regards,
Gisle


Re: HTML::Parser plaintext tag

2004-11-10 Thread Gisle Aas
Alex Kapranoff [EMAIL PROTECTED] writes:

 As far as I can understand HTML::Parser simply ignores the closing
 </plaintext> tag. I read the tests and Changes so I see that this is
 intended behaviour and plaintext is special-cased among the CDATA
 elements.
 
 Does someone know the reasoning of this decision? :) It is just plain
 interesting.

A long time ago the HTTP protocol did not have MIME-like headers.  The
client sent a "GET foo" line and the server responded with HTML and
then closed the connection.  Since there was no way for the server to
indicate any other Content-Type than text/html the <plaintext> tag was
introduced so that text files could be served by just prefixing the
file content with this tag.

This was before the <img> tag was invented so luckily we don't have a
similar unclosed <gif> tag :)

 Does HTML::Parser imitate some old browser here?

Yes, it was there in the beginning but still seems well supported.  Of
my current browsers both Konqueror and MSIE support this.  Firefox
supports it in the same way as <xmp>, i.e. it allows you to escape out
of it with </plaintext>.

The <plaintext> tag is described in this historic document:

   http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html#7

 It results in weird effects for me as I write a HTML sanitizer for
 WebMail.

How come?  Do you have a need to suppress this behaviour in HTML::Parser?

Regards,
Gisle


Re: HTML::Parser plaintext tag

2004-11-11 Thread Gisle Aas
Alex Kapranoff [EMAIL PROTECTED] writes:

 * Alex Kapranoff [EMAIL PROTECTED] [November 11 2004, 11:11]:
It results in weird effects for me as I write a HTML sanitizer for
WebMail.
   How come?  Do you have a need to suppress this behaviour in HTML::Parser?
Yes, I'd like to have an option to resume parsing after `</plaintext>'
  just as firefox does. As I understand the original intentions now I'll
  try to produce a patch.
 
 I've filed a ticket 8362 in rt.cpan.org with the patch. It creates an
 additional boolean attribute `closing_plaintext'. Not that I insist on
 naming.

Seems good; and I've just uploaded HTML-Parser-3.38 with this patch.


Re: [PATCH] Caching/reusing WWW::RobotRules(::InCore)

2004-11-12 Thread Gisle Aas
Ville Skyttä [EMAIL PROTECTED] writes:

 The current behaviour of LWP::RobotUA, when passed in an existing
 WWW::RobotRules::InCore object is counterintuitive to me.
 
 I am of this opinion because of the documentation of $rules in
 LWP::RobotUA->new() and WWW::RobotRules->agent(), as well as the
 implementation in WWW::RobotRules::AnyDBM_File.
 
 Currently, W::R::InCore empties the cache always when agent() is called,
 regardless if the agent name changed or not.  W::R::AnyDBM_File does not
 seem to have this problem.
 
 I suggest applying the attached patch to fix this.

Applied.  Will be in 5.801.

Regards,
Gisle


 Index: lib/WWW/RobotRules.pm
 ===
 RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
 retrieving revision 1.30
 diff -a -u -r1.30 RobotRules.pm
 --- lib/WWW/RobotRules.pm 9 Apr 2004 15:09:14 -   1.30
 +++ lib/WWW/RobotRules.pm 12 Oct 2004 06:39:34 -
  @@ -185,10 +185,12 @@
   #   "FooBot/1.2"  => "FooBot"
   #   "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]" => "FooBot"
   
  - delete $self->{'loc'};   # all old info is now stale
    $name = $1 if $name =~ m/(\S+)/; # get first word
    $name =~ s!/.*!!;  # get rid of version
  - $self->{'ua'}=$name;
  + unless ($old && $old eq $name) {
  + delete $self->{'loc'}; # all old info is now stale
  + $self->{'ua'} = $name;
  + }
   }
   $old;
   }


Re: Patch for WWW::RobotsRules.pm

2004-11-12 Thread Gisle Aas
Bill Moseley [EMAIL PROTECTED] writes:

 I've got a spider that uses LWP::RobotUA (WWW::RobotRules) and a few
 users of the spider have complained that the warning messages were
 not obvious enough.  I guess I can agree because when they are
 spidering multiple hosts the message doesn't tell them what robots.txt
 had a problem.

The patch I've now applied is this one:

Index: lib/WWW/RobotRules.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.31
retrieving revision 1.32
diff -u -p -u -r1.31 -r1.32
--- lib/WWW/RobotRules.pm   12 Nov 2004 16:05:09 -  1.31
+++ lib/WWW/RobotRules.pm   12 Nov 2004 16:14:25 -  1.32
@@ -1,8 +1,8 @@
 package WWW::RobotRules;

-# $Id: RobotRules.pm,v 1.31 2004/11/12 16:05:09 gisle Exp $
+# $Id: RobotRules.pm,v 1.32 2004/11/12 16:14:25 gisle Exp $

-$VERSION = sprintf("%d.%02d", q$Revision: 1.31 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf("%d.%02d", q$Revision: 1.32 $ =~ /(\d+)\.(\d+)/);
 sub Version { $VERSION; }

 use strict;
@@ -70,7 +70,7 @@ sub parse {
}
elsif (/^\s*Disallow\s*:\s*(.*)/i) {
unless (defined $ua) {
-   warn "RobotRules: Disallow without preceding User-agent\n";
+   warn "RobotRules <$robot_txt_uri>: Disallow without preceding User-agent\n" if $^W;
$is_anon = 1;  # assume that User-agent: * was intended
}
my $disallow = $1;
@@ -97,7 +97,7 @@ sub parse {
}
}
else {
-   warn "RobotRules: Unexpected line: $_\n";
+   warn "RobotRules <$robot_txt_uri>: Unexpected line: $_\n" if $^W;
}
 }

 So maybe something like:
 
 --- RobotRules.pm.old   2004-04-09 08:37:08.0 -0700
 +++ RobotRules.pm   2004-09-16 09:46:03.0 -0700
 @@ -70,7 +70,7 @@
 }
 elsif (/^\s*Disallow\s*:\s*(.*)/i) {
 unless (defined $ua) {
 -   warn "RobotRules: Disallow without preceding User-agent\n";
 +   warn "RobotRules: [$robot_txt_uri] Disallow without preceding User-agent\n";
 $is_anon = 1;  # assume that User-agent: * was intended
 }
 my $disallow = $1;
 @@ -97,7 +97,7 @@
 }
 }
 else {
 -   warn "RobotRules: Unexpected line: $_\n";
 +   warn "RobotRules: [$robot_txt_uri] Unexpected line: $_\n";
 }
  }


Re: WWW::RobotRules warning could be more helpful

2004-11-12 Thread Gisle Aas
[EMAIL PROTECTED] writes:

 If you spider several sites and one of them has a broken robots.txt file you
 can't tell which one since the warning doesn't tell you.

This will be better in 5.801.  I've applied a variation of Bill
Moseley's suggested patch for the same problem.

 Around line 73 of RobotRules.pm
 change:
   warn "RobotRules: Disallow without preceding User-agent\n";
 to
   # [EMAIL PROTECTED]: added $netloc
 warn "RobotRules: $netloc Disallow without preceding User-agent\n";


Re: / uri escaped in LWP::Protocol::file

2004-11-15 Thread Gisle Aas
Moshe Kaminsky [EMAIL PROTECTED] writes:

 It appears to me there is a small bug in LWP::Protocol::file. The '/' 
 added to the end of each directory member which is itself a directory, 
 is escaped when turning it into a url, making the url quite useless. I 
 suggest the following patch:

Finally applied.  Thanks!

Regards,
Gisle


 --- /usr/lib/perl5/vendor_perl/5.8.4/LWP/Protocol/file.old2004-09-19 
 22:56:35.786858776 +0300
 +++ /usr/lib/perl5/vendor_perl/5.8.4/LWP/Protocol/file.pm 2004-09-19 
 22:56:24.0 +0300
 @@ -96,14 +96,13 @@
   closedir(D);
  
   # Make directory listing
 +my $pathe = $path . ( $^O eq 'MacOS' ? ':' : '/');
   for (@files) {
  - if($^O eq "MacOS") {
  - $_ .= "/" if -d "$path:$_";
  - }
  - else {
  - $_ .= "/" if -d "$path/$_";
  - }
    my $furl = URI::Escape::uri_escape($_);
  +if ( -d "$pathe$_" ) {
  +$furl .= '/';
  +$_ .= '/';
  +}
    my $desc = HTML::Entities::encode($_);
    $_ = qq{<LI><A HREF="$furl">$desc</A>};
   }


Re: HTML::HeadParser

2004-11-15 Thread Gisle Aas
David Hofmann [EMAIL PROTECTED] writes:

 I'm currently using your perl module for processing input from a
 spider I wrote.
 
 The problem I'm encountering is some pages have "<" in the title.
 
 Example HTML:
 
 <TITLE>274500 - XL: Save Changes in <Bookname> Prompt Even If No
 Changes Are Made</TITLE>
 
 The result I get back is "XL: Save Changes in ". Also the
 description, keywords and last-modified come back blank on these pages
 if they were after the title on the page.

It looks like most other browsers parse <title> stuff in what the
HTML::Parser sources call "literal mode".  I've now applied the
following patch to my sources, but I'm not really sure this is a good
idea.  I might still decide to revert it before release.

Index: hparser.c
===
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.98
retrieving revision 2.99
diff -u -p -u -r2.98 -r2.99
--- hparser.c   11 Nov 2004 10:12:51 -  2.98
+++ hparser.c   15 Nov 2004 22:19:49 -  2.99
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.98 2004/11/11 10:12:51 gisle Exp $
+/* $Id: hparser.c,v 2.99 2004/11/15 22:19:49 gisle Exp $
  *
  * Copyright 1999-2002, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -27,6 +27,7 @@ literal_mode_elem[] =
 {5, style, 1},
 {3, xmp, 1},
 {9, plaintext, 1},
+{5, title, 0},
 {8, textarea, 0},
 {0, 0, 0}
 };

The problem here is that other browsers seem to switch into a mode
where tags inside <title> are still recognized if no </title> end tag
was found in the document.  HTML-Parser does not have the brains to do
something like this. It tries to parse the document in a stream-like
fashion, and buffering it all to figure out what quirks mode to
parse in does not seem attractive.

Regards,
Gisle


libwww-perl-5.801

2004-11-17 Thread Gisle Aas
Eventually I found time to fix the problem with code references as
content that was introduced by 5.800 and to integrate some more patches.
I will probably make a 5.802 later this week, so if there are new or
old patches you really want applied this might be a good time to
speak up.

The changes since 5.800 are:


HTTP::Message improved content/content_ref interaction.  Fixes
DYNAMIC_FILE_UPLOAD and other uses of code content in requests.

HTML::Form:
  - Handle clicking on nameless image.
  - Don't let $form-click invoke a disabled submit button.

HTTP::Cookies could not handle an old-style cookie named
"Expires".

HTTP::Headers work-around for thread safety issue in perl <= 5.8.4.

HTTP::Request::Common improved documentation.

LWP::Protocol: Check that we can write to the file specified in
$ua->request(..., $file) or $ua->mirror.

LWP::UserAgent clone() dies if proxy was not set.  Patch by
Andy Lester [EMAIL PROTECTED]

HTTP::Methods now avoid use of uninitialized-warning when server
replies with incomplete status line.

lwp-download will now actually tell you why it aborts if it runs
out of disk space or fails to write some other way.

WWW::RobotRules: only display warning when running under 'perl -w'
and show which robots.txt file they correspond to.  Based on
patch by Bill Moseley.

WWW::RobotRules: Don't empty cache when agent() is called if the
agent name does not change.  Patch by Ville Skyttä [EMAIL PROTECTED].


Enjoy!

Regards,
Gisle


HTML-Parser-3.39_90

2004-11-17 Thread Gisle Aas
I just uploaded HTML-Parser-3.39_90 to CPAN.  It is supposed to have
proper handling of Unicode on perl-5.8 or better.  The compile time
option to select decoding of Unicode entities is gone.

This release also makes <title>...</title> parse in literal mode.  If
there are many pages out there with non-terminated title elements this
might not be such a good idea, so this change might not stay.

Please try it out to see if you find problems with it.

Regards,
Gisle


Re: URI doesn't accept a semi-colon as query parameter separator

2004-11-24 Thread Gisle Aas
Brian Cassidy [EMAIL PROTECTED] writes:

 
 I was testing an app at the command line which does some query and URL
 manipulation. At one point, I pass the URL as generated from CGI.pm, which
 happens to use a semi-colon (rather than an ampersand) as the query
 parameter separator. Once I tried to access the params from the hash URI
 returns from query_form(), I noticed that there was only 1 param instead of
 the many more I was expecting.

Is using the URI::Semi class workable for you?  If not, why?

http://www.rosat.mpe-garching.mpg.de/mailing-lists/libwww-perl/2002-09/msg00022.html


Re: user agents

2004-12-01 Thread Gisle Aas
Zed Lopez [EMAIL PROTECTED] writes:

 I'd like to suggest these differences be documented.

I agree this is wrong.  Do you want to suggest a doc patch?

 Does anyone know why _trivial_http_get uses its own user agent and
 HTTP version?

Because it is a totally different client implementation with its own
bugs and limitations.  You can force always using the full LWP client
implementation by importing $ua from LWP::Simple.
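For example (the agent string below is made up), importing $ua both
lets you configure the client and guarantees that get() uses it:

```perl
use strict;
use warnings;
use LWP::Simple qw(get $ua);   # importing $ua disables _trivial_http_get

$ua->agent("MyChecker/0.1");   # example agent string
$ua->timeout(15);

my $content = get("http://www.example.com/");
print defined $content ? "fetched\n" : "failed\n";
```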

Regards,
Gisle


HTML-Parser-3.41

2004-12-01 Thread Gisle Aas
HTML-Parser-3.41 is available from CPAN.  The major news is that
HTML::Parser should now do the right thing with Unicode strings and
that the compile time option to enable Unicode entities is gone.
There is a new 'utf8_mode' option that allows saner parsing of raw
undecoded UTF-8.  The Unicode support is only available if you use
perl-5.8 or better.

Other noteworthy recent changes:

   - <title> content parsed in literal mode

   - <script> and <style> skip quoted strings when looking for
 matching end tag

   - if no matching end tag is found for <script>, <style>, <xmp>,
 <title>, <textarea> then generate one where the next tag
 starts.

   - will decode unterminated entities in 'dtext', i.e. "foo&nbspbar"
 becomes "foo bar".


Enjoy!


libwww-perl-5.802

2004-12-01 Thread Gisle Aas
libwww-perl-5.802 is available from CPAN. The changes since 5.801 are:

The HTTP::Message object now has a decoded_content() method.
This will return the content after any Content-Encodings and
charsets have been decoded.

Compress::Zlib is now a prerequisite module.

HTTP::Request::Common: The POST() function created an invalid
Content-Type header for file uploads with no parameters.

Net::HTTP: Allow Transfer-Encoding with trailing whitespace.
http://rt.cpan.org/Ticket/Display.html?id=3929

Net::HTTP: Don't allow empty content to be treated as a valid
HTTP/0.9 response.
http://rt.cpan.org/Ticket/Display.html?id=4581
http://rt.cpan.org/Ticket/Display.html?id=6883

LWP::Protocol::file: Fixup directory links in HTML generated
for directories.  Patch by Moshe Kaminsky [EMAIL PROTECTED].

Makefile.PL will try to discover misconfigured systems that
can't talk to themselves and disable tests that depend on this.

Makefile.PL will now default to 'n' when asking about whether
to install the GET, HEAD, POST programs.  There has been
too many name clashes with these common names.


Enjoy!


decoded_content

2004-12-01 Thread Gisle Aas
Gisle Aas [EMAIL PROTECTED] writes:

 The HTTP::Message object now has a decoded_content() method.
 This will return the content after any Content-Encodings and
 charsets have been decoded.

The current $mess->decoded_content implementation is quite naïve in
its mapping of charsets.  It needs to either start using Björn's
HTML::Encoding module or start doing similar sniffing to better guess
the charset when the Content-Type header does not provide any.

I also plan to expose a $mess->charset method that would just return
the guessed charset, i.e. something similar to
encoding_from_http_message() provided by HTML::Encoding.

Another problem is that I have no idea how well the charset names
found in the HTTP/HTML maps to the encoding names that the perl Encode
module supports.  Anybody knows what the state here is?

When this works the next step is to figure out the best way to do
streamed decoding.  This is needed for the HeadParser that LWP
invokes.

The main motivation for decoded_content is that HTML::Parser now works
better if properly decoded Unicode can be provided to it, but it still
fails here:

  $ lwp-request -d www.microsoft.com
  Parsing of undecoded UTF-8 will give garbage when decoding entities
  at lib/LWP/Protocol.pm line 114.

Here decoded_content needs to sniff the BOM to be able to guess that
they use UTF-8 so that a properly decoded string can be provided to
HTML::HeadParser.

The decoded_content also solve the frequent request of supporting
compressed content.  Just do something like this:

   $ua = LWP::UserAgent->new;
   $ua->default_header("Accept-Encoding" => "gzip, deflate");

   $res = $ua->get("http://www.example.com");
   print $res->decoded_content(charset => "none");

Regards,
Gisle


Re: user agents

2004-12-02 Thread Gisle Aas
Zed Lopez [EMAIL PROTECTED] writes:

 On 01 Dec 2004 01:35:13 -0800, Gisle Aas [EMAIL PROTECTED] wrote:
  Zed Lopez [EMAIL PROTECTED] writes:
   I'd like to suggest these differences be documented.
  
  I agree this is wrong.  Do you want to suggest a doc patch?
 
 I'm working on the doc patch... would it be considered desirable to
 document that a user can get get() to drive HTTP::Request by setting
 $LWP::Simple::FULL_LWP to a true value? Or that one can use get_old()
 to drive HTTP::Request?
 
 Obviously, no one wants to add a lot of complexity to a ::Simple
 module, but right now the behavior includes: the user agent and HTTP
 version are subject to change if an HTTP proxy is in use or if the
 requested page does a redirect. And there's no way to code around that
 within this module's public interface.

It is documented (barely) that the module export the variable '$ua'.
A side effect of importing this variable is that this forces the full
LWP::UserAgent implementation to be used, otherwise settings on the
$ua object would have no effect.  I want to declare this as the
official interface to force this and not document either get_old or
$FULL_LWP.

Regards,
Gisle


Re: HTML::Parser 3.40/3.41 and UTF8 on perl 5.8.0

2004-12-02 Thread Gisle Aas
Reed Russell - rreed [EMAIL PROTECTED] writes:

 The sv_catpvn_utf8_upgrade macro used in hparser.c in versions 3.40 and 3.41
 of HTML::Parser doesn't seem to exist in Perl 5.8.0.  Can the macro be
 replaced, so that the module is compatible with this version of Perl?

Sure.  Applied.  I've simplified your patch to be:

Index: hparser.c
===
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.117
diff -u -p -r2.117 hparser.c
--- hparser.c   2 Dec 2004 11:14:59 -   2.117
+++ hparser.c   2 Dec 2004 11:50:59 -
@@ -300,8 +300,10 @@ report_event(PSTATE* p_state,
 sv_catpvn(p_state->pend_text, beg, end - beg);
 }
 else {
-   SV *tmp = NULL;
-   sv_catpvn_utf8_upgrade(p_state->pend_text, beg, end - beg, tmp);
+   SV *tmp = newSVpvn(beg, end - beg);
+   sv_utf8_upgrade(tmp);
+   sv_catsv(p_state->pend_text, tmp);
+   SvREFCNT_dec(tmp);
 }
 #else
 sv_catpvn(p_state->pend_text, beg, end - beg);
@@ -639,8 +641,10 @@ IGNORE_EVENT:
 #ifdef UNICODE_HTML_PARSER
 }
 else {
-   SV *tmp = NULL;
-   sv_catpvn_utf8_upgrade(p_state->skipped_text, beg, end - beg, tmp);
+   SV *tmp = newSVpvn(beg, end - beg);
+   sv_utf8_upgrade(tmp);
+   sv_catsv(p_state->pend_text, tmp);
+   SvREFCNT_dec(tmp);
 }
 #endif
 }


Re: HTTP::Response inconsistency

2004-12-03 Thread Gisle Aas
Harald Joerg [EMAIL PROTECTED] writes:

 HTTP::Response::clone doesn't clone the protocol either.
 This, however, can be fixed easily:

Thanks. Applied this patch to HTTP::Message so that also Requests
clone their protocol attribute.

 --- Response.pm.1.502004-12-02 21:36:42.43750 +0100
 +++ Response.pm 2004-12-02 21:37:18.34375 +0100
 @@ -47,4 +47,5 @@
  my $self = shift;
  my $clone = bless $self->SUPER::clone, ref($self);
 +$clone->protocol($self->protocol);
  $clone->code($self->code);
  $clone->message($self->message);
 
 -- 
 Cheers,
 haj


Re: HTTP::Response inconsistency

2004-12-03 Thread Gisle Aas
Harald Joerg [EMAIL PROTECTED] writes:

 As a fallback, HTTP::Response::parse could set the protocol to undef
 if it turns out to be a three-digit number, assigning this value to
 the code (after assigning to the message what was parsed as the code).

This is my preferred fix.  Just make HTTP::Response::parse deal with
what as_string spits out.  I would just make it look at the string
before splitting it.  If it starts with /\d/ split in 2 instead of 3.
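In code, the suggested logic could look something like this sketch
(split_status_line is a made-up helper name, not the shipped code):

```perl
use strict;
use warnings;

# Split a status line the way as_string may have written it: the
# protocol token is optional, so a line starting with a digit gets
# split into 2 fields instead of 3.
sub split_status_line {
    my $status_line = shift;
    my ($protocol, $code, $message);
    if ($status_line =~ /^\d/) {
        ($code, $message) = split(' ', $status_line, 2);
    }
    else {
        ($protocol, $code, $message) = split(' ', $status_line, 3);
    }
    return ($protocol, $code, $message);
}
```

split_status_line("HTTP/1.1 404 Not Found") gives ("HTTP/1.1", 404,
"Not Found"), while split_status_line("404 Not Found") gives (undef,
404, "Not Found").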

 
 Maybe the best fallback would be to write some undefined value in
 HTTP::Response::as_string if the protocol is undefined:
 
 my $status_line = $code;
 my $proto = $self->protocol;
-   $status_line = "$proto $status_line" if $proto;
+   $status_line = $proto ? "$proto $status_line"
+ : "UNKNOWN $status_line";
 
 But again, this might break existing code.

I also find this quite ugly.

 I could submit patches for all the fallbacks and workarounds -

That would be very much appreciated.

Regards,
Gisle


Re: Bug in HTML::Form label support

2004-12-03 Thread Gisle Aas
Dan Kubb [EMAIL PROTECTED] writes:

   <label>
 <input type=radio name=r1 value=1>One
   </label>

Is <label> in common use?  What browsers support it?

Regards,
Gisle


Re: HTML::Parser 3.42: some tests fail on MSWin32

2004-12-06 Thread Gisle Aas
Bjoern Hoehrmann [EMAIL PROTECTED] writes:

   HTML::Parser 3.41/3.42 fails on some tests on MSWin32, see

This should be fixed in 3.43 that I just uploaded.  The SvUTF8 flag
was not propagated correctly when replacing unterminated entities.

Regards,
Gisle


Re: suggestion for $ua-env_proxy method

2004-12-06 Thread Gisle Aas
bulk 88 [EMAIL PROTECTED] writes:

 Can the env_proxy method return the result of getting the proxy
 settings from the environment so that this will work?
 
 $EnvProxyResult = $ua->env_proxy;
 
 I would like it so I can have a proper "Using proxy settings from
 environment." line. Or a "Forced proxy settings from environment not
 found." one. Currently, it just returns false whether it gets the proxy
 settings or not.

I don't see a problem with this.  It is more likely to happen if you
are able to provide a patch.  Especially if the patch also updates the
documentation and the test suite appropriately.
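Until env_proxy itself reports this, the effect can be inspected
afterwards through the one-argument getter form of proxy() (the proxy
URL below is an example value):

```perl
use strict;
use warnings;
use LWP::UserAgent;

local $ENV{http_proxy} = "http://proxy.example.com:8080/";  # example value

my $ua = LWP::UserAgent->new;
$ua->env_proxy;                 # currently returns nothing useful

# proxy() with one argument is a getter, so we can at least check
# what was picked up from the environment:
if (my $p = $ua->proxy("http")) {
    print "Using proxy settings from environment: $p\n";
}
else {
    print "Forced proxy settings from environment not found.\n";
}
```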

Regards,
Gisle


Re: libwww@perl.org

2004-12-06 Thread Gisle Aas
Tony [EMAIL PROTECTED] writes:

 I've been trying to install the LWP bundle for several days.
 I saw that URI-1.34.tar.gz was unavailable.  I had to go to
 http://cpan.n0i.net/modules/by-module/URI/ to download URI-1.35.tar.gz.
 Why does cpan go after the old version?

Stale local index cache?

The cpan:modules/02packages.details.txt.gz index points to URI-1.35 as
it should.

Regards,
Gisle


Re: Can't use www::mechanize with an array form field

2004-12-06 Thread Gisle Aas
Tim [EMAIL PROTECTED] writes:

 I have a website written in PHP/MySQL.
 
 I'm using www::mechanize and www::mechanize::formfiller to test the site.
 
 I declare one of the form fields as an array in PHP like so:
 
  echo "<input type=\"checkbox\" name=\"cat[]\" value=\"$cat_id\">" . VarPrepForDisplay($title) . ...;
 
 which in turn creates the following HTML code that www::mechanize uses
 to test the code.
 
    <input type="checkbox" name="cat[]" value="164">
 
 
 this makes the cat field an array.  The problem is that when I try to
 use www::mechanize to submit values to this field I get the following
 error:
 
 Illegal value '211' for field 'cat[]' at /path.pl line 89
 
 Does anyone know how I can submit values to an array based form field?

I don't know what it takes in WWW::Mechanize land, but if you have
access to the HTML::Form object you can do this:

   $form->param("cat[]", 211, 213);

This will turn the 211 and 213 checkboxes on and all the other
"cat[]" checkboxes off.

Regards,
Gisle
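The semantics described above — listed values switched on, every other
checkbox in the group switched off — can be sketched without HTML::Form;
the `%boxes` hash and `set_group` helper below are illustrative stand-ins
for the form's checkbox group, not HTML::Form internals:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a form's checkbox group: value => checked flag.
my %boxes = map { $_ => 0 } qw(164 211 213 300);

# Mimic $form->param("cat[]", 211, 213): turn the listed values on,
# every other checkbox in the group off.
sub set_group {
    my ($group, @values) = @_;
    my %want = map { $_ => 1 } @values;
    $group->{$_} = $want{$_} ? 1 : 0 for keys %$group;
}

set_group(\%boxes, 211, 213);
print join(",", sort grep { $boxes{$_} } keys %boxes), "\n";  # 211,213
```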


Re: libwww-perl-5.802

2004-12-06 Thread Gisle Aas
Moshe Kaminsky [EMAIL PROTECTED] writes:

 * Gisle Aas [EMAIL PROTECTED] [01/12/04 12:02]:
  libwww-perl-5.802 is available from CPAN. The changes since 5.801 are:
  
  The HTTP::Message object now has a decoded_content() method.
  This will return the content after any Content-Encodings and
  charsets have been decoded.
  
 
 For some reason, the original content is killed in the response object 
 when I use this method - the content() method returns an empty string 
 after calling decoded_content. The reason appears to be passing 
 $$content_ref to Encode::decode in line 220 of HTTP/Message.pm. I guess 
 it's probably some problem with decode(),
 but in any case, replacing that line with
 
 my $cont = $$content_ref;
 $content_ref = \Encode::decode($charset, $cont, Encode::FB_CROAK());
 
 Solved the problem. This is with HTTP::Message version 1.52, perl 
 version 5.8.6, Encode version 2.08 on linux.

Thanks for your report.  There was a similar issue with memGunzip and
the patch I applied for it will also fix this problem.

 Also, I would like to suggest adding a flag, which will cause the 
 content() method to return the output of decoded_content(). This will 
 allow scripts which ignored the charset to automatically do the right 
 thing by simply setting this flag.

I'm not too happy about this suggestion as is.  One option is to
introduce a '$mess->decode_content' method and then make
LWP::UserAgent grow some option that makes it automatically call this
for all responses it receives.  The 'decode_content' would be like

$resp->content(encode_utf8($resp->decoded_content));

but would also fix up the Content-Encoding and Content-Type headers.

Regards,
Gisle
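A rough sketch of what such a 'decode_content' verb could do, using only
core Encode and a minimal mock object; `MockMessage` is invented for the
demo, and the method that later appeared in HTTP::Message differs in
detail (it also rewrites the Content-Encoding/Content-Type headers):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Minimal mock with just enough of the HTTP::Message interface for the demo.
package MockMessage;
sub new     { my ($class, %arg) = @_; bless {%arg}, $class }
sub content { my $s = shift; $s->{content} = shift if @_; $s->{content} }
sub charset { $_[0]{charset} }

# The suggested verb-form method: decode, then store back as UTF-8 bytes.
sub decode_content {
    my $self = shift;
    my $text = Encode::decode($self->charset, $self->content);
    $self->content(Encode::encode_utf8($text));
    # A real implementation would now also fix up the
    # Content-Encoding and Content-Type headers.
    return 1;
}

package main;
my $msg = MockMessage->new(
    content => Encode::encode("iso-8859-1", "r\x{e9}sum\x{e9}"),
    charset => "iso-8859-1",
);
$msg->decode_content;
print length($msg->content), "\n";  # 8: each e-acute is two bytes in UTF-8
```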


Re: calling decoded_content on gzipped content destroys raw content

2004-12-06 Thread Gisle Aas
Andreas Beckmann [EMAIL PROTECTED] writes:

 I found the new decoded_content method destroying the raw content if
 Content-Encoding was gzip.
 
 This happens because:
 
 Compress::Zlib::memGunzip
...
The contents of the buffer parameter are
destroyed after calling this function.
 
 I fixed this the following way:
 
 HTTP/Message.pm:
 -$content_ref = \Compress::Zlib::memGunzip($$content_ref);
 +$content_ref = \Compress::Zlib::memGunzip(my $buf = $$content_ref);
 
 I didn't check the other decoding functions, so this could happen at
 other places, too.

Encode::decode() also destroys its argument.  I've now applied the patch below.

 Thanks for the decoded_content funktion - this makes using
 compression a lot easier :-)
 
 Perhaps an option to replace the current raw content could be added,
 this would also have to change the Content-Encoding and
 Content-Type/Charset headers.

I can see that might be useful.  The 'content' is supposed to be bytes,
so the result would have to be encoded UTF-8, while 'decoded_content'
returns decoded UTF-8.

I think it is better to have a 'decode_content' method (a verb) than
for 'decoded_content' to suddenly have a side effect on the message
when given an option.

Regards,
Gisle



Index: lib/HTTP/Message.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Message.pm,v
retrieving revision 1.54
retrieving revision 1.55
diff -u -p -r1.54 -r1.55
--- lib/HTTP/Message.pm 3 Dec 2004 08:35:41 -   1.54
+++ lib/HTTP/Message.pm 6 Dec 2004 13:27:20 -   1.55
@@ -1,10 +1,10 @@
 package HTTP::Message;
 
-# $Id: Message.pm,v 1.54 2004/12/03 08:35:41 gisle Exp $
+# $Id: Message.pm,v 1.55 2004/12/06 13:27:20 gisle Exp $
 
 use strict;
 use vars qw($VERSION $AUTOLOAD);
-$VERSION = sprintf(%d.%02d, q$Revision: 1.54 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf(%d.%02d, q$Revision: 1.55 $ =~ /(\d+)\.(\d+)/);
 
 require HTTP::Headers;
 require Carp;
@@ -161,6 +161,7 @@ sub decoded_content
 {
 my($self, %opt) = @_;
 my $content_ref;
+my $content_ref_iscopy;
 
 eval {
 
@@ -183,6 +184,12 @@ sub decoded_content
	next unless $ce || $ce eq "identity";
	if ($ce eq "gzip" || $ce eq "x-gzip") {
require Compress::Zlib;
+   unless ($content_ref_iscopy) {
+   # memGunzip is documented to destroy its buffer argument
+   my $copy = $$content_ref;
+   $content_ref = \$copy;
+   $content_ref_iscopy++;
+   }
$content_ref = \Compress::Zlib::memGunzip($$content_ref);
	    die "Can't gunzip content" unless defined $$content_ref;
}
@@ -190,11 +197,13 @@ sub decoded_content
require Compress::Bzip2;
	    $content_ref = \Compress::Bzip2::decompress($$content_ref);
	    die "Can't bunzip content" unless defined $$content_ref;
+   $content_ref_iscopy++;
}
	elsif ($ce eq "deflate") {
require Compress::Zlib;
$content_ref = \Compress::Zlib::uncompress($$content_ref);
	    die "Can't inflate content" unless defined $$content_ref;
+   $content_ref_iscopy++;
}
	elsif ($ce eq "compress" || $ce eq "x-compress") {
	    die "Can't uncompress content";
@@ -202,10 +211,12 @@ sub decoded_content
	elsif ($ce eq "base64") {  # not really C-T-E, but should be harmless
require MIME::Base64;
$content_ref = \MIME::Base64::decode($$content_ref);
+   $content_ref_iscopy++;
}
	elsif ($ce eq "quoted-printable") { # not really C-T-E, but should be harmless
require MIME::QuotedPrint;
$content_ref = \MIME::QuotedPrint::decode($$content_ref);
+   $content_ref_iscopy++;
}
else {
	    die "Don't know how to decode Content-Encoding '$ce'";
@@ -218,7 +229,16 @@ sub decoded_content
$charset = lc($charset);
	if ($charset ne "none") {
require Encode;
-	    $content_ref = \Encode::decode($charset, $$content_ref, Encode::FB_CROAK());
+	    if (do {my $v = $Encode::VERSION; $v =~ s/_//g; $v} < 2.0901 &&
+		!$content_ref_iscopy)
+	    {
+		# LEAVE_SRC did not work before Encode-2.0901
+		my $copy = $$content_ref;
+		$content_ref = \$copy;
+		$content_ref_iscopy++;
+	    }
+	    $content_ref = \Encode::decode($charset, $$content_ref,
+					   Encode::FB_CROAK() | Encode::LEAVE_SRC());
}
}
 };
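The defensive-copy idiom the patch adds (copy the buffer once before
handing it to a destructive decoder) can be demonstrated in plain Perl;
`clobbering_decode` below is an invented stand-in for
Compress::Zlib::memGunzip, which is documented to destroy its buffer
argument:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a destructive decoder: it empties the caller's variable
# through the @_ alias, the way memGunzip clobbers its buffer.
sub clobbering_decode {
    my $decoded = uc $_[0];
    $_[0] = "";           # @_ aliases the caller's variable
    return $decoded;
}

my $content = "hello";
my $content_ref = \$content;

# The fix: copy once, then let the decoder destroy only the copy.
my $copy = $$content_ref;
$content_ref = \$copy;
my $result = clobbering_decode($$content_ref);

print "$result / $content\n";  # HELLO / hello  (original survives)
```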


Re: [patch] Allow a directory as lwp-download's 2nd argument

2004-12-11 Thread Gisle Aas
Radoslaw Zielinski [EMAIL PROTECTED] writes:

 The attached patch allows specifying a directory as lwp-download's
 second argument. Also makes 0 valid destination file name.

Thanks. Applied.

Regards,
Gisle


Re: [PATCH] HTTP::Daemon defaults

2004-12-11 Thread Gisle Aas
Kees Cook [EMAIL PROTECTED] writes:

 I'd like to see this patch added so that HTTP::Daemon::SSL can more 
 cleanly overload the url function without having to totally reimplement 
 it.

Thanks. Applied.

But I made the defaults 80 and http :)

 Also, could HTTP::Daemon::SSL be made part of the libwww bundle?

I don't have a problem with that if its author wants the same.

Regards,
Gisle


 --- libwww-perl-5.802/lib/HTTP/Daemon.pm  2004-04-09 13:21:43.0 
 -0700
 +++ libwww-perl-5.802-kees/lib/HTTP/Daemon.pm 2004-12-10 10:13:30.0 
 -0800
 @@ -37,10 +37,22 @@
  }
  
  
 +sub _default_port {
 +    443;
 +}
 +
 +
 +sub _default_scheme {
 +    "https";
 +}
 +
 +
 +# Implemented with calls to _default_port and _default_scheme so that
 +# HTTP::Daemon::SSL can overload them and still use this function.
  sub url
  {
      my $self = shift;
 -    my $url = "http://";
 +    my $url = $self->_default_scheme() . "://";
      my $addr = $self->sockaddr;
      if (!$addr || $addr eq INADDR_ANY) {
 	require Sys::Hostname;
 @@ -50,7 +62,7 @@
 	$url .= gethostbyaddr($addr, AF_INET) || inet_ntoa($addr);
      }
      my $port = $self->sockport;
 -    $url .= ":$port" if $port != 80;
 +    $url .= ":$port" if $port != $self->_default_port();
      $url .= "/";
      $url;
  }
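The pattern in the patch — a base-class method built from small
overridable helpers — can be shown in isolation; `Daemon` and `SSLDaemon`
here are toy classes for illustration, not the real HTTP::Daemon:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Daemon;
sub new             { bless { port => $_[1] }, $_[0] }
sub _default_port   { 80 }
sub _default_scheme { "http" }

# url() only consults the _default_* helpers, so a subclass can change
# the scheme and default port without reimplementing url() itself.
sub url {
    my $self = shift;
    my $url  = $self->_default_scheme() . "://example.com";
    $url .= ":$self->{port}" if $self->{port} != $self->_default_port();
    return "$url/";
}

package SSLDaemon;
our @ISA = ('Daemon');
sub _default_port   { 443 }
sub _default_scheme { "https" }

package main;
print Daemon->new(80)->url, "\n";      # http://example.com/
print SSLDaemon->new(8443)->url, "\n"; # https://example.com:8443/
```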


Re: HTTP::Response::base fails if the response has no request

2004-12-11 Thread Gisle Aas
Harald Joerg [EMAIL PROTECTED] writes:

 Once more I'd like to suggest a patch for HTTP::Response.
 
 When working with my homegrown responses I found that the base method
 fails fatally if the response doesn't have a request inside:
 
Can't call method uri on an undefined value at
/usr/lib/perl5/site_perl/5.8.5/HTTP/Response.pm line 78.
 
 I can work around this by defining a fake request for my responses,
 but I'd prefer if HTTP::Response::base would simply return undef if
 it finds neither a base-defining header nor an embedded request.

Seems fine.  I tweaked your patch into this one before I applied it.
Thanks!

Regards,
Gisle

Index: lib/HTTP/Response.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Response.pm,v
retrieving revision 1.50
retrieving revision 1.51
diff -u -p -r1.50 -r1.51
--- lib/HTTP/Response.pm30 Nov 2004 12:00:22 -  1.50
+++ lib/HTTP/Response.pm11 Dec 2004 14:30:00 -  1.51
@@ -75,9 +75,20 @@ sub base
     my $base = $self->header('Content-Base') ||  # used to be HTTP/1.1
                $self->header('Content-Location') ||  # HTTP/1.1
                $self->header('Base');    # HTTP/1.0
-    return $HTTP::URI_CLASS->new_abs($base, $self->request->uri);
-    # So yes, if $base is undef, the return value is effectively
-    # just a copy of $self->request->uri.
+    if ($base && $base =~ /^$URI::scheme_re:/o) {
+	# already absolute
+	return $HTTP::URI_CLASS->new($base);
+    }
+
+    my $req = $self->request;
+    if ($req) {
+        # if $base is undef here, the return value is effectively
+        # just a copy of $self->request->uri.
+        return $HTTP::URI_CLASS->new_abs($base, $req->uri);
+    }
+
+    # can't find an absolute base
+    return undef;
 }
 
 
@@ -366,6 +377,9 @@ received some redirect responses first.
 
 =back
 
+If neither of these sources provides an absolute URI, undef is
+returned.
+
 When the LWP protocol modules produce the HTTP::Response object, then
 any base URI embedded in the document (step 1) will already have
 initialized the Content-Base: header. This means that this method


Re: HTTP::Response inconsistency

2004-12-11 Thread Gisle Aas
Harald Joerg [EMAIL PROTECTED] writes:

 Gisle Aas writes:
 
  Harald Joerg [EMAIL PROTECTED] writes:
 
 As a fallback, HTTP::Response::parse could set the protocol to undef
 if it turns out to be a three-digit number, assigning this value to
 the code (after assigning to the message what was parsed as the code).
   This is my preferred fix.  Just make HTTP::Response::parse deal with
   what as_string spits out.  I would just make it look at the string
   before splitting it.  If it starts with /\d/ split in 2 instead of 3.
 
 Patch is attached.

Thanks. Applied.

Regards,
Gisle

 --- Response.pm.1.502004-12-02 21:36:42.43750 +0100
 +++ Response.pm 2004-12-03 22:10:27.421875000 +0100
 @@ -35,5 +35,11 @@
  
  my $self = $class-SUPER::parse($str);
 -my($protocol, $code, $message) = split(' ', $status_line, 3);
 +my($protocol, $code, $message);
 +if ($status_line =~ /^\d{3} /) {
 +   # Looks like a response created by HTTP::Response->new
 +   ($code, $message) = split(' ', $status_line, 2);
 +} else {
 +   ($protocol, $code, $message) = split(' ', $status_line, 3);
 +}
  $self->protocol($protocol) if $protocol;
  $self->code($code) if defined($code);
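The split-in-2-vs-3 logic of the patch can be exercised on its own; this
sketch lifts the parsing branch out of HTTP::Response::parse into a small
standalone function:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a status line the way the patch does: a line starting with
# three digits and a space has no protocol part.
sub parse_status_line {
    my ($status_line) = @_;
    my ($protocol, $code, $message);
    if ($status_line =~ /^\d{3} /) {
        # Looks like a response created by HTTP::Response->new
        ($code, $message) = split(' ', $status_line, 2);
    }
    else {
        ($protocol, $code, $message) = split(' ', $status_line, 3);
    }
    return ($protocol, $code, $message);
}

my @full = parse_status_line("HTTP/1.1 404 Not Found");
my @bare = parse_status_line("200 OK");
print "@full\n";                                     # HTTP/1.1 404 Not Found
print defined($bare[0]) ? "proto\n" : "no proto\n";  # no proto
```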


Re: How can I PUT a large file?

2004-12-13 Thread Gisle Aas
Rodrigo Ruiz [EMAIL PROTECTED] writes:

 I need to perform a PUT operation and send a very large file (several
 hundred MBytes). I have been using the following code to do this:
 
 ...
 my $header = HTTP::Headers->new;
 $header->content_type('application/octet-stream');
 $header->content_length($fileSize);
 $header->authorization_basic($usr, $pwd);
 
 my $readFunc = sub {
   read(FH, my $buf, 65536);
   return $buf;
 };
 
 my $req = HTTP::Request->new('PUT', $url, $header, $readFunc);
 ...

Seems sane.

 But after updating to the 5.802 version of LWP this code has stopped
 working.
 
 When I execute my script, it prints a warning telling me that the
 Content-Length header has been fixed, and the file in the destination
 server is corrupted.
 Looking at the code of the library, I have found these lines:
 
 # Set (or override) Content-Length header
 my $clen = $request_headers->header('Content-Length');
 if (defined($$content_ref) && length($$content_ref)) {
     $has_content++;
     if (!defined($clen) || $clen ne length($$content_ref)) {
 	if (defined $clen) {
 	    warn "Content-Length header value was wrong, fixed";
 	    hlist_remove(\@h, 'Content-Length');
 	}
 	push(@h, 'Content-Length' => length($$content_ref));
     }
 }
 elsif ($clen) {
     warn "Content-Length set when there is no content, fixed";
     hlist_remove(\@h, 'Content-Length');
 }
 
 I think these lines prevent the use of a function as the content reference.
 
 Is this a bug, or the support for function references has been removed?

No this is supposed to work.  This code block should not be entered as
there is a test for code reference content just above it.  Can you
figure out why the:

  if (ref($content_ref) eq 'CODE') {

test fails?  What is $content_ref in this case?

Regards,
Gisle
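The code-ref content style from this thread can be tried end to end
without LWP: an in-memory filehandle stands in for the large file, and
draining the closure shows it yields chunks until an empty string signals
the end (LWP's documented convention for code-reference content):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory "file" stands in for the multi-hundred-MB upload.
my $data = "x" x 100_000;
open(my $fh, '<', \$data) or die "open: $!";

# Same shape as the thread's $readFunc, just with a 4 KB chunk size.
my $readFunc = sub {
    read($fh, my $buf, 4096);
    return $buf;   # empty string at EOF ends the stream
};

# Drain the closure the way a consumer of code content would.
my ($total, $chunks) = (0, 0);
while (length(my $chunk = $readFunc->())) {
    $total  += length $chunk;
    $chunks += 1;
}
print "$chunks chunks, $total bytes\n";  # 25 chunks, 100000 bytes
```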



Re: How can I PUT a large file?

2004-12-13 Thread Gisle Aas
Gisle Aas [EMAIL PROTECTED] writes:

 No this is supposed to work.

I've now verified that using request code content like this, does
indeed work for me when posting to my own server.

Unless you can debug this problem directly with your app, please try
to create a complete (short) example program that demonstrates this
problem and send it to this list.

Regards,
Gisle



Re: Bug in HTML::Form label support

2004-12-11 Thread Gisle Aas
Dan Kubb [EMAIL PROTECTED] writes:

 Hi Gisle,
 
  Are there other form elements than input that might take labels?
 
 Yes, all the normal form elements can take labels.  I'm
 just not sure how you would use them without adding to
 or changing the interface in HTML::Form.
 
 For input tags that are radio or checkboxes it's easy:
 just set the value_name attribute with the label name
 and the existing interface will use it.  I can do that
 for other elements, but some of them inherit a noop
 value_names() method -- I didn't want to change this
 method's behaviour because it says in the docs that
 the values it returns correspond 1 to 1 with the return
 values from possible_values().
 
 Still it would be nice to set the values of a text input
 value like this:
 
   $form->value('First Name');
 
 Rather than:
 
   $form->value('contact.name.first');
 
 I wasn't going to propose any interface changes in my
 patch without checking with the you first.

Seems like it might be a good idea to introduce a 'label' attribute
for inputs, but perhaps that creates the wrong expectation for radio
and checkbox entries.  Got to ponder that some more.


  Indentation is not consistent with the rest of the code.
 
  What's your indenting style for patches?  I'm a two-space
  indenter myself.  The patch you received had tabs inserted
 manually just as I was finishing up.  I tried to find a
 pattern in HTML::Form, but the style wasn't consistent
 enough for me to pick one up -- I figured there must be
 a lot of different maintainers ;)

It seems consistent to me.  Perhaps you have tweaked your tab-stop to
not be the standard 8.

 + 1 while $attr->{value_name} =~ s/\s\z//;
 
 why not '$attr->{value_name} =~ s/\s+\z//;'
 
 Just finished a project with some large file processing..
 the 1 while version is faster (strangely enough), there
 were some benchmarks on Perlmonks I believe.

You learn something new every day.  I guess the + is too much for the
RE optimizer here then.

 Of course it makes no difference with such small strings,
 I put it in more out of habit than anything.
  
 + $attr->{value_name} =~ s/\s+/ /;
  
  There can't really be multispace anywhere since get_phrase will trim
  the text.  This would always be a noop.
 
 You're right.  I eliminated the need for regexes in
 a new patch which I've attached to this email.  I
 think I've got the formatting right this time.

The new patch has now been applied.  Thanks.

Regards,
Gisle
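The two trailing-whitespace idioms from the exchange above can at least
be checked for equivalence in a few lines (the speed claim itself would
need the Benchmark module, skipped here):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The two idioms discussed: repeated single-char strip vs. one greedy strip.
sub trim_loop   { my $s = shift; 1 while $s =~ s/\s\z//;  return $s }
sub trim_greedy { my $s = shift; $s =~ s/\s+\z//;         return $s }

# Both must agree on every input; leading whitespace is left alone.
for my $case ("value  \t\n", "no-trailing", "   lead kept  ") {
    die "idioms disagree on '$case'"
        unless trim_loop($case) eq trim_greedy($case);
}
print trim_loop("value  \t\n"), "|\n";  # value|
```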


Re: How can I PUT a large file?

2004-12-14 Thread Gisle Aas
Rodrigo Ruiz [EMAIL PROTECTED] writes:

 The error appeared on 5.8 version of LWP. My current version is 5.802,
 and it has the error fixed. Is this the first version where the bug is
 fixed? Is it enough to do an == comparison or should I use something
 like:
 
 my $ref = ($LWP::VERSION >= 5.8 && $LWP::VERSION < 5.802) ? \$readFunc
 : $readFunc;

This bug was only present in one version; libwww-perl-5.800.  If you
really still need this workaround I would make it:

   my $readFunc = sub { ... };
   $readFunc = \$readFunc if $LWP::VERSION eq "5.800"; # workaround buggy LWP version

Regards,
Gisle


Re: Libhtml parser 3.43 ??

2004-12-28 Thread Gisle Aas
The Saltydog [EMAIL PROTECTED] writes:

 I am experiencing a strange behaviour on libhtml-parser-perl v3.43 
 
 The strange behaviour is ONLY on
 this web page:
 
 http://communicator.virgilio.it

HTML::Parser got confused about how quoted strings nest when parsing
one of the script tags.  This made it treat large parts of the
document as the script element.

This buggy behaviour was introduced in v3.40 (v3.39_91).  The
following patch fixes this problem and will be present in v3.44 when
ready.  I expect that to happen soonish.

Regards,
Gisle


Index: hparser.c
===
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.118
retrieving revision 2.119
diff -u -p -u -r2.118 -r2.119
--- hparser.c   2 Dec 2004 11:52:32 -   2.118
+++ hparser.c   28 Dec 2004 13:47:44 -  2.119
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.118 2004/12/02 11:52:32 gisle Exp $
+/* $Id: hparser.c,v 2.119 2004/12/28 13:47:44 gisle Exp $
  *
  * Copyright 1999-2004, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -1522,7 +1522,7 @@ parse_buf(pTHX_ PSTATE* p_state, char *b
inside_quote = 0;
else if (*s == '\r' || *s == '\n')
inside_quote = 0;
-	    else if (*s == '"' || *s == '\'')
+	    else if (!inside_quote && (*s == '"' || *s == '\''))
inside_quote = *s;
}
}
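The fixed scanner logic can be transliterated to Perl as a sketch; `scan`
below is a toy port of the patched loop (not HTML::Parser itself), showing
that an apostrophe inside a double-quoted script string no longer leaves
the scanner inside a quote:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the fixed quote tracking: a quote only opens when we are not
# already inside one, and closes on its own kind or a line break.
# Returns the quote state (0 or the open quote char) after scanning $text.
sub quote_state_after {
    my ($text) = @_;
    my $inside_quote = 0;
    for my $s (split //, $text) {
        if ($inside_quote) {
            $inside_quote = 0
                if $s eq $inside_quote || $s eq "\r" || $s eq "\n";
        }
        elsif (!$inside_quote && ($s eq '"' || $s eq "'")) {
            $inside_quote = $s;
        }
    }
    return $inside_quote;
}

# The apostrophe inside the double-quoted string must not open a new
# single-quoted region; the scan should end outside any quote.
print quote_state_after(q{var msg = "don't panic";}) ? "in quote\n"
                                                     : "clean\n";  # clean
```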



Re: Downloading a page compressed

2004-12-30 Thread Gisle Aas
Andy Lester [EMAIL PROTECTED] writes:

 On Dec 29, 2004, at 6:02 PM, Bjoern Hoehrmann wrote:
 
  Note that LWP does not automatically remove the gzip compression in
  this
  case
 
 WWW::Mechanize does, however.

And LWP does it if you ask for $response->decoded_content instead
of $response->content.  The decoded_content method was introduced in
LWP-5.802.

Regards,
Gisle



Re: Data::Dump is missing ? t/local/httpsub.t fails

2004-12-30 Thread Gisle Aas
Gabor Szabo [EMAIL PROTECTED] writes:

 I just noticed this test file only exists in the CVS but is not
 distributed.
 
 Still I guess it should be fixed somehow (probably by skipping
 the test if the module is not there).

It's an unfinished test that I lost interest in completing :-(
If completed, then the Data::Dump reference should clearly go.

Regards,
Gisle


Re: Statistics in mech?

2005-01-14 Thread Gisle Aas
Peter Stevens [EMAIL PROTECTED] writes:

 I am using mech to scrape data from various websites. I wanted to
 collect data about the bytes sent and received by my scraper (I need
this for sizing purposes).  I looked through Mech and LWP, but did not
 see any methods which give me that information. Is there a way to do
 this?

Not directly, but you can replace the protocol handler with your own
that counts bytes passed by.  This is an example that will count the
bytes sent over http:

#!/usr/bin/perl -w

use LWP::UserAgent;
use LWP::Protocol;

LWP::Protocol::implementor('http', 'MyHTTP');
my $bytes_in = 0;
my $bytes_out = 0;

my $ua = LWP::UserAgent->new(keep_alive => 1);

for (1..3) {
    my $res = $ua->get("http://www.example.com");
    print "$_: ", $res->status_line, "\n";
}

print "received $bytes_in bytes, sent $bytes_out bytes\n";


# Overridden protocol handler that counts the bytes transferred
package MyHTTP;
use base 'LWP::Protocol::http';

package MyHTTP::Socket;
use base 'LWP::Protocol::http::Socket';

sub sysread {
    my $self = shift;
    my $n = $self->SUPER::sysread(@_);
    $bytes_in += $n if defined($n) && $n > 0;
    return $n;
}

sub syswrite {
    my $self = shift;
    my $n = $self->SUPER::syswrite(@_);
    $bytes_out += $n if defined($n) && $n > 0;
    return $n;
}

__END__

Regards,
Gisle


Re: Avoiding Alarm Clocks While Spidering

2005-01-18 Thread Gisle Aas
Justin Tang [EMAIL PROTECTED] writes:

   I am currently running a spider program derived from an open source
 Search Engine program called SWISH-E.  The spider.pl file that I am using uses
 the LWP::RobotUA class.  Now, the way I have it set up is that I have a
 program that preps the spider with a list of sites to spider,
 then the program calls the spider using backticks (``).  From there
 on, the spider becomes a zombie process, outputting results to a local
 text file for me to review later.  The problem I'm running into is that it
 seems like the LWP class has a timeout function implemented that would sleep
 the process after a period of time with a message saying "Alarm clock".
 What is happening is that, since my process is a zombie, when it is put to
 sleep the system kills the process.  Is there any way around this situation?
 Is there a command or flag in LWP::RobotUA that I can set so it would not be
 put to sleep.

There is the 'use_sleep' attribute that you might set to a FALSE value.

Regards,
Gisle



Re: Internal Server Error when GETing with WWW::Mechanize?

2005-01-18 Thread Gisle Aas
James Turnbull [EMAIL PROTECTED] writes:

 The error I get is...
 Error GETing http://www.parcelforce.com:80/portal/pw/track: Internal
 Server Error at track.pl line 5

The server is confused by something in the request that LWP sends.
This is a trace I get with 'lwp-request 
http://www.parcelforce.com:80/portal/pw/track':

GET /portal/pw/track HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE, close
Host: www.parcelforce.com:80
User-Agent: lwp-request/2.06

HTTP/1.1 500 Internal Server Error
Content-language: en-US
Content-length: 0
Content-type: text/html; charset=ISO-8859-1
Date: Tue, 18 Jan 2005 12:37:25 GMT
Server: Netscape-Enterprise/6.0
Set-Cookie: FGNCLIID=42b0olsqf5khpzen020dycwtbh27;expires=Thu, 18 Jan 2007 
12:37:26 GMT;path=/
Connection: Close

--Gisle


Re: Internal Server Error when GETing with WWW::Mechanize?

2005-01-18 Thread Gisle Aas
Gisle Aas [EMAIL PROTECTED] writes:

 James Turnbull [EMAIL PROTECTED] writes:
 
  The error I get is...
  Error GETing http://www.parcelforce.com:80/portal/pw/track: Internal
  Server Error at track.pl line 5
 
 The server is confused by something in the request that LWP sends.

This is a buggy server that crashes unless the request sent has an
Accept header.  It does not appear to matter what you put in it, as
demonstrated by running:

  $ lwp-request -H Accept:foo http://www.parcelforce.com:80/portal/pw/track

In your app you can work around this problem by telling LWP to always
send an Accept header using code like:

  $agent->default_header(Accept => "text/*");

(The default_header method was introduced in LWP-5.800).

Regards,
Gisle


Re: [PMX:VIRUS] HTML::Parser and entities

2005-01-24 Thread Gisle Aas
Steve Sapovits [EMAIL PROTECTED] writes:

 Is there a way to get HTML::Parser to leave entities in text alone?

Just use 'text' argspec and you get the text exactly as it is.

 There is the attr_encode() method, but that only appears to affect
 attributes.  Basically I have code that wants to selectively remove
 some tags but leave others and entities intact.

The hstrip example does exactly this.

http://search.cpan.org/src/GAAS/HTML-Parser-3.45/eg/hstrip

Regards,
Gisle


Re: URI module problems

2005-04-30 Thread Gisle Aas
Please provide the output of these commands:

   perl -MStorable\ 99

and in the unpacked URI directory run:

   perl Makefile.PL
   make && perl -Mblib t/storable.t
 
 t/storable..FAILED tests 1-3
 Failed 3/3 tests, 0.00% okay
 t/urn-isbn..skipped: Needs the Business::ISBN module installed
 t/urn-oid...ok
 Failed Test  Status Wstat Total Fail  Failed  List of Failed
 
 
 t/storable.t   33 100.00%  1-3
 1 test and 2 subtests skipped.
 Failed 1/31 test scripts, 96.77% okay. 3/466 subtests failed, 99.36%
 okay.
 *** Error code 29
 make: Fatal error: Command failed for target `test_dynamic'
 
 The Storable module path was in the PERL5LIB environment
 variable when I tried to compile URI.  Is the path.al file
 dependent on URI finding and using the Storable files?

I have no idea what path.al is here.

You can probably also just ignore this test error and then just run
'make install' for URI to get going.  The failure just means that
something prevents URI objects from being stored and retrieved with
Storable.  This might not matter if the code you run does not do this.

Regards,
Gisle
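What the failing t/storable.t exercises can be reproduced with core
Storable alone; if this round-trip works on the reporter's machine, the
failure is specific to how URI hooks into Storable (the `FakeURI` class
here is just an illustrative stand-in):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# A blessed object stands in for a URI; freeze/thaw must round-trip it.
my $obj  = bless { scheme => "http", host => "www.cpan.org" }, "FakeURI";
my $copy = thaw(freeze($obj));

print ref($copy), " ", $copy->{host}, "\n";  # FakeURI www.cpan.org
```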


Re: statu_line

2005-05-16 Thread Gisle Aas
The Saltydog [EMAIL PROTECTED] writes:

 This is my simple script:
 
 ==
 require LWP::UserAgent;
  
  my $ua = LWP::UserAgent->new;
  $ua->timeout(10);
  $ua->env_proxy;
  
  my $response = $ua->get('http://search.cpan.org/');
  
  if ($response->is_success) {
      print $response->content;  # or whatever
  }
  else {
      die $response->status_line;
  }
 ===
 
 If I type a wrong url instead of www.cpan.org, the script doesn't
 return a status_line... This is the program output:
 
 HTTP::Response=HASH(0x845d084)-status_line
 
 Where am I wrong?

I bet your script has quotes around the $response->status_line
expression.  The program above does not produce the output you claim.

Regards,
Gisle
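The symptom in the report — `HTTP::Response=HASH(0x...)->status_line`
printed literally — is exactly what double-quoting a method call
produces, since Perl interpolates only the variable, not the method call
after it.  A mock object (invented here, no LWP needed) reproduces it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package MockResponse;
sub new         { bless {}, shift }
sub status_line { "404 Not Found" }

package main;
my $response = MockResponse->new;

# Method calls do not interpolate: only $response itself is expanded,
# and "->status_line" is left as literal text.
my $wrong = "$response->status_line";
my $right = $response->status_line;

print "$wrong\n";   # e.g. MockResponse=HASH(0x5562...)->status_line
print "$right\n";   # 404 Not Found
```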


Re: HTML::Parser: how can I reset report_tags to report all tags?

2005-06-14 Thread Gisle Aas
Norbert Kiesel [EMAIL PROTECTED] writes:

 I tried to use ->ignore_tags(()) and ->ignore_tags(qw(none)), but it
 seems that after calling ->report_tags() once it always uses a positive
 tag filter.

Calling ->report_tags() without any arguments should reset the filter.

Regards,
Gisle


Re: parsing bug in HTTP::Message::parse()

2005-06-16 Thread Gisle Aas
Brian Hirt [EMAIL PROTECTED] writes:

 Any news on this?  It's a pretty major bug.

I don't see anything wrong when running your test program.  What
version of LWP are you using?

Regards,
Gisle


Re: printing the redirections responses

2005-07-21 Thread Gisle Aas
Octavian Rasnita [EMAIL PROTECTED] writes:

 my $response = $ua-request($request);
 print $response-as_string();

[...]

 The response is the final page, even though there is a redirection until
 this page is returned. Is it possible to get and print that redirect HTTP
 header?

Just use $ua->simple_request() instead of $ua->request() to dispatch
the request.

--Gisle


<    1   2   3   4   5   6   7   >