Re: url parsing in URI / HTTP::Request
Ulrich Wisser <[EMAIL PROTECTED]> writes:

> today I got an error code 400 (bad request) from my url checker.
> When I tested the url in my browser it worked fine.  The url is
> http://www.leomajken.se?source=digdev
> I realize that there is a "/" missing after the domain name.  I don't
> know if the problem is in URI or HTTP::Request.  URI seems to accept
> the URL, but when I try to make a request I get the error code 400.
> Shouldn't that work?

It should.  This is a bug in LWP.  This is a fix:

Index: lib/LWP/Protocol/http.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/Protocol/http.pm,v
retrieving revision 1.66
diff -u -p -r1.66 http.pm
--- lib/LWP/Protocol/http.pm	23 Oct 2003 19:11:33 -0000	1.66
+++ lib/LWP/Protocol/http.pm	10 Mar 2004 20:09:36 -0000
@@ -147,7 +147,7 @@ sub request
 	$host = $url->host;
 	$port = $url->port;
 	$fullpath = $url->path_query;
-	$fullpath = "/" unless length $fullpath;
+	$fullpath = "/$fullpath" unless $fullpath =~ m,^/,;
     }
 
     # connect to remote site
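The fix is easy to exercise outside LWP; a core-Perl sketch comparing the old and new normalization of the request path (the helper names are made up for illustration; the real code takes the path from $url->path_query):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Old behavior: only an entirely empty path was replaced with "/".
sub old_fullpath {
    my $fullpath = shift;
    $fullpath = "/" unless length $fullpath;
    return $fullpath;
}

# Fixed behavior: any path not starting with "/" gets one prepended, so
# "?source=digdev" becomes the valid request target "/?source=digdev".
sub new_fullpath {
    my $fullpath = shift;
    $fullpath = "/$fullpath" unless $fullpath =~ m,^/,;
    return $fullpath;
}

print old_fullpath("?source=digdev"), "\n"; # "?source=digdev" - invalid request line, hence the 400
print new_fullpath("?source=digdev"), "\n"; # "/?source=digdev"
print new_fullpath(""), "\n";               # "/"
```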
Re: robot/ua-get..........FAILED tests 1-3, 5, 7
"ALexander N. Treyner" <[EMAIL PROTECTED]> writes:

> Could somebody help me to figure out what's wrong?

It means that your machine can't talk to itself, probably because the
hostname of your machine does not resolve to itself.  If you are on a
Unix system, then ping `hostname` needs to work.

--Gisle
Re: [rfc] HTTP::Multipart
Joshua Hoblitt <[EMAIL PROTECTED]> writes:

> I've been kicking around the idea for this module for a few days now
> and I'd like to commit it to code.  The module I'm proposing would be
> called C<HTTP::Multipart>.  It would accept an C<HTTP::Response>
> object and determine if it indeed does contain a multipart HTTP
> message.  If it does then the passed object would be cloned once for
> every part in the message and the 'Content-Length', 'Content-Type',
> and 'Content-Range' headers would be adjusted along with the
> C<content> value to reflect one of the parts.  Then a list of
> non-multipart C<HTTP::Response> objects would be returned.  I believe
> this would simplify handling multipart responses.
>
> 1) Is this a good idea?

The use case for this seems a bit unclear to me.  How would you use
this module?  I don't understand what kind of handling multipart
responses require.

> 2) Is HTTP::Multipart a good name?

I think all that would be needed for this is a method on
HTTP::Message.  It could for instance be called 'parts'.  If the
method is not too long and is generally useful then it should just go
into that module.

> 3) Is it appropriate to require C<HTTP::Response> objects?  Would
>    just requiring objects to be ISA C<HTTP::Message> or
>    C<HTTP::Headers> be better?

It is best not to require any specific class at all.  Just depend on a
certain interface, i.e. a set of methods that must be implemented.

> 4) Should there be an C<HTTP::Multipart> object that contains a list
>    of modified C<HTTP::Response> objects or would a class method be
>    sufficient?

What you describe above appears to be a simple function that takes one
HTTP::Response (or HTTP::Message) and breaks it into (possibly) many
smaller ones.  I don't see any need for an extra object or class here.

> 5) if a class method is sufficient, what should its name be?  (ie.,
>    C<parase>?)

What does 'parase' mean?

Regards,
Gisle
Re: [rfc] HTTP::Multipart
Thinking some more.  This is what I think I would like to see.  We
introduce the methods 'parent', 'parts' and 'add_part' to
HTTP::Message.

  $msg2 = $msg->parent

This attribute points back to the parent message.  If defined, it
makes this message a message part belonging to the parent message.
This attribute is set by the other methods described below.  We might
consider automatic delegation to the parent, but I'm not sure how
useful that would be.

  @parts = $msg->parts

This will return a list of HTTP::Message objects.  If the content-type
of $msg is not multipart/* or message/* then this will return the
empty list.  The returned message part objects are read only (so that
future versions can make it possible to modify the parent by modifying
the parts).  If the content-type of $msg is message/* then there will
only be one part.  If the content-type is message/http, then this will
return either an HTTP::Request or an HTTP::Response object.

  $msg->parts( @parts )
  $msg->parts( \@parts )

This will set the content of the message to be the provided list of
parts.  If the old content-type is not multipart/* or message/* then
it is set to multipart/mixed and the other content-* headers are
cleared as well.  The part objects now belong to $msg and can not be
set to be parts of other messages, but clones can be made parts of
other messages.  This method will croak if the provided parts are not
independent.  This method will also croak if the content type is
message/* and more than one part is provided.  The array ref form is
provided so that an empty list can be passed without any special
cases.

  $msg->add_part( $part )

This will add a part to a message.  If the old content-type is not
multipart/* then the old content (together with all content-* headers)
will be made part #1 and the content-type made multipart/mixed before
the new part is added.

  $part->clone

Will return an independent part object (i.e. the parent attribute will
always be cleared).  This ensures that this works:

  $msg2->parts([map $_->clone, $msg1->parts]);

When the parts are updated via the parts() or add_part() methods, a
suitable boundary will be created automatically so that it is unique
(like HTTP::Request::Common currently does).  If the boundary is set
explicitly then it is kept, and the user is responsible for ensuring
that the string "--$boundary" does not occur in the content of any
part.

The current HTTP::Message class also provides the 'protocol()' method,
which does not make sense for all parts.  This method should be moved
out or replicated in both HTTP::Request and HTTP::Response.

Regards,
Gisle
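Under the hood, a parts() implementation has to split the multipart body on its boundary string.  A rough core-Perl sketch of just that splitting step (split_multipart is a hypothetical helper; a real implementation would take the boundary from the Content-Type header and treat preamble and epilogue more carefully):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a multipart body into its raw parts.
sub split_multipart {
    my ($boundary, $body) = @_;
    # Parts are delimited by "--$boundary" lines; the final
    # "--$boundary--" line terminates the message.
    my @chunks = split /^--\Q$boundary\E(?:--)?\r?\n?/m, $body;
    shift @chunks;            # drop the preamble before the first boundary
    s/\r?\n\z// for @chunks;  # strip each part's trailing newline
    return @chunks;
}

my $body = join "",
    "--xyz\n", "Content-Type: text/plain\n\npart one\n",
    "--xyz\n", "Content-Type: text/plain\n\npart two\n",
    "--xyz--\n";

my @parts = split_multipart("xyz", $body);
print scalar(@parts), " parts\n";   # prints "2 parts"
```

Each chunk still carries its own headers, which is exactly what lets the proposed method hand back full HTTP::Message objects per part.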
Re: [rfc] HTTP::Multipart
Paul Marquess <[EMAIL PROTECTED]> writes:

> Does this interface allow you to manipulate nested multi-part
> messages?

Yes.  The parts() method on HTTP::Message returns HTTP::Message
objects, so there should not be any problem nesting this as you see
fit.

The MIME::Entity class provides a method called parts_DFS that returns
all parts in a depth-first-search order.  I don't see a need for it in
HTTP::Message, and it can easily be constructed from the parts()
method.

Regards,
Gisle
Re: Cookies Redirection
Paul Marquess <[EMAIL PROTECTED]> writes:

> This is from UserAgent::request (LWP 5.76) where it is dealing with a
> redirect response:
>
>    # These headers should never be forwarded
>    $referral->remove_header('Host', 'Cookie');
>
> I've found that while writing a script to automate logging on to
> Yahoo Web mail, I've needed to change this behaviour in a private
> copy of UserAgent::request to retain the Cookies.

The reason the Cookie headers are removed is that they will be added
automatically again if the redirect goes to a place that requires
cookies.  This happens even if the redirect goes to the same place as
the original request.

> FYI, logging onto Yahoo involves dealing with a series of 302
> responses.  The first of these responses (from
> http://login.yahoo.com) is a 302 that redirects back to itself - this
> response has a Set-Cookie header that needs to be applied to the
> redirection request to continue with the login.

That should just work.  If it does not, it is a bug.

> Apart from the fact that this behaviour is being used in the wild, my
> reading of RFC 2109 is that this use of a Set-Cookie is ok because
> the domain attribute in the Cookie still refers to .yahoo.com.

Can you provide a trace of the sequence of requests/responses that are
exchanged, and the content of the cookie_jar as this happens?

Regards,
Gisle
Re: Error when running LWP
Octavian Rasnita <[EMAIL PROTECTED]> writes:

> Hi all,
>
> I have received the following error when I tried to run a simple
> script that only downloads and prints a page.  The script runs fine
> under Windows, but it gives this error when running under Linux.  Do
> you know what the cause of this error could be?  Please tell me how I
> can solve it.  The error is:
>
>   Can't locate auto/Compress/Zlib/autosplit.ix in @INC (@INC contains:
>   /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0
>   /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
>   /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl
>   /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
>   /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl .) at
>   /usr/lib/perl5/5.8.0/AutoLoader.pm line 158.
>   at /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Compress/Zlib.pm line 16

It looks like Compress::Zlib is not properly installed on the system.
LWP will try to load it if it is available.  I bet you get a similar
error with:

  perl -MCompress::Zlib -e1

To fix this situation either remove
/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Compress/Zlib.pm
or reinstall Compress::Zlib.

--Gisle
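The one-liner probe above generalizes to any optional module; a small core-Perl sketch of the same guarded-require check LWP performs (module_ok is a hypothetical helper name, and No::Such::Module::Here is deliberately bogus):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Check whether a module can be loaded, the way LWP probes for
# optional modules such as Compress::Zlib.
sub module_ok {
    my $module = shift;
    (my $file = $module) =~ s{::}{/}g;       # Foo::Bar -> Foo/Bar.pm
    my $ok = eval { require "$file.pm"; 1 };
    warn "cannot load $module: $@" unless $ok;
    return $ok ? 1 : 0;
}

print module_ok("File::Spec") ? "ok\n" : "broken\n";              # core module: ok
print module_ok("No::Such::Module::Here") ? "ok\n" : "broken\n";  # fails, with the reason on stderr
```

A broken half-installation like the one above would show up here as a load failure with the AutoLoader message as the reason.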
Re: Bug submitting large HTTP requests
Jamie Lokier <[EMAIL PROTECTED]> writes:

> The subroutine Net::HTTP::Methods::write_request calls print, but
> doesn't check the return value.  It's a non-blocking socket, so it's
> quite normal for the print to do a short write if the string is very
> large -- larger than the socket transmit buffer.

I would believe that print should be responsible for handling short
writes itself.  On what system are you running and what perl version
are you using?

What could make sense is to rewrite Net::HTTP so that it uses syswrite
all over the place instead.  With it we can easily handle short writes
ourselves.

Regards,
Gisle
Re: [PATCH] LWP::RobotUA case-sensitive check for Disallow
Liam Quinn <[EMAIL PROTECTED]> writes:

> LWP::RobotUA won't parse a robots.txt file if the file does not
> contain "Disallow".  The check for "Disallow" is case sensitive, but
> according to the robot exclusion standard, field names are case
> insensitive.  This causes LWP::RobotUA to ignore some robots.txt
> files that it should parse.
>
> Attached is a patch that makes the check for "Disallow" case
> insensitive.  The patch is against libwww-perl 5.76 (RobotUA.pm
> 1.23).

Thanks!  Applied as:

Index: lib/LWP/RobotUA.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/RobotUA.pm,v
retrieving revision 1.23
retrieving revision 1.24
diff -u -p -r1.23 -r1.24
--- lib/LWP/RobotUA.pm	24 Oct 2003 11:13:03 -0000	1.23
+++ lib/LWP/RobotUA.pm	6 Apr 2004 11:02:50 -0000	1.24
@@ -1,10 +1,10 @@
 package LWP::RobotUA;
 
-# $Id: RobotUA.pm,v 1.23 2003/10/24 11:13:03 gisle Exp $
+# $Id: RobotUA.pm,v 1.24 2004/04/06 11:02:50 gisle Exp $
 
 require LWP::UserAgent;
 @ISA = qw(LWP::UserAgent);
-$VERSION = sprintf("%d.%02d", q$Revision: 1.23 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf("%d.%02d", q$Revision: 1.24 $ =~ /(\d+)\.(\d+)/);
 
 require WWW::RobotRules;
 require HTTP::Request;
@@ -126,7 +126,7 @@ sub simple_request
 	my $fresh_until = $robot_res->fresh_until;
 	if ($robot_res->is_success) {
 	    my $c = $robot_res->content;
-	    if ($robot_res->content_type =~ m,^text/, && $c =~ /Disallow/) {
+	    if ($robot_res->content_type =~ m,^text/, && $c =~ /^Disallow\s*:/mi) {
 		LWP::Debug::debug("Parsing robot rules");
 		$self->{'rules'}->parse($robot_url, $c, $fresh_until);
 	    }

-- 
Liam Quinn

--- LWP/RobotUA.pm.orig	2003-10-24 07:13:03.000000000 -0400
+++ LWP/RobotUA.pm	2004-04-03 17:59:04.000000000 -0500
@@ -126,7 +126,7 @@
 	my $fresh_until = $robot_res->fresh_until;
 	if ($robot_res->is_success) {
 	    my $c = $robot_res->content;
-	    if ($robot_res->content_type =~ m,^text/, && $c =~ /Disallow/) {
+	    if ($robot_res->content_type =~ m,^text/, && $c =~ /Disallow/i) {
 		LWP::Debug::debug("Parsing robot rules");
 		$self->{'rules'}->parse($robot_url, $c, $fresh_until);
 	    }
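The behavioural difference between the old and new test is easy to demonstrate on a robots.txt that uses lowercase field names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A robots.txt as served by a site that lowercases its field names.
my $robots = <<'EOT';
User-agent: *
disallow: /private/
EOT

# Old check: case-sensitive substring match, so "disallow" is missed.
print "old: ", ($robots =~ /Disallow/        ? "parsed" : "ignored"), "\n"; # ignored
# New check: case-insensitive, anchored to the start of a line.
print "new: ", ($robots =~ /^Disallow\s*:/mi ? "parsed" : "ignored"), "\n"; # parsed
```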
Re: [PATCH] WWW::RobotRules user-agent matching
Liam Quinn <[EMAIL PROTECTED]> writes:

> WWW::RobotRules attempts to trim the robot's User-Agent before
> comparing it with the User-agent field of a robots.txt file:
>
>    # Strip it so that it's just the short name.
>    # I.e., "FooBot"                                      => "FooBot"
>    #       "FooBot/1.2"                                  => "FooBot"
>    #       "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]" => "FooBot"
>
>    delete $self->{'loc'};   # all old info is now stale
>    $name = $1 if $name =~ m/(\S+)/; # get first word
>    $name =~ s!/?\s*\d+.\d+\s*$!!;  # loose version
>
> My robot's name is "WDG_SiteValidator/1.5.6".  The above code changes
> the name to "WDG_SiteValidator/1.", which causes it not to match a
> robots.txt User-agent field of "WDG_SiteValidator".
>
> I've attached a patch against libwww-perl 5.76 (WWW::RobotRules 1.26)
> that replaces the last line above with
>
>    $name =~ s!/.*!!;  # lose version
>
> which seems to cover the various cases correctly.

Agree.  Patch applied.  Thanks!

Regards,
Gisle

--- WWW/RobotRules.pm.orig	2003-10-23 15:11:33.000000000 -0400
+++ WWW/RobotRules.pm	2004-04-03 18:06:01.000000000 -0500
@@ -187,7 +187,7 @@
 	delete $self->{'loc'};   # all old info is now stale
 	$name = $1 if $name =~ m/(\S+)/; # get first word
-	$name =~ s!/?\s*\d+.\d+\s*$!!;  # loose version
+	$name =~ s!/.*!!;  # lose version
 	$self->{'ua'}=$name;
     }
     $old;
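The difference between the two substitutions shows up on multi-component version numbers; a quick core-Perl comparison (the helper names are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub strip_old { my $n = shift; $n =~ s!/?\s*\d+.\d+\s*$!!; $n }  # old, buggy
sub strip_new { my $n = shift; $n =~ s!/.*!!;              $n }  # patched

# The old regex only removes a trailing "digits.digits", so it eats
# ".5.6" but leaves the "/1." behind:
print strip_old("WDG_SiteValidator/1.5.6"), "\n";  # "WDG_SiteValidator/1." - wrong
print strip_new("WDG_SiteValidator/1.5.6"), "\n";  # "WDG_SiteValidator"
print strip_new("FooBot/1.2"), "\n";               # "FooBot"
print strip_new("FooBot"), "\n";                   # "FooBot" (unchanged)
```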
Re: Suggest change to WWW::RobotRules
Craig Macdonald <[EMAIL PROTECTED]> writes:

> Hi, just a short note to suggest a 1-line change to WWW::RobotRules.
> When loading http://www.maths.gla.ac.uk/robots.txt I noticed
> WWW::RobotRules giving me warnings:
>
>   RobotRules: Unexpected line:  User-agent: *
>   RobotRules: Unexpected line:  Disallow: /error/
>   RobotRules: Unexpected line:  Disallow: /tla_review/
>   etc.
>
> The problem is that WWW::RobotRules doesn't support leading space on
> a robots.txt line.  As such, I would suggest adding
>
>   s/^\s*//;
>
> at line 51 of RobotRules.pm.  I'm not sure how frequent a problem
> this might be, but it seems important to make WWW::RobotRules as
> robust at parsing robots.txt files as possible, in order to prevent
> parts of sites being crawled that shouldn't be.

The spec at http://www.robotstxt.org/wc/norobots.html states that
leading space is not allowed, but I agree that LWP should be a bit
more liberal when parsing.  I've now applied the following patch.

Regards,
Gisle

Index: lib/LWP/RobotUA.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/RobotUA.pm,v
retrieving revision 1.24
diff -u -p -r1.24 RobotUA.pm
--- lib/LWP/RobotUA.pm	6 Apr 2004 11:02:50 -0000	1.24
+++ lib/LWP/RobotUA.pm	6 Apr 2004 11:36:10 -0000
@@ -126,7 +126,7 @@ sub simple_request
 	my $fresh_until = $robot_res->fresh_until;
 	if ($robot_res->is_success) {
 	    my $c = $robot_res->content;
-	    if ($robot_res->content_type =~ m,^text/, && $c =~ /^Disallow\s*:/mi) {
+	    if ($robot_res->content_type =~ m,^text/, && $c =~ /^\s*Disallow\s*:/mi) {
 		LWP::Debug::debug("Parsing robot rules");
 		$self->{'rules'}->parse($robot_url, $c, $fresh_until);
 	    }

Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.28
diff -u -p -r1.28 RobotRules.pm
--- lib/WWW/RobotRules.pm	6 Apr 2004 11:10:49 -0000	1.28
+++ lib/WWW/RobotRules.pm	6 Apr 2004 11:36:11 -0000
@@ -54,7 +54,7 @@ sub parse {
 		last if $is_me; # That was our record. No need to read the rest.
 		$is_anon = 0;
 	    }
-	    elsif (/^User-Agent:\s*(.*)/i) {
+	    elsif (/^\s*User-Agent\s*:\s*(.*)/i) {
 		$ua = $1;
 		$ua =~ s/\s+$//;
 		if ($is_me) {
@@ -68,7 +68,7 @@ sub parse {
 		    $is_me = 1;
 		}
 	    }
-	    elsif (/^Disallow\s*:\s*(.*)/i) {
+	    elsif (/^\s*Disallow\s*:\s*(.*)/i) {
 		unless (defined $ua) {
 		    warn "RobotRules: Disallow without preceding User-agent\n";
 		    $is_anon = 1;  # assume that User-agent: * was intended
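The effect of the extra \s* can be seen on an indented Disallow line like the ones from the robots.txt above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "   Disallow: /error/";   # leading whitespace, as served by the site

# Strict pattern: anchored at the very start of the line, so the
# indented field is rejected and reported as an "Unexpected line".
print "strict : ",
    ($line =~ /^Disallow\s*:\s*(.*)/i ? "match: $1" : "no match"), "\n";  # no match

# Liberal pattern: allows leading space, so the rule is parsed.
print "liberal: ",
    ($line =~ /^\s*Disallow\s*:\s*(.*)/i ? "match: $1" : "no match"), "\n";  # match: /error/
```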
Re: [PATCH] redirection in LWP::Simple
Ward Vandewege <[EMAIL PROTECTED]> writes:

> I had some trouble using LWP::Simple (v1.36 from Debian's libwww-perl
> package version 5.69-4) with this url:
>
>   http://www.tvgids.nl/
>
> It turns out that site does an immediate redirect when hitting that
> url.  The webserver seems to be broken because it writes 'location:'
> instead of 'Location:' in the HTTP headers.  The latest LWP::Simple
> version (v1.38 from libwww-perl 5.76) does not understand 'location'
> with a lower-case first letter either.
>
> The patch below (against v1.38) fixes LWP::Simple to accept a
> lowercase 'location' header.  In the mindset of 'Be liberal in what
> you receive, and conservative in what you send', is this worth adding
> to libwww-perl?

It sure is.  Now applied.  Thanks!

Regards,
Gisle

> Thanks,
> Ward Vandewege

--- Simple.pm	2003-12-31 14:15:59.000000000 -0500
+++ Simple.pm	2003-12-31 14:16:24.000000000 -0500
@@ -180,7 +180,7 @@
     if ($buf =~ m,^HTTP/\d+\.\d+\s+(\d+)[^\012]*\012,) {
 	my $code = $1;
 	#print "CODE=$code\n$buf\n";
-	if ($code =~ /^30[1237]/ && $buf =~ /\012Location:\s*(\S+)/) {
+	if ($code =~ /^30[1237]/ && $buf =~ /\012Location:\s*(\S+)/i) {
 	    # redirect
 	    my $url = $1;
 	    return undef if $loop_check{$url}++;
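The effect of the added /i modifier can be demonstrated on a header block like the one the broken server sends (LWP::Simple's internal buffer joins header lines with \012, which is what the regex anchors on; the exact redirect target here is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A response whose server spells the header "location:" in lowercase.
my $buf = "HTTP/1.1 302 Found\012location: http://www.tvgids.nl/\012\012";

my ($url_old) = $buf =~ /\012Location:\s*(\S+)/;    # case-sensitive: misses it
my ($url_new) = $buf =~ /\012Location:\s*(\S+)/i;   # case-insensitive: finds it

print defined $url_old ? $url_old : "(redirect missed)", "\n";  # (redirect missed)
print defined $url_new ? $url_new : "(redirect missed)", "\n";  # http://www.tvgids.nl/
```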
Re: Bug submitting large HTTP requests
Jamie Lokier <[EMAIL PROTECTED]> writes:

> Gisle Aas wrote:
>>> The subroutine Net::HTTP::Methods::write_request calls print, but
>>> doesn't check the return value.  It's a non-blocking socket, so
>>> it's quite normal for the print to do a short write if the string
>>> is very large -- larger than the socket transmit buffer.
>>
>> I would believe that print should be responsible for handling short
>> writes itself.  On what system are you running and what perl
>> version are you using?
>
> Red Hat 9, perl-5.8.0-88.3.
>
> print normally does handle short writes and keeps writing until it's
> done the whole string.  However, it will stop when it gets an error
> code, and it does: EAGAIN, because the socket transmit buffer is
> full and it's non-blocking.

Yes.  That's a problem, but it might be argued that the user of
$http->write_request() is responsible for checking for the error.  The
method will return FALSE on error and set $!, just like print :)

>> What could make sense is to rewrite Net::HTTP so that it uses
>> syswrite all over the place instead.  With it we can easily handle
>> short writes ourselves.
>
> It's not the short writes as such, it's the EAGAINs.

This problem is exactly why LWP::Protocol::http never uses
write_request() itself, but calls format_request() and then uses
syswrite() to get the bytes out on the wire.

Regards,
Gisle
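For reference, a sketch of the kind of explicit write loop that syswrite-based code needs; this is not the Net::HTTP implementation, just an illustration written against an anonymous temp file so it runs anywhere (on a non-blocking socket you would wait for writability, e.g. with select(), instead of immediately retrying on EAGAIN):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Errno qw(EAGAIN EINTR);

# Write all of $data to $fh, coping with short writes and retryable errors.
sub write_all {
    my ($fh, $data) = @_;
    my $off = 0;
    while ($off < length $data) {
        my $n = syswrite($fh, $data, length($data) - $off, $off);
        if (!defined $n) {
            # On a non-blocking socket, wait for writability here.
            next if $! == EAGAIN || $! == EINTR;
            return undef;    # real error; caller inspects $!
        }
        $off += $n;          # advance past the bytes actually written
    }
    return $off;
}

open my $fh, '+>', undef or die "tempfile: $!";   # anonymous temp file
my $data = "x" x 100_000;
my $wrote = write_all($fh, $data);
print "wrote $wrote bytes\n";   # wrote 100000 bytes
```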
libwww-perl-5.77
I've been going through the backlog in my LWP folder today and managed
to apply some of the patches found there.  I now have to return to my
real work, but I still have lots of email I did not find time to look
into.  The result so far has just been uploaded to CPAN as
libwww-perl-5.77.  Feel free to remind me of important patches still
missing, especially if the patch also comes with updates to the test
suite and documentation.

These are the changes since version 5.76:

  LWP::Simple did not handle redirects properly when the Location
  header used uncommon letter casing.  Patch by Ward Vandewege
  [EMAIL PROTECTED].

  LWP::UserAgent passed the wrong request to redirect_ok().  Patch by
  Ville Skyttä [EMAIL PROTECTED].
  https://rt.cpan.org/Ticket/Display.html?id=5828

  LWP did not handle URLs like http://www.example.com?foo=bar
  properly.

  The LWP::RobotUA constructor now accepts key/value arguments in the
  same way as LWP::UserAgent.  Based on a patch by Andy Lester
  [EMAIL PROTECTED].

  LWP::RobotUA did not parse robots.txt files that spelled "Disallow:"
  using uncommon letter casing.  Patch by Liam Quinn
  [EMAIL PROTECTED].

  WWW::RobotRules now allows leading space when parsing robots.txt
  files, as suggested by Craig Macdonald [EMAIL PROTECTED].  We now
  also allow space before the colon.

  WWW::RobotRules did not handle User-Agent names that use complex
  version numbers.  Patch by Liam Quinn [EMAIL PROTECTED].

  Case insensitive handling of hosts and domain names in
  HTTP::Cookies.  https://rt.cpan.org/Ticket/Display.html?id=4530

  The bundled media.types file now maps video/quicktime to the .mov
  extension, as suggested by Michel Koppelaar [EMAIL PROTECTED].

  Experimental support for composite messages, currently implemented
  by the HTTP::MessageParts module.  Based on ideas from Joshua
  Hoblitt [EMAIL PROTECTED].

  Fixed libscan in Makefile.PL.  Patch by Andy Lester
  [EMAIL PROTECTED].

  The HTTP::Message constructor now accepts a plain array reference
  as its $headers argument.

  The return value of the HTTP::Message as_string() method now
  conforms better to the HTTP wire layout.  No additional "\n" are
  appended to the as_string value for HTTP::Request and
  HTTP::Response.

  The HTTP::Request as_string now replaces a missing method or URI
  with "-" instead of "[NO METHOD]" and "[NO URI]".  We don't want
  values with spaces in them, because that makes them harder to
  parse.

Enjoy!

Regards,
Gisle
Re: Latest LWP fails tests
"Scott R. Godin" <[EMAIL PROTECTED]> writes:

> It seems to require Data::Dump which I do not have installed.

libwww-perl-5.78 has been uploaded.  It fixes this problem.

Regards,
Gisle
Re: libwww-perl-5.77
[EMAIL PROTECTED] (François Pons) writes:

> Gisle Aas <[EMAIL PROTECTED]> writes:
>
>>> I've been going through the backlog in my LWP folder today and
>>> managed to apply some of the patches found there.  I now have to
>>> return to my real work, but I still have lots of email I did not
>>> find time to look into.  The result so far has just been uploaded
>>> to CPAN as libwww-perl-5.77.  Feel free to remind me of important
>>> patches still missing, especially if the patch also comes with
>>> updates to the test suite and documentation.
>>>
>>> I wonder about the code in HTML::Form that handles inputs in the
>>> disabled state, which get enabled again by JavaScript code.  This
>>> is a simple modification but it is not handled.  I will agree
>>> there is no RFC allowing this, and it is more of a hack than
>>> anything else, but as it stands nothing allows us to get the
>>> disabled input back.  Is there any reason not to use this very
>>> simple patch?
>>
>> I think the patch is wrong.  The data from a disabled input should
>> not be sent back unless it is enabled.  Your patch effectively
>> always enables them.  To get this right we would have to track
>> enabledness and then provide a way of tweaking this attribute.
>
> This is a patch implementing this.  It exposes the 'readonly' and
> 'disabled' attributes for form inputs.

The patch has been applied :)

Regards,
Gisle

Index: lib/HTML/Form.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTML/Form.pm,v
retrieving revision 1.38
diff -u -p -r1.38 Form.pm
--- lib/HTML/Form.pm	23 Oct 2003 19:11:32 -0000	1.38
+++ lib/HTML/Form.pm	9 Apr 2004 14:14:30 -0000
@@ -188,10 +188,11 @@ sub push_input
 	Carp::carp("Unknown input type '$type'") if $^W;
 	$class = "TextInput";
     }
-    $class = "IgnoreInput" if exists $attr->{disabled};
     $class = "HTML::Form::$class";
+    my @extra;
+    push(@extra, readonly => 1) if $type eq "hidden";
 
-    my $input = $class->new(type => $type, %$attr);
+    my $input = $class->new(type => $type, %$attr, @extra);
     $input->add_to_form($self);
 }
 
@@ -769,6 +770,41 @@ sub value_names {
     return
 }
 
+=item $bool = $input->readonly
+
+=item $input->readonly( $bool )
+
+This method is used to get/set the value of the readonly attribute.
+You are allowed to modify the value of readonly inputs, but setting
+the value will generate some noise when warnings are enabled.  Hidden
+fields always start out readonly.
+
+=cut
+
+sub readonly {
+    my $self = shift;
+    my $old = $self->{readonly};
+    $self->{readonly} = shift if @_;
+    $old;
+}
+
+=item $bool = $input->disabled
+
+=item $input->disabled( $bool )
+
+This method is used to get/set the value of the disabled attribute.
+Disabled inputs do not contribute any key/value pairs for the form
+value.
+
+=cut
+
+sub disabled {
+    my $self = shift;
+    my $old = $self->{disabled};
+    $self->{disabled} = shift if @_;
+    $old;
+}
+
 =item $input->form_name_value
 
 Returns a (possibly empty) list of key/value pairs that should be
@@ -781,6 +817,7 @@ sub form_name_value
     my $self = shift;
     my $name = $self->{'name'};
     return unless defined $name;
+    return if $self->{disabled};
     my $value = $self->value;
     return unless defined $value;
     return ($name => $value);
@@ -833,9 +870,8 @@ sub value
     my $old = $self->{value};
     $old = "" unless defined $old;
     if (@_) {
-	if (exists($self->{readonly}) || $self->{type} eq "hidden") {
-	    Carp::carp("Input '$self->{name}' is readonly") if $^W;
-	}
+	Carp::carp("Input '$self->{name}' is readonly")
+	    if $^W && $self->{readonly};
 	$self->{value} = shift;
     }
     $old;
@@ -1068,6 +1104,7 @@ sub form_name_value
     return unless $clicked;
     my $name = $self->{name};
     return unless defined $name;
+    return if $self->{disabled};
     return ($name.".x" => $clicked->[0],
 	    $name.".y" => $clicked->[1]
 	   );
@@ -1154,6 +1191,7 @@ sub form_name_value
 {
     my $self = shift;
     my $name = $self->name;
     return unless defined $name;
+    return if $self->{disabled};
     my $file = $self->file;
     my $filename = $self->filename;
Using Content-Location as base
Apologies if this is not an appropriate place to report issues with
libwww - in which case, if you could let me know a better address I'd
be very grateful.

I've noticed at least one case where $response->base does not match
what would be set by a normal web browser.  For the url

  http://www.stateline.org/stateline/

the HTTP headers returned are:

  HTTP/1.1 200 OK
  Date: Tue, 20 Jan 2004 16:28:28 GMT
  Server: Orion/1.5.2
  Content-Location: http://www.stateline.org:9090/jsp/staticSite/index2.jsp
  Set-Cookie: JSESSIONID=KPDJDBGMOFOL; Domain=.stateline.org; Path=/
  Cache-Control: private
  Connection: Close
  Content-Type: text/html
  Transfer-Encoding: chunked

From this, $response->base is set to
http://www.stateline.org:9090/jsp/staticSite/index2.jsp which means
any relative URIs start with http://www.stateline.org:9090/

Unfortunately the server is not listening on 9090 (or more likely it
is firewalled), so attempts to download any links fail.  Normal web
browsers do not set port 9090 in the base, so they can access links
and content without problem.

Trivial testlink script; run with "testlink
http://www.stateline.org/stateline/":

  #!/usr/pkg/bin/perl -wT
  use strict;
  use LWP;

  my $browser = LWP::UserAgent->new(agent => 'Mozilla/5.0');
  my $response = $browser->get($ARGV[0]);
  if ($response->is_success && $response->content_type eq 'text/html') {
      my $base = $response->base;
      my $data = $response->content;
      print "Base: $base\n";
      while ($data =~ s/.*?\b(src|link\b[^>]*\s+href)\s*=\s*"([^"]+)"//is) {
          my $link = URI->new_abs($2, $base);
          print "Link: $link\n";
      }
  }

Thanks

-- 
David Brownlee -- [EMAIL PROTECTED]
Re: multi part form posts
petersm <[EMAIL PROTECTED]> writes:

> I am new to LWP and WWW::Mech and have used them on a couple of
> projects.  I was wondering what is the best way to do a multipart
> post (file upload) using LWP.

From LWP you just do a post with something like:

  $ua->post('http://www.example.com',
      content_type => 'form-data',
      content      => [
          foo  => 1,
          file => ["foo.txt"],
      ],
  );

More details are available by reading the HTTP::Request::Common
manpage.

Regards,
Gisle
libwww-perl-5.79
Another release has been uploaded to CPAN with quite a few
enhancements from Ville, and then some HTTP::Headers hacks by me.

These are the changes since 5.78:

  HTML::Form now exposes the 'readonly' and 'disabled' attributes for
  inputs.  This allows your program to simulate JavaScript code that
  modifies these attributes.

  RFC 2616 says that an https: referer should not be sent with http:
  requests.  The lwp-rget program, the $req->referer method and the
  redirect handling code now try to enforce this.  Patch by Ville
  Skyttä [EMAIL PROTECTED].

  WWW::RobotRules now looks for the string found in robots.txt as a
  case insensitive substring of its own User-Agent string, not the
  other way around.  Patch by Ville Skyttä [EMAIL PROTECTED].

  HTTP::Headers: New method 'header_field_names' that returns a list
  of field names, as suggested by its name.

  HTTP::Headers: $h->remove_content_headers will now also remove the
  headers Allow, Expires and Last-Modified.  These are also part of
  the set that RFC 2616 denotes as Entity Header Fields.

  HTTP::Headers: $h->content_type is now more careful about removing
  embedded space in the returned value.  It also now returns all the
  parameters as the second return value, as documented.

  HTTP::Headers: $h->header() without arguments now croaks.  It used
  to silently do nothing.

  HTTP::Headers: Documentation tweaks.  Documented a few bugs
  discovered during testing.

  Typo fixes to the documentation all over the place by Ville Skyttä
  [EMAIL PROTECTED].

  Updated tests.

and since 5.78 was not really announced, these are the changes applied
to 5.77 to make it 5.78:

  Removed stray Data::Dump reference from the test suite.

  Added the parse(), clear(), parts() and add_part() methods to
  HTTP::Message.  The HTTP::MessageParts module of 5.77 is no more.

  Added clear() and remove_content_headers() methods to
  HTTP::Headers.

  The as_string() method of HTTP::Message now appends a newline if
  called without arguments and the non-empty content does not end
  with a newline.  This ensures better compatibility with 5.76 and
  older versions of libwww-perl.

  Use case insensitive lookup of hostnames in $ua->credentials.
  Patch by Andrew Pimlott [EMAIL PROTECTED].

Enjoy!

Regards,
Gisle
Re: getting webpage from different server than the url points to?
hubert depesz lubaczewski <[EMAIL PROTECTED]> writes:

> Charles C. Fu wrote:
>
>> If 10.2.1.7 complies even minimally with HTTP/1.1, then you can
>> force requests to be sent to it by setting 10.2.1.7 to be your
>> proxy server.  If limiting yourself to LWP::Simple, then the proxy
>> server is set through environment variables (e.g., set http_proxy
>> to http://10.2.1.7/).  See the LWP::UserAgent man page for more
>> details.
>
> i'm not limiting myself to anything.  right now i did it using plain
> sockets.  in fact i was not thinking about using the webserver as a
> proxy, and for some reason i find this idea rather unpleasant.  i
> would just like to be able to send the request someplace else -
> without all this proxy stuff.

There is basically nothing more to the proxy concept than the fact
that you send the request someplace else.

Another way of doing it is to plug in an alternative
LWP::Protocol::http module that for instance picks up the IP address
from a request header.  Or you can try this:

  local @LWP::Protocol::http::EXTRA_SOCK_OPTS = (PeerAddr => "10.2.1.7");
  print $ua->get("http://www.example.com/foo");

Regards,
Gisle
Re: HTML::TreeBuilder and lwp-request
Jacinta Richardson <[EMAIL PROTECTED]> writes:

> I've noticed that HTML::TreeBuilder is a subclass of HTML::Parser and
> that HTML::Parser is required by LWP, although it doesn't appear that
> HTML::TreeBuilder is.  I've recently noticed that the /usr/bin tools
> POST, GET, HEAD and lwp-request provided by LWP are dependent on the
> deprecated HTML::Parse module from the HTML::Tree package.  I presume
> that at some point LWP moved away from HTML::Parse but these tools
> were forgotten.  These tools fail to work in certain situations
> without this module being installed (with "HTML::Parse isn't in @INC"
> errors) but do not mention this dependency in their documentation.
>
> I bring this up because I had a question today from someone whose
> bash script worked perfectly up until he tried to pass the -o switch
> to lwp-request.  He didn't understand Perl and didn't understand the
> @INC error message.  I don't think he should have had to just to use
> this tool.
>
> I have a few questions:
>
> Is HTML::TreeBuilder only required for these tools or does it appear
> in other parts of the distribution?

This is the only place.

> Is there any specific design decision to leave HTML::TreeBuilder out
> of the list of required modules?

Just because we want to limit the number of dependencies.  It is
pretty obscure that additional modules are required if you use the -o
option of lwp-request.  If you know perl it should be pretty obvious
what is wrong if you fail to have the module installed.  Note that
extra HTML::Format* modules might also be needed by -o.

> Is there someone actively maintaining these tools who I should
> consult before patching them to not use HTML::Parse and to test
> (reporting failure reasonably) that modules exist before requiring
> them?

Send suggested patches to this mailing list.

> Is there a reason why all four of these files appear to be identical
> but they're not installed as hard links?

I have not been able to convince MakeMaker to do this.  At some point
we tried to install the GET, HEAD, POST aliases as symlinks, but it
never worked properly.

Regards,
Gisle
Re: HTML-Parser
matthew zip <[EMAIL PROTECTED]> writes:

> Having problems getting this module to work with my new Perl 5.8.4 on
> Linux.  I followed the instructions but when I attempt to use
> HTML::LinkExtor I get:
>
>   HTML::Parser object version 3.36 does not match bootstrap parameter
>   3.26 at /usr/lib/perl5/5.8.4/i686-linux/DynaLoader.pm line 253.
>
> Is this package compatible with Perl 5.8?

It sure is.  Your installation seems to be mixing incompatible
versions of the HTML/Parser.so and HTML/Parser.pm files.  This should
not happen if you let 'make install' install the module.  I would try
to reinstall HTML-Parser.

Regards,
Gisle
Re: :mechanize issues/mechanize.pm dies!!
Darrell Gammill [EMAIL PROTECTED] writes: Look back at the output of 'print $b->current_form()->dump();' Do you see where the option for 'Anthropology' appears by itself? This is because the HTML is not being parsed right. The following line seems to be the offender: <option value=ANT Name=Anthropology>Anthropology</option> The 'Name' attribute seems to be confusing the form parser so Anthropology is not one of the available options. I don't believe that this can confuse HTML::Form. It does not care about the Name attribute at all. Care to explain better what you think happens here? Regards, Gisle Aas
Re: :mechanize issues/mechanize.pm dies!!
Darrell Gammill [EMAIL PROTECTED] writes: The 'Anthropology' option is being interpreted as its own separate input rather than part of the 'u_input' input. To test this, I used the section of code below with the results right after it. Thanks for the test case. This is a bug in HTML::Form. The 'name' from the option tag overrides the 'name' from the select tag when it should not. We also get in trouble with (illegal) option attributes like 'disabled', 'multiple', 'type' etc. The following patch fixes these problems. It will be in the next libwww-perl. Regards, Gisle

Index: Form.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTML/Form.pm,v
retrieving revision 1.39
diff -u -p -r1.39 Form.pm
--- Form.pm	9 Apr 2004 14:17:32 -0000	1.39
+++ Form.pm	3 Jun 2004 09:13:44 -0000
@@ -136,15 +136,26 @@ sub parse
 	    $f->push_input("textarea", $attr);
 	}
 	elsif ($tag eq "select") {
-	    $attr->{select_value} = $attr->{value}
-		if exists $attr->{value};
+	    # rename attributes reserved to come for the option tag
+	    for ("value", "value_name") {
+		$attr->{"select_$_"} = delete $attr->{$_}
+		    if exists $attr->{$_};
+	    }
 	    while ($t = $p->get_tag) {
 		my $tag = shift @$t;
 		last if $tag eq "/select";
 		next if $tag =~ m,/?optgroup,;
 		next if $tag eq "/option";
 		if ($tag eq "option") {
-		    my %a = (%$attr, %{$t->[0]});
+		    my %a = %{$t->[0]};
+		    # rename keys so they don't clash with %attr
+		    for (keys %a) {
+			next if $_ eq "value";
+			$a{"option_$_"} = delete $a{$_};
+		    }
+		    while (my($k,$v) = each %$attr) {
+			$a{$k} = $v;
+		    }
 		    $a{value_name} = $p->get_trimmed_text;
 		    $a{value} = delete $a{value_name}
			unless defined $a{value};
@@ -192,6 +203,7 @@ sub push_input
     my @extra;
     push(@extra, readonly => 1) if $type eq "hidden";
+    delete $attr->{type};  # don't confuse the "type" argument
     my $input = $class->new(type => $type, %$attr, @extra);
     $input->add_to_form($self);
 }
@@ -913,9 +925,9 @@ sub new
     }
     else {
	$self->{menu} = [$value];
-	my $checked = exists $self->{checked} || exists $self->{selected};
+	my $checked = exists $self->{checked} || exists $self->{option_selected};
	delete $self->{checked};
-	delete $self->{selected};
+	delete $self->{option_selected};
	if (exists $self->{multiple}) {
	    unshift(@{$self->{menu}}, undef);
	    $self->{value_names} = ["off", $value_name];
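The core of the fix above is a key-renaming trick: attributes parsed from an inner option tag get a distinguishing prefix so they can never clobber attributes of the enclosing select. Here is that idea as a stand-alone sketch; the `prefix_keys` helper is illustrative and not part of HTML::Form:

```perl
use strict;
use warnings;

# Illustrative helper: prefix every key except "value" so option
# attributes (name, disabled, ...) cannot clash with select attributes.
sub prefix_keys {
    my ($prefix, %attr) = @_;
    my %out;
    for my $k (keys %attr) {
        my $new = $k eq "value" ? $k : "${prefix}_$k";
        $out{$new} = $attr{$k};
    }
    return %out;
}

my %select_attr = (name => "u_input");
my %option_attr = prefix_keys("option", name => "Anthropology", value => "ANT");

# Merge so the select attributes win, while the renamed option
# attributes survive without overwriting anything.
my %merged = (%option_attr, %select_attr);
# %merged: name => "u_input", option_name => "Anthropology", value => "ANT"
```

This mirrors why the buggy `my %a = (%$attr, %{$t->[0]})` lost the select's 'name': the raw option attributes were merged last and overwrote it.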
Re: [PATCH] Make URI::sip honor the new_abs(), abs(), rel() contract
Ville Skyttä [EMAIL PROTECTED] writes: URI::sip(s) does not honor the URI API contract of returning the original URI if it cannot be made absolute in new_abs() or abs(), or relative in rel(). Fix along with a couple of test cases attached. Applied. Thanks! Regards, Gisle
Re: libwww-perl: Patch to support not sending Content-Length...
Matt Christian [EMAIL PROTECTED] writes: This patch kills the Content-Length both for the request itself and for the multipart/* parts. I think only the latter is what you really want. I think the better fix is to simply remove the Content-Length for the parts. There is probably nothing that really requires them, even though they ought to be harmless. Yes, that was on purpose. The broken web server I need to interact with doesn't understand Content-Length for the request or multipart/* parts. If I send *any* Content-Length headers, it dies with a 5xx error. But the Content-Length header for the request itself will be added by the protocol handler if the request does not have any. It means that not adding the Content-Length to the request itself should make no difference for the server. I really don't want to introduce yet another ugly global. I don't like the ugly global either so I'm open to suggestions on how to better handle it. Maybe add another option to LWP::UserAgent->new(%options)? Would that be preferred? No. POST() can be called without using LWP::UserAgent at all. What are the chances of a (possibly modified) version of my patch making it into libwww-perl proper? I'm open to suggestions... I'm willing to apply the following patch if you can confirm that it fixes your problem.

Index: lib/HTTP/Request/Common.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Request/Common.pm,v
retrieving revision 1.22
diff -u -p -r1.22 Common.pm
--- lib/HTTP/Request/Common.pm	23 Oct 2003 19:11:32 -0000	1.22
+++ lib/HTTP/Request/Common.pm	3 Jun 2004 13:31:05 -0000
@@ -152,7 +152,6 @@ sub form_data	# RFC1867
 	    local($/) = undef; # slurp files
 	    $content = <$fh>;
 	    close($fh);
-	    $h->header("Content-Length" => length($content));
 	}
 	unless ($ct) {
 	    require LWP::MediaTypes;

Regards, Gisle
Re: [patch] HTTP::Message-is_multipart
Joshua Hoblitt [EMAIL PROTECTED] writes: After writing this bit of ugly code...

    if ( $res->can( 'parts' ) ) {
        die "multipart messages are not supported"
            unless scalar @{[ $res->parts ]} <= 1;
    }

I decided that an is_multipart method might be handy. Would anyone else find this functionality useful? I don't really like it. I would have expected a method like $res->is_multipart to actually test for $res->content_type =~ m,^multipart/,. Seems like I should have made 'parts' return the number of parts in scalar context instead of the first one. That would be more useful here. To stay compatible it seems like the best route is to add a method called 'num_parts', but it is not clear to me why you want to handle multipart messages with one part but not those with more. If the need for testing the number of parts is not a common use case I think it is better to leave this method out. Another approach for you is to simply put this sub into your app:

    sub HTTP::Message::have_many_parts {
        my $self = shift;
        return 0 unless $self->can('parts');
        return @{[ $self->parts ]} > 1;
    }

and then you can write:

    die "multipart messages are not supported" if $res->have_many_parts;

Regards, Gisle
Re: [patch] HTTP::Message-is_multipart
Joshua Hoblitt [EMAIL PROTECTED] writes: Seems like I should have made 'parts' return the number of parts in scalar context instead of the first one. That would be more useful here. To stay compatible it seems like the best route is to add a method called 'num_parts', but it is not clear to me why you want to handle multipart messages with one part but not those with more. If the need for testing the number of parts is not a common use case I think it is better to leave this method out. I had some discussion about this on freenode/#perl before submitting the patch. Everyone asked why the parts count wasn't returned in scalar context. :) I had wanted to maintain backwards compatibility but, now that I think about it, I doubt many are using that method in scalar context. Why don't you just fix the behavior now? Because I don't know if anybody is using that method in scalar context and I don't want to break published APIs. It might be unlikely that any code actually breaks, but the benefit of doing this change is also very small. Why isn't the parts count returned in scalar context? Because I felt that it would be more useful to not have to force array context when you want to extract the single part of a 'message/*' message:

    if ($res->content_type =~ m,^message/,) {
        if (my $part = $res->parts) {
            # do something with the part
            ...
        }
    }

and if there was a strong demand for getting the number of parts we could always add a method for that purpose. If I redid this now I think I would make 'parts' return the number and then add a 'part' method (without the 's') that always returns the first part regardless of context. Regards, Gisle
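The context-sensitivity being debated here is a standard Perl idiom built on returning an array, which yields a list in list context and its element count in scalar context. A minimal sketch of a parts-like accessor that behaves the way Gisle says he would design it today (the Demo::Message package is made up for illustration; the real HTTP::Message returns the first part in scalar context instead):

```perl
use strict;
use warnings;

package Demo::Message;

sub new {
    my ($class, @parts) = @_;
    return bless { parts => [@parts] }, $class;
}

# Returning the array gives a list of parts in list context and,
# by Perl's usual rules, the part count in scalar context.
sub parts {
    my $self = shift;
    return @{ $self->{parts} };
}

package main;

my $msg = Demo::Message->new("part1", "part2");
my @all   = $msg->parts;   # the parts themselves
my $count = $msg->parts;   # how many parts there are
```

The backwards-compatibility worry in the thread is exactly this: once a method is published with one scalar-context meaning, changing it silently changes callers like `my $count` above.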
Re: HTTP::Message, setting content with a ref
Joshua Hoblitt [EMAIL PROTECTED] writes: I would like the ability to set the content of an HTTP::Message object by passing in a ref to a scalar. This would save one copy of the content in memory, which can be significant for large messages. This would require some re-plumbing so that $mess->{_content} becomes a ref to a scalar (instead of a scalar) and the addition of a mutator, e.g. $mess->set_content_ref. It's too bad that lvalues are still problematic. Comments? The LWP API does not use set_ methods or lvalues. The value of the content_ref attribute would be updated if you pass an argument to the method. I think this is a good idea since we already have the content_ref method. I tried to implement it too since I thought it would be trivial. The change got a lot bigger than trivial before I was happy with how this interacted with the 'parts*' methods. This is the patch I ended up with. It is likely to be part of the next LWP release. Regards, Gisle

Index: lib/HTTP/Message.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Message.pm,v
retrieving revision 1.42
diff -u -p -r1.42 Message.pm
--- lib/HTTP/Message.pm	9 Apr 2004 15:07:04 -0000	1.42
+++ lib/HTTP/Message.pm	9 Jun 2004 10:53:50 -0000
@@ -75,7 +75,7 @@ sub clone
 sub clear {
     my $self = shift;
     $self->{_headers}->clear;
-    $self->{_content} = "";
+    $self->content("");
     delete $self->{_parts};
     return;
 }
@@ -84,16 +84,33 @@ sub clear {
 sub protocol { shift->_elem('_protocol', @_); }

 sub content  {
-    my $self = shift;
-    if (defined(wantarray) && !exists $self->{_content}) {
-	$self->_content;
+    my $self = $_[0];
+    if (defined(wantarray)) {
+	$self->_content unless exists $self->{_content};
+	my $old = $self->{_content};
+	&_set_content if @_ > 1;
+	$old = $$old if ref($old) eq "SCALAR";
+	return $old;
     }
-    my $old = $self->{_content};
-    if (@_) {
-	$self->{_content} = shift;
-	delete $self->{_parts};
+
+    if (@_ > 1) {
+	&_set_content;
+    }
+    else {
+	Carp::carp("Useless content call in void context") if $^W;
     }
-    $old;
+}
+
+sub _set_content {
+    my $self = $_[0];
+    if (ref($self->{_content}) eq "SCALAR") {
+	${$self->{_content}} = $_[1];
+    }
+    else {
+	$self->{_content} = $_[1];
+    }
+    delete $self->{_parts} unless $_[2];
 }
@@ -101,11 +118,18 @@ sub add_content
 {
     my $self = shift;
     $self->_content unless exists $self->{_content};
-    if (ref($_[0])) {
-	$self->{'_content'} .= ${$_[0]};  # for backwards compatability
+    my $chunkref = \$_[0];
+    $chunkref = $$chunkref if ref($$chunkref);  # legacy
+
+    my $ref = ref($self->{_content});
+    if (!$ref) {
+	$self->{_content} .= $$chunkref;
+    }
+    elsif ($ref eq "SCALAR") {
+	${$self->{_content}} .= $$chunkref;
     }
     else {
-	$self->{'_content'} .= $_[0];
+	Carp::croak("Can't append to $ref content");
     }
     delete $self->{_parts};
 }
@@ -116,7 +140,14 @@ sub content_ref
     my $self = shift;
     $self->_content unless exists $self->{_content};
     delete $self->{_parts};
-    \$self->{'_content'};
+    my $old = \$self->{_content};
+    $old = $$old if ref($$old);
+    if (@_) {
+	my $new = shift;
+	Carp::croak("Setting content_ref to a non-ref") unless ref($new);
+	$self->{_content} = $new;
+    }
+    return $old;
 }
@@ -144,7 +175,7 @@ sub headers_as_string { shift->{'_heade
 sub parts {
     my $self = shift;
-    if (defined(wantarray) && !exists $self->{_parts}) {
+    if (defined(wantarray) && (!exists $self->{_parts} || ref($self->{_content}) eq "SCALAR")) {
	$self->_parts;
     }
     my $old = $self->{_parts};
@@ -160,7 +191,7 @@ sub parts {
	    $self->content_type("multipart/mixed");
	}
	$self->{_parts} = [@_];
-	delete $self->{_content};
+	_stale_content($self);
     }
     return @$old if wantarray;
     return $old->[0];
@@ -174,15 +205,27 @@ sub add_part {
	$self->content_type("multipart/mixed");
	$self->{_parts} = [$p];
     }
-    elsif (!exists $self->{_parts}) {
+    elsif (!exists $self->{_parts} || ref($self->{_content}) eq "SCALAR") {
	$self->_parts;
     }
     push(@{$self->{_parts}}, @_);
-    delete $self->{_content};
+    _stale_content($self);
     return;
 }

+sub _stale_content {
+    my $self = shift;
+    if (ref($self->{_content}) eq "SCALAR") {
+	# must recalculate now
+	$self->_content;
+    }
+    else {
+	# just invalidate cache
+	delete $self->{_content};
+    }
+}
+
 # delegate all other method calls the the _headers object.
 sub AUTOLOAD
@@ -219,7 +262,7 @@ sub _parts {
	die "Assert" unless @h;
	my %h = @{$h[0]};
	if (defined(my $b = $h{boundary})) {
-	    my $str = $self->{_content};
+	    my $str = $self->content;
	    $str =~ s/\r?\n--\Q$b\E--\r?\n.*//s;
	    if ($str =~ s
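The central trick in the patch above is storing content either as a plain scalar or as a reference to a caller-owned scalar, dereferencing on read so callers never notice the difference. A reduced stand-alone sketch of that storage scheme (the Demo::Store package is illustrative, not the HTTP::Message implementation):

```perl
use strict;
use warnings;

package Demo::Store;

sub new { return bless { content => "" }, shift }

# Read/write accessor that transparently handles both plain and
# reference-backed storage, as the patched content() method does.
sub content {
    my $self = shift;
    if (@_) {
        if (ref($self->{content}) eq "SCALAR") {
            ${ $self->{content} } = shift;   # write through to external scalar
        }
        else {
            $self->{content} = shift;
        }
    }
    my $c = $self->{content};
    return ref($c) eq "SCALAR" ? $$c : $c;  # deref on read
}

# Point the object at caller-owned storage; no copy of the data is made.
sub content_ref {
    my ($self, $ref) = @_;
    $self->{content} = $ref;
}

package main;

my $big   = "x" x 10;          # stands in for a large message body
my $store = Demo::Store->new;
$store->content_ref(\$big);    # share the caller's scalar
$store->content("replaced");   # updates $big in place, still no copy
```

The extra complexity in the real patch comes from keeping this invisible to `add_content` and the `parts*` cache, which the sketch deliberately leaves out.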
Re: Patch to support --full-time in File::Listing
Christopher J. Madsen [EMAIL PROTECTED] writes: Attached is a patch against LWP 5.79 to allow File::Listing to interpret the output of GNU ls's --full-time option. This allows you to get timestamps accurate to the second, instead of the minute-based ones you get with a normal ls -l. The patch did not apply here. Are you patching from a pristine 5.79?

    [EMAIL PROTECTED] lwp5]$ patch -p0 < full-time.patch
    patching file lib/File/Listing.pm
    Hunk #2 FAILED at 372.
    Hunk #3 FAILED at 1.
    Hunk #4 FAILED at 84.
    3 out of 4 hunks FAILED -- saving rejects to file lib/File/Listing.pm.rej

Anyway, this is how --full-time comes out here (Red Hat 9). It does not appear to be the same format you try to parse.

    [EMAIL PROTECTED] lwp5]$ ls -l --full-time
    total 368
    -rw-rw-r--    1 gisle    gisle        3800 2004-04-07 12:44:47.0 +0200 AUTHORS
    drwxrwxr-x    3 gisle    gisle        4096 2004-06-14 14:59:56.0 +0200 bin
    drwxrwxr-x    7 gisle    gisle        4096 2004-06-14 14:59:58.0 +0200 blib
    -rw-rw-r--    1 gisle    gisle       83867 2004-06-14 19:30:48.0 +0200 Changes
    [EMAIL PROTECTED] lwp5]$ ls --version
    ls (coreutils) 4.5.3
    Written by Richard Stallman and David MacKenzie.
    Copyright (C) 2002 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Regards, Gisle I believe it also handles BSD ls's -T option, but I don't have a BSD system to test. I'm just working off the OpenBSD manpage. The new time formats are recognized automatically; you just call parse_dir like you normally would.
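Parsing the full-time format shown in the transcript above takes only a regex plus the core Time::Local module. This is a simplified sketch of what such a patch has to do, not the File::Listing code; the helper name and the exact regex are illustrative:

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Sketch: convert a "2004-04-07 12:44:47.0 +0200" style stamp, as
# printed by this version of GNU ls --full-time, to epoch seconds.
sub full_time_to_epoch {
    my $stamp = shift;
    my ($Y, $M, $D, $h, $m, $s, $sign, $oh, $om) = $stamp =~
        /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d)(?:\.\d+)? ([+-])(\d\d)(\d\d)$/
        or return undef;
    # Treat the stamp as UTC first, then remove the printed offset.
    my $t   = timegm($s, $m, $h, $D, $M - 1, $Y);
    my $off = ($oh * 3600 + $om * 60) * ($sign eq '-' ? -1 : 1);
    return $t - $off;   # local stamp minus its offset = UTC epoch
}
```

The optional fractional-seconds group lets the same regex cope with coreutils versions that print nine fractional digits instead of one.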
Re: lwp-request patch to display response body on error
Lucas Gonze [EMAIL PROTECTED] writes: The lwp-request, GET, HEAD, POST - Simple WWW user agent utilities never display the response body if the response code is an error. For RESTful web services this suppresses potential debug information. You don't state what version of LWP you are using, but libwww-perl-5.71 (2003-10-14) had this fix: lwp-request now prints unsuccessful responses in the same way as successful ones. The status will still indicate failures. Based on a patch by Steve Hay [EMAIL PROTECTED]. Didn't that address this concern? Regards, Gisle Background: I am writing an API for my web app; documentation (out of date but enough to get the gist) on what I am doing is at http://webjay.org/help/api. The client is expected to be a program, not a browser, so I use response status codes to communicate specifics about errors and the response body to communicate useful debugging hints. A typical error response is:

    HTTP/1.1 409 Conflict
    Content-Type: text/plain

    There is already a playlist with this title.

However, requests made using lwp-request never display the response body if there is an error. lwp-request does this:

    if ($response->is_success) {
        ...
    }
    else {
        print STDERR $response->error_as_HTML unless $options{'d'};
    }

And that turns into the boilerplate HTML in HTTP/Response.pm:

    sub error_as_HTML
    {
        my $self = shift;
        my $title = 'An Error Occurred';
        my $body  = $self->status_line;
        return <<EOM;
    <HTML>
    <HEAD><TITLE>$title</TITLE></HEAD>
    <BODY>
    <H1>$title</H1>
    $body
    </BODY>
    </HTML>
    EOM
    }

I am expecting clients to be shell scripts using the lwp-request utilities, so it's important for the debug messages to be displayed. The fix: in GET, I have added a -D flag to display the response body even if there is an error. This seemed like a good cognate next to -d, which always suppresses the response body.
Here is the patch, diff'd against my local copy of GET, which may not be the most recent:

    bash-2.05a$ diff /usr/bin/GET GET
    282a283
    >  'D', # LG patch -- display response body even on error
    477a479,482
    >     # LG patch to support my added -D flag
    >     if( $options{'D'} ){
    >         print STDERR $response->content unless $options{'d'};
    >     } else {
    479a485
    >     }

- Lucas Gonze
libwww-perl-5.800
A brand new libwww-perl release should be out on CPAN now. In fear of running out of version numbers less than 5.9 I've added one more digit. I want to reserve 5.9 for betas for 6.0 if that should ever happen. The next release will be 5.801, so this scheme should keep us going for a while. The changes since 5.79 are: HTML::Form will allow individual menu entries to be disabled. This was needed to support <input type=radio disabled value=foo> and <select><option disabled>foo. HTML::Form now avoids name clashes between the select and option attributes. HTML::Form now implicitly closes <select> elements when it sees another <input> or </form>. This is closer to the MSIE behaviour. HTML::Form will now support <keygen> inputs. It will not calculate a key by itself. The user will have to set its value for it to be returned by the form. HTTP::Headers now special-cases field names that start with a ':'. This is used as an escape mechanism when you need the header names to not go through canonicalization. It means that you can force LWP to use a specific casing and even underscores in header names. The ugly $TRANSLATE_UNDERSCORE global has been undocumented as a result of this. HTTP::Message will now allow an external 'content_ref' to be set. This can for instance be used to let HTTP::Request objects pick up content data from some scalar variable without having to copy it. HTTP::Request::Common: the individual parts will no longer have a Content-Length header for file uploads. This improves compatibility with normal browsers. LWP::Simple doc patch for getprint. Contributed by Yitzchak Scott-Thoennes [EMAIL PROTECTED]. LWP::UserAgent: New methods default_header() and default_headers(). These can be used to set up headers that are automatically added to requests as they are sent. This can for instance be used to initialize various Accept headers. Various typo fixes by Ville Skyttä [EMAIL PROTECTED]. Fixed test failure under perl-5.005.
LWP::Protocol::loopback: This is a new protocol handler that works like the HTTP TRACE method, it will return the request provided to it. This is sometimes useful for testing. It can for instance be invoked by setting the 'http_proxy' environment variable to 'loopback:'. Enjoy! Regards, Gisle
Re: libwww-perl-5.800
\(William\) Wenjie Wang [EMAIL PROTECTED] writes:

    Failed Test           Stat Wstat Total Fail  Failed  List of Failed
    -------------------------------------------------------------------
    live/activestate.t     255 65280     2    3 150.00%  1-2
    live/jigsaw-auth-b.t                 3    3 100.00%  1-3
    live/jigsaw-auth-d.t                 1    1 100.00%  1
    live/jigsaw-chunk.t      9  2304     5    8 160.00%  1-5
    live/jigsaw-md5-get.t                2    2 100.00%  1-2
    live/jigsaw-md5.t                    2    2 100.00%  1-2
    live/jigsaw-neg-get.t                1    1 100.00%  1
    live/jigsaw-neg.t                    1    1 100.00%  1
    live/validator.t         2   512     2    4 200.00%  1-2
    Failed 9/41 test scripts, 78.05% okay. 19/761 subtests failed, 97.50% okay.
    NMAKE : fatal error U1077: 'C:\Perl\bin\perl.exe' : return code '0x2'
    Stop.

I think this must be a local problem at your site. Is the machine you're testing from properly connected to the Internet? Do you have to go through some proxy? Regards, Gisle
Re: libwww-perl-5.800
I would be grateful if you had the time to figure out why the tests fail and perhaps even propose patches to work around the issue. There might be simple tweaks that can be done to them to make them work in your environment. You might run tests individually like this:

    cd libwww-perl-5.800
    perl -Ilib t/live/jigsaw-md5.t

--Gisle
Re: a suggestion for URI or URI::Heuristic
[EMAIL PROTECTED] writes: How about this for the next version of URI:

    URI->new("%68ttp://www.example.com/")->canonical eq "http://www.example.com/";

Why? This appears just wrong. RFC 2396 does not allow escapes in the scheme part. Is this used out in the wild? Regards, Gisle
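Gisle's objection can be checked directly against the grammar: RFC 2396 defines the scheme as an alpha followed by alphanumerics and "+", "-", ".", with no room for percent-escapes. The validator below is a sketch of that rule, not part of the URI module:

```perl
use strict;
use warnings;

# RFC 2396: scheme = alpha *( alpha | digit | "+" | "-" | "." )
# Percent-escapes are not in that set, so "%68ttp" cannot be a scheme.
sub valid_scheme {
    my $scheme = shift;
    return $scheme =~ /^[A-Za-z][A-Za-z0-9+\-.]*\z/ ? 1 : 0;
}
```

So decoding `%68` into `h` before scheme matching, as the proposal implies, would mean accepting strings that are not URIs at all under the grammar.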
Re: support for multiple outgoing IPs
Jeff 'japhy' Pinyan [EMAIL PROTECTED] writes: I'm going to release these subclasses, but I'd like to know if the libwww suite can perhaps be rewritten in the future to allow for this type of thing... It is already sort of supported. You can set the outgoing address by tweaking the @LWP::Protocol::http::EXTRA_SOCK_OPTS. What is your suggested change to support this? Regards, Gisle
Re: support for multiple outgoing IPs
Jeff 'japhy' Pinyan [EMAIL PROTECTED] writes: On Jul 13, Gisle Aas said: Jeff 'japhy' Pinyan [EMAIL PROTECTED] writes: I'm going to release these subclasses, but I'd like to know if the libwww suite can perhaps be rewritten in the future to allow for this type of thing... It is already sort of supported. You can set the outgoing address by tweaking the @LWP::Protocol::http::EXTRA_SOCK_OPTS. What is your suggested change to support this? (That's not in the FTP protocol module, by the way...) I know :( I see that, and I'm using it in my subclass, but it's a matter of getting the stuff *to* EXTRA_SOCK_OPTS. The data (the array of IPs to use) shouldn't necessarily belong to the LWP::Protocol::http subclass; I'd expect it to belong to the LWP::UserAgent object, or in this case, the HTTP::Proxy object. Either that or we could attach it to the request object. Attaching it to the request gives more flexibility and it could potentially be defaulted from the $ua->default_header settings. And I haven't found a way to create my own LWP::Protocol::http subclass that is used instead of the original one. That's why I had to subclass LWP::UserAgent and LWP::Protocol as well. You should be able to override protocol handlers with:

    LWP::Protocol::implementor(http => "MyClass");

The server I did my work on is currently down, but I'll provide my code tomorrow. Ok, I'll take a look then. Regards, Gisle
Re: Patch to Form.pm to recognize button type=submit
Michael Alan Dorman [EMAIL PROTECTED] writes: Because the <input type=button> tag doesn't allow text other than the value to be displayed in the button, I've had to start using the <button> tag on some of my pages. Imagine my dismay when this caused WWW::Mechanize to no longer recognize that my form had buttons! Nobody complained before so I guess they are not used much. After poking around WWW::Mechanize for a bit, I was led to HTML::Form, which doesn't currently recognize <button> tags. This patch certainly fixes the issue I was having, and I think it represents a generally applicable enhancement. I'd love to see it included in the next drop of libwww-perl. Looks good. Would be even better if you also updated the test suite. I'll get it included in the next release. Currently I have problems accessing SourceForge so I'm not able to get it checked in.

--- Form.pm.orig	2004-06-16 06:41:23.000000000 -0400
+++ Form.pm	2004-07-19 11:03:31.000000000 -0400
@@ -96,7 +96,7 @@
     my $p = HTML::TokeParser->new(ref($html) ? $html->content_ref : \$html);
     eval {
	# optimization
-	$p->report_tags(qw(form input textarea select optgroup option keygen));
+	$p->report_tags(qw(button form input textarea select optgroup option keygen));
     };
     unless (defined $base_uri) {
@@ -130,6 +130,11 @@
	    $attr->{value_name} = $p->get_phrase;
	    $f->push_input($type, $attr);
	}
+	elsif ($tag eq "button") {
+	    my $type = delete $attr->{type} || "submit";
+	    $attr->{value_name} = $p->get_phrase;
+	    $f->push_input($type, $attr);

I don't think we should support <button type=checkbox> and similar so I suggest we only push the input if the $type is "submit" at this point.

+	}
	elsif ($tag eq "textarea") {
	    $attr->{textarea_value} = $attr->{value}
		if exists $attr->{value};

Regards, Gisle
Re: Problem uploading large files with PUT
Rodrigo Ruiz [EMAIL PROTECTED] writes: Yesterday I updated my LWP module from version 5.75 to the current 5.8 version. From this update, one of my scripts has stopped working. Oops! Sorry. The script creates a PUT request, specifying a subroutine as the content, for dynamic content retrieval. The original code does:

    my $req = HTTP::Request->new(PUT => $url, $header, $readFunc);

But now it dies with a "Not a SCALAR reference" error. I tried to reproduce this error but it did not happen for me. Are you able to provide a complete little program that demonstrates this failure? I have been debugging the LWP code, and I have found the following workaround:

    my $req = HTTP::Request->new(PUT => $url, $header, \$readFunc);

That is, pass the function reference, by reference. Unfortunately, this change makes my script fail with older LWP versions. My questions are: Is there a more elegant workaround that does not break compatibility with older LWP versions? You could always do:

    $LWP::VERSION < 5.800 ? $readFunc : \$readFunc

but I'd rather fix this problem in 5.801. A test case that reproduces this would be very helpful. If not, and I put these two lines in an if-else sentence, comparing the $LWP::VERSION value with a threshold, which exact version should I compare to? This change went into 5.800. I'm quite sure it must be the culprit:

| HTTP::Message will now allow an external 'content_ref'
| to be set. This can for instance be used to let HTTP::Request
| objects pick up content data from some scalar variable without
| having to copy it.

Regards, Gisle
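A dynamic-content subroutine like $readFunc is a closure that returns the next chunk of the body on each call and an empty string when exhausted, which is the end-of-data convention LWP's dynamic-content support documents. Here is a self-contained sketch with an in-memory source standing in for a file; the make_reader helper is illustrative:

```perl
use strict;
use warnings;

# Sketch of a dynamic-content provider: a closure that returns the next
# chunk on each call and "" when the data is exhausted.
sub make_reader {
    my ($data, $chunk_size) = @_;
    my $pos = 0;
    return sub {
        return "" if $pos >= length $data;
        my $chunk = substr($data, $pos, $chunk_size);
        $pos += length $chunk;
        return $chunk;
    };
}

my $readFunc = make_reader("some large body" x 3, 8);
# With LWP before 5.800 this was passed as $readFunc; the regression
# discussed above made 5.800 want \$readFunc instead.

# Drain the reader the way a protocol handler would:
my $reassembled = "";
while ((my $chunk = $readFunc->()) ne "") {
    $reassembled .= $chunk;
}
```

Because the closure carries its own position, the same request object can only be sent once unless the reader is recreated, which is worth keeping in mind when retrying requests.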
Re: LWP
DePriest, Mitch [EMAIL PROTECTED] writes: Will LWP::Simple run an activestate Not really sure what you are asking about here, but LWP is part of the standard ActivePerl distribution from ActiveState. This includes the LWP::Simple module. A system with ActivePerl will always have LWP::Simple. Regards, Gisle Aas, ActiveState
Re: Simulate HTTP transactions
William McKee [EMAIL PROTECTED] writes: On Thu, Aug 19, 2004 at 02:07:27PM -0700, Jaime Rodriguez wrote: Somebody told me that in Perl, this is easily handled with the libwww-perl module. I have never used Perl and I wonder if you know what I'm talking about and even if you can help me with some guidance on how to do it. You could do this with LWP but there are at least a couple of helper modules that will make your life easier: HTTP::WebTest WWW::Mechanize But if he has never used Perl then it might be a good idea to learn that first. Reading books could be a way to do that. --Gisle
Re: <script><script> bug in HTML::TokeParser?
ashley [EMAIL PROTECTED] writes: Hey everyone. I think this is the right list to bring this up, please forgive me if I'm wrong. This list should be right. While writing a simple HTML validator / forbidden tag stripper, I came across what might be a problem, though it might be expected and appropriate behavior; I thought I'd better bring it up. A <script> following a <script> is interpreted as text. This is expected behaviour. After <script> no tags are recognized until </script> is seen. Everything in between is reported as text and should be passed to whatever is able to parse the script if you're interested in it. The same behaviour occurs for <style>, <textarea> and <xmp>. I realize that the actual script is text but maybe it should be loaded into PI (or D or C) instead of T? If not, plain stripping of the HTML leaves a potentially problematic situation. How? If you strip all script text there should not be a problem. Regards, Gisle Demo of the problem is below. Thank you for looking! -Ashley

    use HTML::TokeParser;
    use Data::Dumper;
    $Data::Dumper::Terse = 1;
    my $text = join '', <DATA>;
    my $p = HTML::TokeParser->new( \$text );
    while ( my $token = $p->get_token ) {
        print Dumper $token;
    }
    __DATA__
    This is my spurious or malicious html<script><script>alert('boo!')</script>
Re: is_success() returning true even though server was down
James Cloos [EMAIL PROTECTED] writes: I have some code that does:

    my $req = HTTP::Request->new(GET => "http://$foo/bar");
    my $res = $ua->request($req);
    push @good, $foo if ($res->is_success);

in a loop. I tested that is_success did the right thing if the file bar was not in the server's $SERVER_ROOT, and I presumed it would return false if $foo was not up. But in fact, $res->_rc is 200 when the remote box is down just like when the file bar exists. Tested on gentoo w/ latest ebuilds of perl and libwww, and freebsd 5.2.1 w/ their ports. Why does _rc == 200 when there was no reply from the server? I've never seen that happen. Can you provide me with the full $res->as_string output in this case? It might also be instructive to strace the client as it runs to see what happens at the syscall level. If you get a 200 response it must mean that the connection to the server succeeded. I presume part of the problem is that it appears to be sending an HTTP 0.9 GET rather than a 1.0 GET. I don't see anything in the docs about forcing the latter. How is that done? LWP always sends HTTP/1.1 GETs. Or should I do a HEAD instead of a GET, given that I'm only testing for the existence of the file and the network connection between the two boxen? The HEAD might be cheaper, but not all servers implement it for all resources. Regards, Gisle
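For reference, is_success is a pure status-class test; the sketch below re-states that logic (it is how HTTP::Status classifies codes, written out here for illustration rather than quoted from the module). When LWP cannot reach a server at all it normally synthesizes an error response itself, so a genuine connection failure should never look like a 200:

```perl
use strict;
use warnings;

# Status-class checks behind methods like $res->is_success and
# $res->is_error: HTTP groups codes by their hundreds digit.
sub is_success_code { my $c = shift; return ($c >= 200 && $c < 300) ? 1 : 0 }
sub is_error_code   { my $c = shift; return ($c >= 400 && $c < 600) ? 1 : 0 }

# A "server down" case should therefore classify as an error, never
# as success, which is what makes the reported 200 so suspicious.
my @probes = (200, 404, 500);
my @good   = grep { is_success_code($_) } @probes;
```

This is why Gisle asks for the full as_string output: the status line there shows whether the 200 really came over the wire or something in between (a proxy, a captive portal) answered on the dead server's behalf.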
Re: URI::file not RFC 1738 compliant?
Ville Skyttä [EMAIL PROTECTED] writes: As far as I can tell, RFC 1738, section 3.10, as well as the BNF in section 5, explicitly say that a file: URI must have two forward slashes before the optional hostname, followed by another forward slash, and then the path. RFC 1738 is becoming a bit stale. I do believe that the intent is for 'file' URIs to also follow the RFC 2396 syntax for hierarchical namespaces, which clearly states that the 'authority' is optional.

    absoluteURI = scheme ":" ( hier_part | opaque_part )
    hier_part   = ( net_path | abs_path ) [ "?" query ]
    net_path    = "//" authority [ abs_path ]
    abs_path    = "/"  path_segments

However:

    $ perl -MURI::file -e 'print URI::file->new_abs("/foo"), "\n"'
    file:/foo

I would have expected file:///foo. I just find 'file:///foo' very ugly so I try to avoid using the triple slash whenever I can. There is also a slight semantic difference between these two forms. If the authority is missing it means that it is unknown, while an empty authority is documented to be a synonym for localhost. Perhaps this can be used to argue that 'file:///foo' is more correct. These one-slash file: URIs cause various interoperability problems here and there with applications or other libraries that require strict RFC compliance. For example, XML::LibXML::SAX seems to treat file:/foo as a literal relative path from the current directory (ie. $PWD/file:/foo), whereas file:///foo works with it as expected. Do you have other examples? Would it be possible to have this fixed in URI? Sure. Especially if I'm told about more apps that can't interoperate with authority-less file URIs. I might want to make it an option. Regards, Gisle
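For concreteness, here is roughly what building the two file-URI forms from an absolute Unix path involves. The helper and its escaping set are a simplified sketch of the RFC 2396 rules discussed above, not the URI::file implementation:

```perl
use strict;
use warnings;

# Simplified sketch: percent-encode a Unix path (everything outside the
# RFC 2396 unreserved set, keeping "/") and build either file URI form.
sub file_uri {
    my ($path, $with_authority) = @_;
    die "absolute path required" unless $path =~ m{^/};
    $path =~ s{([^A-Za-z0-9\-_.!~*'()/])}{sprintf("%%%02X", ord $1)}ge;
    return $with_authority ? "file://$path"   # empty authority = localhost
                           : "file:$path";    # authority omitted = unknown
}

my $plain = file_uri("/foo bar", 0);   # the one-slash form Gisle prefers
my $full  = file_uri("/foo bar", 1);   # the triple-slash form Ville expects
```

The interoperability argument in the thread is visible right in the output: strict consumers parse the one-slash form's path-only syntax differently from the net_path form.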
Re: URI::file not RFC 1738 compliant?
Bjoern Hoehrmann [EMAIL PROTECTED] writes: Unhandled Exception: System.UriFormatException: Invalid URI: The format of the URI could not be determined. Ok. URI-1.32 has just been uploaded and it revises how we map filenames to file URIs. Some examples with the new module:

    $ perl -MURI::file -le 'print URI::file->new("foo", "unix")'
    foo
    $ perl -MURI::file -le 'print URI::file->new("/foo", "unix")'
    file:///foo
    $ perl -MURI::file -le 'print URI::file->new("foo", "win32")'
    foo
    $ perl -MURI::file -le 'print URI::file->new("/foo", "win32")'
    file:///foo
    $ perl -MURI::file -le 'print URI::file->new("//h/foo", "win32")'
    file://h/foo
    $ perl -MURI::file -le 'print URI::file->new("c:foo", "win32")'
    file:///c:foo
    $ perl -MURI::file -le 'print URI::file->new("c:\\foo", "win32")'
    file:///c:/foo
Re: Download manager - problem solved
Octavian Rasnita [EMAIL PROTECTED] writes: I have found how to insert HTTP headers in the line:

    $ua->get($url, ":content_file" => $file);

...But I still can't find how to get and send cookies in any other way than trying to manually add the headers for cookies. Just enable the cookie jar, with $ua->cookie_jar({}), and LWP will manage the cookies for you. Regards, Gisle
Re: Breaking a keep alive connection
Bill Moseley [EMAIL PROTECTED] writes: I'm using keep-alives and the form of $ua->get() that uses a callback function to read the data as it arrives. If the callback function dies will the connection always be broken? Yes, unless it dies after the last part of the response has actually been read. That is, will the next request to that server be a new connection, not an existing open connection from a previous keep-alive request? I assume if I've only read the first chunk out of a very large response then the connection will be broken. But I'm not clear what happens if the fetched document is very small (like the first chunk is the entire document). Does size matter? Or would LWP drop the connection regardless? If LWP has provided you with the complete content when your callback dies, then the connection is kept up. Also, is there a way to ask LWP if a request would be to an open connection before making the actual connection? You could roam around in the $ua->conn_cache object, but it is not really documented how the connections are indexed, or what the connection objects actually are. If you want to make sure LWP has no more connections open, you can call $ua->conn_cache->drop; Regards, Gisle
Re: Byte Order Mark mucks up headers
Phil Archer [EMAIL PROTECTED] writes: I've read Sean Burke's book, I've looked through the archives of this list and done other searches, but can't find an answer to a problem I have found with LWP. If the character coding for a website has a byte order mark (things like utf-16, all that big endian/little endian stuff) then LWP can't interpret HTML headers in the usual way. Does anyone know a way around this? HTML::HeadParser needs to be fixed. It will assume that there is no head section when it sees text before anything else. The part of the code responsible for this currently allows whitespace, but needs to be taught that a BOM is harmless too. Look at the 'text' method. Do you want to try to provide a patch? Regards, Gisle
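Until HTML::HeadParser learns to skip a BOM itself, a workaround along these lines should do. The sample markup and title are invented; the idea is simply to decode first and strip any leading U+FEFF before handing the text to the parser:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::HeadParser;

# Raw bytes as fetched: a UTF-8 BOM (EF BB BF) before the markup.
my $bytes = "\xEF\xBB\xBF<html><head><title>Hi</title></head></html>";

# Decode to characters first (use "UTF-16" etc. as appropriate for
# the page), then drop a leading U+FEFF if one survived decoding.
my $text = decode("UTF-8", $bytes);
$text =~ s/\A\x{FEFF}//;

my $p = HTML::HeadParser->new;
$p->parse($text);
print $p->header("Title"), "\n";   # the title is found again
```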
Re: mirror.al
Bill Moseley [EMAIL PROTECTED] writes: I'm trying to understand this error:

  $ perl -MLWP::UserAgent -le 'LWP::UserAgent->new->mirror("http://perl.org", "perl.org")'
  Can't locate auto/LWP/UserAgent/mirror.al in @INC (@INC contains: /usr/local/lib/perl5/5.8.0/sun4-solaris /usr/local/lib/perl5/5.8.0 /usr/local/lib/perl5/site_perl/5.8.0/sun4-solaris /usr/local/lib/perl5/site_perl/5.8.0 /usr/local/lib/perl5/site_perl .) at -e line 1

Looks like LWP was not installed with the normal 'perl Makefile.PL && make install' drill, which creates the .al files. There are a few versions of Perl installed on this machine -- so I'm wondering if there's some kind of conflict, or if it's just a broken Sun package.

  $ perl -MLWP -le 'print $LWP::VERSION'
  5.10

Newer versions of LWP do not use the autoloader and would not run into this problem. This version of LWP is more than 7 years old. I recommend something newer. Regards, Gisle
Re: LWP installation failed make test: base/date
Craig Cummings [EMAIL PROTECTED] writes: I'm trying to install Bundle::LWP on my Debian Linux system. Interesting. What do these commands print on your system?

  perl -le 'print scalar gmtime(0)'
  perl -le 'print scalar gmtime(760233600)'
  perl -le 'print scalar gmtime(3915993600)'
Re: Promoting Mechanize
Andy Lester [EMAIL PROTECTED] writes: Gisle, can we put some kind of mention of WWW::Mechanize in LWP::UserAgent? Plenty of people know about LWP, but want to do the rest of the stuff that Mech does. See this as an example: http://www.perlmonks.org/?node_id=405988 The LWP::UserAgent manpage[1] already says: | See WWW::Mechanize and WWW::Search for examples of more specialized | user agents based on LWP::UserAgent. Do you want some other reference as well? [1] http://search.cpan.org/dist/libwww-perl/lib/LWP/UserAgent.pm#SEE_ALSO
Re: Segfault using HTML::Parser and URI::URL
Thibaut Britz [EMAIL PROTECTED] writes: the following produces a segfault using the latest version of libwww. I see segfaults with ActivePerl 810 but not with our latest builds. What version of perl are you using? The segfault appears to be a bug in perl, and I would like to find out if the problem has really been fixed. As it seems, HTML::Parser is marking non-UTF8 strings as UTF8 strings. Did you enable the Unicode support when you installed HTML-Parser? It seems like this would be the only time this happens, but I want to be sure. Or to see it:

  #!/usr/bin/perl
  use warnings;
  use strict;
  use Devel::Peek;
  use HTML::Parser;

  my $html = qq{<img title="&rsquo;\260">};
  my $p = HTML::Parser->new(api_version => 3,
                            start_h => [sub { Dump(shift->{title}) }, "attr"]);
  $p->parse($html);

What output do you get?
Re: Segfault using HTML::Parser and URI::URL
The following patch should make sure that HTML::Parser does not produce badly encoded SVs. That avoids the problem demonstrated, but I still need to track down why perl itself segfaulted because of this. Regards, Gisle

Index: util.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/util.c,v
retrieving revision 2.20
retrieving revision 2.21
diff -u -p -r2.20 -r2.21
--- util.c	8 Nov 2004 14:14:35 -0000	2.20
+++ util.c	10 Nov 2004 13:32:56 -0000	2.21
@@ -209,23 +209,21 @@ decode_entities(pTHX_ SV* sv, HV* entity
     }
 
     if (!SvUTF8(sv) && repl_utf8) {
-	STRLEN len = t - SvPVX(sv);
-	if (len) {
-	    /* need to upgrade the part that we have looked though */
-	    STRLEN old_len = len;
-	    char *ustr = bytes_to_utf8(SvPVX(sv), &len);
-	    STRLEN grow = len - old_len;
-	    if (grow) {
-		/* XXX It might already be enough gap, so we don't need this,
-		   but it should not hurt either.
-		*/
-		grow_gap(aTHX_ sv, grow, &t, &s, &end);
-		Copy(ustr, SvPVX(sv), len, char);
-		t = SvPVX(sv) + len;
-	    }
-	    Safefree(ustr);
-	}
+	/* need to upgrade sv before we continue */
+	STRLEN before_gap_len = t - SvPVX(sv);
+	char *before_gap = bytes_to_utf8(SvPVX(sv), &before_gap_len);
+	STRLEN after_gap_len = end - s;
+	char *after_gap = bytes_to_utf8(s, &after_gap_len);
+
+	sv_setpvn(sv, before_gap, before_gap_len);
+	sv_catpvn(sv, after_gap, after_gap_len);
 	SvUTF8_on(sv);
+
+	Safefree(before_gap);
+	Safefree(after_gap);
+
+	s = t = SvPVX(sv) + before_gap_len;
+	end = SvPVX(sv) + before_gap_len + after_gap_len;
     }
     else if (SvUTF8(sv) && !repl_utf8) {
 	repl = bytes_to_utf8(repl, &repl_len);

Index: t/uentities.t
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/t/uentities.t,v
retrieving revision 1.8
retrieving revision 1.9
diff -u -p -r1.8 -r1.9
--- t/uentities.t	8 Nov 2004 14:14:42 -0000	1.8
+++ t/uentities.t	10 Nov 2004 13:33:03 -0000	1.9
@@ -14,7 +14,7 @@ unless (HTML::Entities::UNICODE_SUPPORT
     exit;
 }
 
-print "1..13\n";
+print "1..14\n";
 
 print "not " unless decode_entities("&euro;") eq "\x{20AC}";
 print "ok 1\n";
@@ -90,3 +90,6 @@
 print "ok 12\n";
 
 print "not " unless decode_entities("&#56256;") eq chr(0xFFFD);
 print "ok 13\n";
+
+print "not " unless decode_entities("\260&rsquo;\260") eq "\x{b0}\x{2019}\x{b0}";
+print "ok 14\n";
Re: Segfault using HTML::Parser and URI::URL
Gisle Aas [EMAIL PROTECTED] writes: The following patch should make sure that HTML::Parser does not produce badly encoded SVs. That avoids the problem demonstrated, but I still need to track down why perl itself segfaulted because of this. Perl crashed because the regexp engine did not deal properly with bad UTF8. This will be fixed in perl-5.8.6 by this patch: http://public.activestate.com/cgi-bin/perlbrowse?patch=23261 Regards, Gisle
Re: HTML::Parser plaintext tag
Alex Kapranoff [EMAIL PROTECTED] writes: As far as I can understand, HTML::Parser simply ignores the closing </plaintext> tag. I read the tests and Changes, so I see that this is intended behaviour and that <plaintext> is a special case among the CDATA elements. Does someone know the reasoning behind this decision? :) It is just plain interesting. A long time ago the HTTP protocol did not have MIME-like headers. The client sent a "GET foo" line and the server responded with HTML and then closed the connection. Since there was no way for the server to indicate any other Content-Type than text/html, the <plaintext> tag was introduced so that text files could be served by just prefixing the file content with this tag. This was before the <img> tag was invented, so luckily we don't have a similar unclosed <gif> tag :) Does HTML::Parser imitate some old browser here? Yes, it was there in the beginning but still seems well supported. Of my current browsers, both Konqueror and MSIE support this. Firefox supports it in the same way as <xmp>, i.e. it allows you to escape out of it with </plaintext>. The <plaintext> tag is described in this historic document: http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html#7 It results in weird effects for me as I write an HTML sanitizer for WebMail. How come? Do you have a need to suppress this behaviour in HTML::Parser? Regards, Gisle
Re: HTML::Parser plaintext tag
Alex Kapranoff [EMAIL PROTECTED] writes: * Alex Kapranoff [EMAIL PROTECTED] [November 11 2004, 11:11]: It results in weird effects for me as I write an HTML sanitizer for WebMail. How come? Do you have a need to suppress this behaviour in HTML::Parser? Yes, I'd like to have an option to resume parsing after `</plaintext>' just as Firefox does. As I understand the original intentions now, I'll try to produce a patch. I've filed ticket 8362 in rt.cpan.org with the patch. It creates an additional boolean attribute `closing_plaintext'. Not that I insist on the naming. Seems good; I've just uploaded HTML-Parser-3.38 with this patch.
Re: [PATCH] Caching/reusing WWW::RobotRules(::InCore)
Ville Skyttä [EMAIL PROTECTED] writes: The current behaviour of LWP::RobotUA, when passed an existing WWW::RobotRules::InCore object, is counterintuitive to me. I am of this opinion because of the documentation of $rules in LWP::RobotUA->new() and WWW::RobotRules->agent(), as well as the implementation in WWW::RobotRules::AnyDBM_File. Currently, W::R::InCore always empties the cache when agent() is called, regardless of whether the agent name changed or not. W::R::AnyDBM_File does not seem to have this problem. I suggest applying the attached patch to fix this. Applied. Will be in 5.801. Regards, Gisle

Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.30
diff -a -u -r1.30 RobotRules.pm
--- lib/WWW/RobotRules.pm	9 Apr 2004 15:09:14 -0000	1.30
+++ lib/WWW/RobotRules.pm	12 Oct 2004 06:39:34 -0000
@@ -185,10 +185,12 @@
 	# "FooBot/1.2" => "FooBot"
 	# "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]]" => "FooBot"
 
-	delete $self->{'loc'};   # all old info is now stale
 	$name = $1 if $name =~ m/(\S+)/; # get first word
 	$name =~ s!/.*!!;  # get rid of version
-	$self->{'ua'}=$name;
+	unless ($old && $old eq $name) {
+	    delete $self->{'loc'};   # all old info is now stale
+	    $self->{'ua'} = $name;
+	}
     }
     $old;
 }
Re: Patch for WWW::RobotsRules.pm
Bill Moseley [EMAIL PROTECTED] writes: I've got a spider that uses LWP::RobotUA (WWW::RobotRules), and a few users of the spider have complained that the warning messages were not obvious enough. I guess I can agree, because when they are spidering multiple hosts the message doesn't tell them which robots.txt had a problem. The patch I've now applied is this one:

Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.31
retrieving revision 1.32
diff -u -p -u -r1.31 -r1.32
--- lib/WWW/RobotRules.pm	12 Nov 2004 16:05:09 -0000	1.31
+++ lib/WWW/RobotRules.pm	12 Nov 2004 16:14:25 -0000	1.32
@@ -1,8 +1,8 @@
 package WWW::RobotRules;
 
-# $Id: RobotRules.pm,v 1.31 2004/11/12 16:05:09 gisle Exp $
+# $Id: RobotRules.pm,v 1.32 2004/11/12 16:14:25 gisle Exp $
 
-$VERSION = sprintf("%d.%02d", q$Revision: 1.31 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf("%d.%02d", q$Revision: 1.32 $ =~ /(\d+)\.(\d+)/);
 sub Version { $VERSION; }
 
 use strict;
@@ -70,7 +70,7 @@ sub parse {
 	}
 	elsif (/^\s*Disallow\s*:\s*(.*)/i) {
 	    unless (defined $ua) {
-		warn "RobotRules: Disallow without preceding User-agent\n";
+		warn "RobotRules <$robot_txt_uri>: Disallow without preceding User-agent\n" if $^W;
 		$is_anon = 1;  # assume that User-agent: * was intended
 	    }
 	    my $disallow = $1;
@@ -97,7 +97,7 @@
 	    }
 	}
 	else {
-	    warn "RobotRules: Unexpected line: $_\n";
+	    warn "RobotRules <$robot_txt_uri>: Unexpected line: $_\n" if $^W;
 	}
     }

So maybe something like:

--- RobotRules.pm.old	2004-04-09 08:37:08.0 -0700
+++ RobotRules.pm	2004-09-16 09:46:03.0 -0700
@@ -70,7 +70,7 @@
 	}
 	elsif (/^\s*Disallow\s*:\s*(.*)/i) {
 	    unless (defined $ua) {
-		warn "RobotRules: Disallow without preceding User-agent\n";
+		warn "RobotRules: [$robot_txt_uri] Disallow without preceding User-agent\n";
 		$is_anon = 1;  # assume that User-agent: * was intended
 	    }
 	    my $disallow = $1;
@@ -97,7 +97,7 @@
 	    }
 	}
 	else {
-	    warn "RobotRules: Unexpected line: $_\n";
+	    warn "RobotRules: [$robot_txt_uri] Unexpected line: $_\n";
 	}
 }
Re: WWW::RobotRules warning could be more helpful
[EMAIL PROTECTED] writes: If you spider several sites and one of them has a broken robots.txt file, you can't tell which one since the warning doesn't tell you. This will be better in 5.801. I've applied a variation of Bill Moseley's suggested patch for the same problem. Around line 73 of RobotRules.pm change:

    warn "RobotRules: Disallow without preceding User-agent\n";

to

    # [EMAIL PROTECTED]: added $netloc
    warn "RobotRules: <$netloc> Disallow without preceding User-agent\n";
Re: / uri escaped in LWP::Protocol::file
Moshe Kaminsky [EMAIL PROTECTED] writes: It appears to me there is a small bug in LWP::Protocol::file. The '/' added to the end of each directory member which is itself a directory is escaped when turning it into a URL, making the URL quite useless. I suggest the following patch: Finally applied. Thanks! Regards, Gisle

--- /usr/lib/perl5/vendor_perl/5.8.4/LWP/Protocol/file.old	2004-09-19 22:56:35.786858776 +0300
+++ /usr/lib/perl5/vendor_perl/5.8.4/LWP/Protocol/file.pm	2004-09-19 22:56:24.0 +0300
@@ -96,14 +96,13 @@
     closedir(D);
 
     # Make directory listing
+    my $pathe = $path . ( $^O eq 'MacOS' ? ':' : '/');
     for (@files) {
-	if ($^O eq "MacOS") {
-	    $_ .= "/" if -d "$path:$_";
-	}
-	else {
-	    $_ .= "/" if -d "$path/$_";
-	}
 	my $furl = URI::Escape::uri_escape($_);
+	if ( -d "$pathe$_" ) {
+	    $furl .= '/';
+	    $_ .= '/';
+	}
 	my $desc = HTML::Entities::encode($_);
 	$_ = qq{<LI><A HREF="$furl">$desc</A>};
     }
Re: HTML::HeadParser
David Hofmann [EMAIL PROTECTED] writes: I'm currently using your perl module for processing input from a spider I wrote. The problem I'm encountering is that some pages have tags in the title. Example HTML:

  <TITLE>274500 - XL: Save Changes in Bookname Prompt Even If No Changes Are Made</TITLE>

The result I get back is "XL: Save Changes in ". Also the description, keywords and last-modified come back blank on these pages if they were after the title on the page. It looks like most other browsers parse title stuff in what the HTML::Parser sources call literal mode. I've now applied the following patch to my sources, but I'm not really sure this is a good idea. I might still decide to revert it before release.

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.98
retrieving revision 2.99
diff -u -p -u -r2.98 -r2.99
--- hparser.c	11 Nov 2004 10:12:51 -0000	2.98
+++ hparser.c	15 Nov 2004 22:19:49 -0000	2.99
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.98 2004/11/11 10:12:51 gisle Exp $
+/* $Id: hparser.c,v 2.99 2004/11/15 22:19:49 gisle Exp $
  *
  * Copyright 1999-2002, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -27,6 +27,7 @@ literal_mode_elem[] =
     {5, "style", 1},
     {3, "xmp", 1},
     {9, "plaintext", 1},
+    {5, "title", 0},
     {8, "textarea", 0},
     {0, 0, 0}
 };

The problem here is that other browsers seem to switch into a mode where tags inside title are still recognized if no </title> end tag was found in the document. HTML::Parser does not have the brains to do something like this. It tries to parse the document in a stream-like fashion, and buffering it all to figure out what quirk-mode to parse in does not seem attractive. Regards, Gisle
libwww-perl-5.801
Eventually I found time to fix the problem with code references as content that was introduced by 5.800, and to integrate some more patches. I will probably make a 5.802 later this week, so if there are new or old patches you really want applied this would be a good time to speak up. The changes since 5.800 are:

  HTTP::Message: improved content/content_ref interaction. Fixes DYNAMIC_FILE_UPLOAD and other uses of code content in requests.

  HTML::Form:
    - Handle clicking on a nameless image.
    - Don't let $form->click invoke a disabled submit button.

  HTTP::Cookies could not handle an old-style cookie named "Expires".

  HTTP::Headers: work-around for a thread safety issue in perl <= 5.8.4.

  HTTP::Request::Common: improved documentation.

  LWP::Protocol: Check that we can write to the file specified in $ua->request(..., $file) or $ua->mirror.

  LWP::UserAgent: clone() died if proxy was not set. Patch by Andy Lester [EMAIL PROTECTED].

  Net::HTTP::Methods now avoids a use-of-uninitialized warning when the server replies with an incomplete status line.

  lwp-download will now actually tell you why it aborts if it runs out of disk space or fails to write in some other way.

  WWW::RobotRules: only display warnings when running under 'perl -w', and show which robots.txt file they correspond to. Based on a patch by Bill Moseley.

  WWW::RobotRules: Don't empty the cache when agent() is called if the agent name does not change. Patch by Ville Skyttä [EMAIL PROTECTED].

Enjoy! Regards, Gisle
HTML-Parser-3.39_90
I just uploaded HTML-Parser-3.39_90 to CPAN. It is supposed to have proper handling of Unicode on perl-5.8 or better. The compile time option to select decoding of Unicode entities is gone. This release also makes <title>...</title> parse in literal mode. If there are many pages out there with non-terminated title elements this might not be such a good idea, so this change might not stay. Please try it out and see if you find problems with it. Regards, Gisle
Re: URI doesn't accept a semi-colon as query parameter separator
Brian Cassidy [EMAIL PROTECTED] writes: I was testing an app at the command line which does some query and URL manipulation. At one point, I pass the URL as generated from CGI.pm, which happens to use a semi-colon (rather than an ampersand) as the query parameter separator. Once I tried to access the params from the hash URI returns from query_form(), I noticed that there was only 1 param instead of the many more I was expecting. Is using the URI::Semi class workable for you? If not, why? http://www.rosat.mpe-garching.mpg.de/mailing-lists/libwww-perl/2002-09/msg00022.html
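If URI::Semi is not an option, a common workaround is to normalize the separators before calling query_form(). A small sketch under that assumption; the URL below is made up:

```perl
use strict;
use warnings;
use URI;

# CGI.pm-style query string using ";" as the pair separator:
my $u = URI->new("http://www.example.com/app?a=1;b=2;b=3");

# URI versions of this era split query_form() pairs on "&" only, so a
# ";"-separated query looks like one big parameter.  Rewriting the
# separators first sidesteps that:
(my $q = $u->query) =~ s/;/&/g;
$u->query($q);

my @pairs = $u->query_form;   # key/value list, duplicates preserved
print "@pairs\n";             # a 1 b 2 b 3
```

Note that assigning `query_form()` to a hash would still collapse the repeated `b` keys; keeping the list form preserves all pairs.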
Re: user agents
Zed Lopez [EMAIL PROTECTED] writes: I'd like to suggest these differences be documented. I agree this is wrong. Do you want to suggest a doc patch? Does anyone know why _trivial_http_get uses its own user agent and HTTP version? Because it is a totally different client implementation with its own bugs and limitations. You can force always using the full LWP client implementation by importing $ua from LWP::Simple. Regards, Gisle
HTML-Parser-3.41
HTML-Parser-3.41 is available from CPAN. The major news is that HTML::Parser should now do the right thing with Unicode strings, and that the compile time option to enable Unicode entities is gone. There is a new 'utf8_mode' that allows saner parsing of raw undecoded UTF-8. The Unicode support is only available if you use perl-5.8 or better. Other noteworthy recent changes:

  - <title> content parsed in literal mode
  - <script> and <style> skip quoted strings when looking for the matching end tag
  - if no matching end tag is found for <script>, <style>, <xmp>, <title>, <textarea>, then generate one where the next tag starts
  - will decode unterminated entities in 'dtext', i.e. "foo&nbspbar" becomes "foo bar" (with a non-breaking space)

Enjoy!
libwww-perl-5.802
libwww-perl-5.802 is available from CPAN. The changes since 5.801 are:

  HTTP::Message: The object now has a decoded_content() method. This will return the content after any Content-Encodings and charsets have been decoded. Compress::Zlib is now a prerequisite module.

  HTTP::Request::Common: The POST() function created an invalid Content-Type header for file uploads with no parameters.

  Net::HTTP: Allow Transfer-Encoding with trailing whitespace. http://rt.cpan.org/Ticket/Display.html?id=3929

  Net::HTTP: Don't allow empty content to be treated as a valid HTTP/0.9 response. http://rt.cpan.org/Ticket/Display.html?id=4581 http://rt.cpan.org/Ticket/Display.html?id=6883

  LWP::Protocol::file: Fix up directory links in the HTML generated for directories. Patch by Moshe Kaminsky [EMAIL PROTECTED].

  Makefile.PL will try to discover misconfigured systems that can't talk to themselves, and disable the tests that depend on this.

  Makefile.PL will now default to 'n' when asking whether to install the GET, HEAD and POST programs. There have been too many name clashes with these common names.

Enjoy!
decoded_content
Gisle Aas [EMAIL PROTECTED] writes: The HTTP::Message object now has a decoded_content() method. This will return the content after any Content-Encodings and charsets have been decoded. The current $mess->decoded_content implementation is quite naive in its mapping of charsets. It needs to either start using Björn's HTML::Encoding module or start doing similar sniffing to better guess the charset when the Content-Type header does not provide one. I also plan to expose a $mess->charset method that would just return the guessed charset, i.e. something similar to encoding_from_http_message() provided by HTML::Encoding. Another problem is that I have no idea how well the charset names found in HTTP/HTML map to the encoding names that the perl Encode module supports. Anybody know what the state here is? When this works, the next step is to figure out the best way to do streamed decoding. This is needed for the HeadParser that LWP invokes. The main motivation for decoded_content is that HTML::Parser now works better if properly decoded Unicode can be provided to it, but it still fails here:

  $ lwp-request -d www.microsoft.com
  Parsing of undecoded UTF-8 will give garbage when decoding entities at lib/LWP/Protocol.pm line 114.

Here decoded_content needs to sniff the BOM to be able to guess that they use UTF-8, so that a properly decoded string can be provided to HTML::HeadParser. The decoded_content method also solves the frequent request of supporting compressed content. Just do something like this:

  $ua = LWP::UserAgent->new;
  $ua->default_header("Accept-Encoding" => "gzip, deflate");
  $res = $ua->get("http://www.example.com");
  print $res->decoded_content(charset => "none");

Regards, Gisle
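On the question of how HTTP/HTML charset names map to Encode's encoding names: Encode::resolve_alias() can be used to probe this directly. A small sketch; the charset labels below are just examples:

```perl
use strict;
use warnings;
use Encode ();

# Encode::resolve_alias() maps a charset label to Encode's canonical
# encoding name, or returns false if Encode has no mapping for it.
for my $charset ("iso-8859-1", "UTF-8", "Shift_JIS", "no-such-charset") {
    my $canon = Encode::resolve_alias($charset);
    printf "%-16s => %s\n", $charset, $canon ? $canon : "(unknown)";
}
```

A decoded_content-style implementation could use this check to fall back (or croak) cleanly when a response advertises a charset Encode does not know.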
Re: user agents
Zed Lopez [EMAIL PROTECTED] writes: On 01 Dec 2004 01:35:13 -0800, Gisle Aas [EMAIL PROTECTED] wrote: Zed Lopez [EMAIL PROTECTED] writes: I'd like to suggest these differences be documented. I agree this is wrong. Do you want to suggest a doc patch? I'm working on the doc patch... would it be considered desirable to document that a user can get get() to drive HTTP::Request by setting $LWP::Simple::FULL_LWP to a true value? Or that one can use get_old() to drive HTTP::Request? Obviously, no one wants to add a lot of complexity to a ::Simple module, but right now the behavior includes: the user agent and HTTP version are subject to change if an HTTP proxy is in use or if the requested page does a redirect. And there's no way to code around that within this module's public interface. It is documented (barely) that the module exports the variable '$ua'. A side effect of importing this variable is that it forces the full LWP::UserAgent implementation to be used, otherwise settings on the $ua object would have no effect. I want to declare this as the official interface to force this, and not document either get_old or $FULL_LWP. Regards, Gisle
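The official interface described above looks like this in practice. The agent string and timeout are invented values; only the `$ua` import itself is the documented mechanism:

```perl
use strict;
use warnings;

# Importing $ua (even if otherwise unused) is what switches
# LWP::Simple's get() from the trivial built-in HTTP client to a
# real LWP::UserAgent.
use LWP::Simple qw(get $ua);

# The shared agent can now be configured before calling get():
$ua->agent("MyFetcher/0.1");   # hypothetical agent string
$ua->timeout(30);

# my $html = get("http://www.example.com/");   # network call, not run here
```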
Re: HTML::Parser 3.40/3.41 and UTF8 on perl 5.8.0
Reed Russell - rreed [EMAIL PROTECTED] writes: The sv_catpvn_utf8_upgrade macro used in hparser.c in versions 3.40 and 3.41 of HTML::Parser doesn't seem to exist in Perl 5.8.0. Can the macro be replaced, so that the module is compatible with this version of Perl? Sure. Applied. I've simplified your patch to be:

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.117
diff -u -p -r2.117 hparser.c
--- hparser.c	2 Dec 2004 11:14:59 -0000	2.117
+++ hparser.c	2 Dec 2004 11:50:59 -0000
@@ -300,8 +300,10 @@ report_event(PSTATE* p_state,
 	    sv_catpvn(p_state->pend_text, beg, end - beg);
 	}
 	else {
-	    SV *tmp = NULL;
-	    sv_catpvn_utf8_upgrade(p_state->pend_text, beg, end - beg, tmp);
+	    SV *tmp = newSVpvn(beg, end - beg);
+	    sv_utf8_upgrade(tmp);
+	    sv_catsv(p_state->pend_text, tmp);
+	    SvREFCNT_dec(tmp);
 	}
 #else
 	sv_catpvn(p_state->pend_text, beg, end - beg);
@@ -639,8 +641,10 @@ IGNORE_EVENT:
 #ifdef UNICODE_HTML_PARSER
 	}
 	else {
-	    SV *tmp = NULL;
-	    sv_catpvn_utf8_upgrade(p_state->skipped_text, beg, end - beg, tmp);
+	    SV *tmp = newSVpvn(beg, end - beg);
+	    sv_utf8_upgrade(tmp);
+	    sv_catsv(p_state->pend_text, tmp);
+	    SvREFCNT_dec(tmp);
 	}
 #endif
     }
Re: HTTP::Response inconsistency
Harald Joerg [EMAIL PROTECTED] writes: HTTP::Response::clone doesn't clone the protocol either. This, however, can be fixed easily: Thanks. Applied this patch to HTTP::Message so that Requests also clone their protocol attribute.

--- Response.pm.1.50	2004-12-02 21:36:42.43750 +0100
+++ Response.pm	2004-12-02 21:37:18.34375 +0100
@@ -47,4 +47,5 @@
     my $self = shift;
     my $clone = bless $self->SUPER::clone, ref($self);
+    $clone->protocol($self->protocol);
     $clone->code($self->code);
     $clone->message($self->message);

-- Cheers, haj
Re: HTTP::Response inconsistency
Harald Joerg [EMAIL PROTECTED] writes: As a fallback, HTTP::Response::parse could set the protocol to undef if it turns out to be a three-digit number, assigning this value to the code (after assigning to the message what was parsed as the code). This is my preferred fix. Just make HTTP::Response::parse deal with what as_string spits out. I would just make it look at the string before splitting it. If it starts with /\d/, split in 2 instead of 3. Maybe the best fallback would be to write some undefined value in HTTP::Response::as_string if the protocol is undefined:

      my $status_line = "$code";
      my $proto = $self->protocol;
-     $status_line = "$proto $status_line" if $proto;
+     $status_line = $proto ? "$proto $status_line"
+                           : "UNKNOWN $status_line";

But again, this might break existing code. I also find this quite ugly. I could submit patches for all the fallbacks and workarounds - That would be very much appreciated. Regards, Gisle
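The preferred fallback above ("if it starts with /\d/, split in 2 instead of 3") can be sketched as standalone code. `parse_status_line` is a hypothetical helper for illustration, not part of HTTP::Response:

```perl
use strict;
use warnings;

# If as_string emitted no protocol, the status line starts with the
# numeric code; split into 2 fields then, otherwise into 3.
sub parse_status_line {
    my $line = shift;
    my ($protocol, $code, $message);
    if ($line =~ /^\d/) {
        ($code, $message) = split(' ', $line, 2);
    }
    else {
        ($protocol, $code, $message) = split(' ', $line, 3);
    }
    return ($protocol, $code, $message);
}

my @with    = parse_status_line("HTTP/1.1 404 Not Found");
my @without = parse_status_line("404 Not Found");  # protocol stays undef
```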
Re: Bug in HTML::Form label support
Dan Kubb [EMAIL PROTECTED] writes:

  <label>
    <input type="radio" name="r1" value="1">One
  </label>

Is <label> in common use? What browsers support it? Regards, Gisle
Re: HTML::Parser 3.42: some tests fail on MSWin32
Bjoern Hoehrmann [EMAIL PROTECTED] writes: HTML::Parser 3.41/3.42 fails on some tests on MSWin32, see This should be fixed in 3.43 that I just uploaded. The SvUTF8 flag was not propagated correctly when replacing unterminated entities. Regards, Gisle
Re: suggestion for $ua-env_proxy method
bulk 88 [EMAIL PROTECTED] writes: Can the env_proxy method return the result of getting the proxy settings from the environment, so that this will work?

  $EnvProxyResult = $ua->env_proxy;

I would like it so I can have a proper "Using proxy settings from environment." line, or a "Forced proxy settings from environment not found." line. Currently, it just returns false whether it gets the proxy settings or not. I don't see a problem with this. It is more likely to happen if you are able to provide a patch. Especially if the patch also updates the documentation and the test suite appropriately. Regards, Gisle
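In the meantime, the effect of env_proxy can already be observed through the proxy() accessor. A small sketch; the proxy URL below is made up:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Simulate an environment-configured proxy for this example:
local $ENV{http_proxy} = "http://proxy.example.com:8080/";

my $ua = LWP::UserAgent->new;
$ua->env_proxy;   # reads *_proxy variables into the agent

# env_proxy itself reports nothing useful, but the result can be
# inspected afterwards:
if (my $p = $ua->proxy("http")) {
    print "Using proxy settings from environment: $p\n";
}
else {
    print "No proxy settings found in environment.\n";
}
```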
Re: libwww@perl.org
Tony [EMAIL PROTECTED] writes: I've been trying to install the LWP bundle for several days. I saw that URI-1.34.tar.gz was unavailable. I had to go to http://cpan.n0i.net/modules/by-module/URI/ to download URI-1.35.tar.gz. Why does cpan go after the old version? Stale local index cache? The cpan:modules/02packages.details.txt.gz index points to URI-1.35 as it should. Regards, Gisle
Re: Can't use www::mechanize with an array form field
Tim [EMAIL PROTECTED] writes: I have a website written in PHP/MySQL. I'm using WWW::Mechanize and WWW::Mechanize::FormFiller to test the site. I declare one of the form fields as an array in PHP like so:

  echo "<input type=\"checkbox\" name=\"cat[]\" value=\"$cat_id\">" . VarPrepForDisplay($title);

which in turn creates the following HTML code that WWW::Mechanize uses to test the site:

  <input type="checkbox" name="cat[]" value="164">

This makes the cat field an array. The problem is that when I try to use WWW::Mechanize to submit values to this field I get the following error:

  Illegal value '211' for field 'cat[]' at /path.pl line 89

Does anyone know how I can submit values to an array-based form field? I don't know what it takes in WWW::Mechanize land, but if you have access to the HTML::Form object you can do this:

  $form->param("cat[]", 211, 213);

This will turn on the 211 and 213 checkboxes and turn all the other cat[] checkboxes off. Regards, Gisle
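Putting the reply together as a runnable sketch. The form markup mirrors the PHP output above with a couple of extra checkboxes added for illustration, and the base URL is invented:

```perl
use strict;
use warnings;
use HTML::Form;

my $html = <<'HTML';
<form action="/search" method="post">
  <input type="checkbox" name="cat[]" value="164">
  <input type="checkbox" name="cat[]" value="211">
  <input type="checkbox" name="cat[]" value="213">
</form>
HTML

# parse() needs a base URI to resolve the form's action against.
my $form = HTML::Form->parse($html, "http://www.example.com/");

# param() with a value list checks exactly the boxes whose values
# are given and unchecks the rest of the "cat[]" group:
$form->param("cat[]", 211, 213);

my @checked = $form->param("cat[]");   # currently selected values
print "@checked\n";
```

In WWW::Mechanize the same object is reachable via `$mech->current_form` after the page is fetched.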
Re: libwww-perl-5.802
Moshe Kaminsky [EMAIL PROTECTED] writes: * Gisle Aas [EMAIL PROTECTED] [01/12/04 12:02]: libwww-perl-5.802 is available from CPAN. The changes since 5.801 are: The HTTP::Message object now has a decoded_content() method. This will return the content after any Content-Encodings and charsets have been decoded. For some reason, the original content is killed in the response object when I use this method - the content() method returns an empty string after calling decoded_content. The reason appears to be passing $$content_ref to Encode::decode in line 220 of HTTP/Message.pm. I guess it's probably some problem with decode(), but in any case, replacing that line with

  my $cont = $$content_ref;
  $content_ref = \Encode::decode($charset, $cont, Encode::FB_CROAK());

solved the problem. This is with HTTP::Message version 1.52, perl version 5.8.6, Encode version 2.08 on linux. Thanks for your report. There was a similar issue with memGunzip, and the patch I applied for it will also fix this problem. Also, I would like to suggest adding a flag which will cause the content() method to return the output of decoded_content(). This will allow scripts which ignored the charset to automatically do the right thing by simply setting this flag. I'm not too happy about this suggestion as-is. One option is to introduce a '$mess->decode_content' method and then make LWP::UserAgent grow some option that makes it automatically call this for all responses it receives. The 'decode_content' would be like

  $resp->content(encode_utf8($res->decoded_content));

but would also fix up the Content-Encoding and Content-Type headers. Regards, Gisle
Re: calling decoded_content on gzipped content destroys raw content
Andreas Beckmann [EMAIL PROTECTED] writes: I found the new decoded_content method destroying the raw content if Content-Encoding was gzip. This happens because of Compress::Zlib::memGunzip: "The contents of the buffer parameter are destroyed after calling this function." I fixed this the following way in HTTP/Message.pm:

  -$content_ref = \Compress::Zlib::memGunzip($$content_ref);
  +$content_ref = \Compress::Zlib::memGunzip(my $buf = $$content_ref);

I didn't check the other decoding functions, so this could happen at other places, too. Encode::decode() also destroys its argument. I've now applied the patch below. Thanks for the decoded_content function - this makes using compression a lot easier :-) Perhaps an option to replace the current raw content could be added; this would also have to change the Content-Encoding and Content-Type/charset headers. I can see that might be useful. The 'content' is supposed to be bytes, so the result would have to be encoded UTF-8, while 'decoded_content' returns decoded UTF-8. I think it is better to have a 'decode_content' method (a verb) than for 'decoded_content' to suddenly have a side effect on the message when given an option.
Regards, Gisle

Index: lib/HTTP/Message.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Message.pm,v
retrieving revision 1.54
retrieving revision 1.55
diff -u -p -r1.54 -r1.55
--- lib/HTTP/Message.pm	3 Dec 2004 08:35:41 -0000	1.54
+++ lib/HTTP/Message.pm	6 Dec 2004 13:27:20 -0000	1.55
@@ -1,10 +1,10 @@
 package HTTP::Message;
 
-# $Id: Message.pm,v 1.54 2004/12/03 08:35:41 gisle Exp $
+# $Id: Message.pm,v 1.55 2004/12/06 13:27:20 gisle Exp $
 
 use strict;
 use vars qw($VERSION $AUTOLOAD);
-$VERSION = sprintf("%d.%02d", q$Revision: 1.54 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf("%d.%02d", q$Revision: 1.55 $ =~ /(\d+)\.(\d+)/);
 
 require HTTP::Headers;
 require Carp;
@@ -161,6 +161,7 @@
 sub decoded_content
 {
     my($self, %opt) = @_;
     my $content_ref;
+    my $content_ref_iscopy;
 
     eval {
@@ -183,6 +184,12 @@ sub decoded_content
 	    next unless $ce || $ce eq "identity";
 	    if ($ce eq "gzip" || $ce eq "x-gzip") {
 		require Compress::Zlib;
+		unless ($content_ref_iscopy) {
+		    # memGunzip is documented to destroy its buffer argument
+		    my $copy = $$content_ref;
+		    $content_ref = \$copy;
+		    $content_ref_iscopy++;
+		}
 		$content_ref = \Compress::Zlib::memGunzip($$content_ref);
 		die "Can't gunzip content" unless defined $$content_ref;
 	    }
@@ -190,11 +197,13 @@ sub decoded_content
 		require Compress::Bzip2;
 		$content_ref = Compress::Bzip2::decompress($$content_ref);
 		die "Can't bunzip content" unless defined $$content_ref;
+		$content_ref_iscopy++;
 	    }
 	    elsif ($ce eq "deflate") {
 		require Compress::Zlib;
 		$content_ref = \Compress::Zlib::uncompress($$content_ref);
 		die "Can't inflate content" unless defined $$content_ref;
+		$content_ref_iscopy++;
 	    }
 	    elsif ($ce eq "compress" || $ce eq "x-compress") {
 		die "Can't uncompress content";
@@ -202,10 +211,12 @@ sub decoded_content
 	    elsif ($ce eq "base64") {  # not really C-T-E, but should be harmless
 		require MIME::Base64;
 		$content_ref = \MIME::Base64::decode($$content_ref);
+		$content_ref_iscopy++;
 	    }
 	    elsif ($ce eq "quoted-printable") {  # not really C-T-E, but should be harmless
 		require MIME::QuotedPrint;
 		$content_ref = \MIME::QuotedPrint::decode($$content_ref);
+		$content_ref_iscopy++;
 	    }
 	    else {
 		die "Don't know how to decode Content-Encoding '$ce'";
@@ -218,7 +229,16 @@ sub decoded_content
 	    $charset = lc($charset);
 	    if ($charset ne "none") {
 		require Encode;
-		$content_ref = \Encode::decode($charset, $$content_ref, Encode::FB_CROAK());
+		if (do{my $v = $Encode::VERSION; $v =~ s/_//g; $v} < 2.0901 &&
+		    !$content_ref_iscopy)
+		{
+		    # LEAVE_SRC did not work before Encode-2.0901
+		    my $copy = $$content_ref;
+		    $content_ref = \$copy;
+		    $content_ref_iscopy++;
+		}
+		$content_ref = \Encode::decode($charset, $$content_ref,
+					       Encode::FB_CROAK() | Encode::LEAVE_SRC());
 	    }
 	}
     };
Re: [patch] Allow a directory as lwp-download's 2nd argument
Radoslaw Zielinski [EMAIL PROTECTED] writes:

  The attached patch allows specifying a directory as lwp-download's
  second argument. It also makes "0" a valid destination file name.

Thanks. Applied.

Regards, Gisle
Re: [PATCH] HTTP::Daemon defaults
Kees Cook [EMAIL PROTECTED] writes:

  I'd like to see this patch added so that HTTP::Daemon::SSL can more
  cleanly overload the url function without having to totally
  reimplement it.

Thanks. Applied. But I made the defaults 80 and "http" :)

  Also, could HTTP::Daemon::SSL be made part of the libwww bundle?

I don't have a problem with that if its author wants the same.

Regards, Gisle

--- libwww-perl-5.802/lib/HTTP/Daemon.pm	2004-04-09 13:21:43.0 -0700
+++ libwww-perl-5.802-kees/lib/HTTP/Daemon.pm	2004-12-10 10:13:30.0 -0800
@@ -37,10 +37,22 @@
 }
 
 
+sub _default_port {
+    443;
+}
+
+
+sub _default_scheme {
+    "https";
+}
+
+
+# Implemented with calls to _default_port and _default_scheme so that
+# HTTP::Daemon::SSL can overload them and still use this function.
 sub url
 {
     my $self = shift;
-    my $url = "http://";
+    my $url = $self->_default_scheme() . "://";
     my $addr = $self->sockaddr;
     if (!$addr || $addr eq INADDR_ANY) {
 	require Sys::Hostname;
@@ -50,7 +62,7 @@
 	$url .= gethostbyaddr($addr, AF_INET) || inet_ntoa($addr);
     }
     my $port = $self->sockport;
-    $url .= ":$port" if $port != 80;
+    $url .= ":$port" if $port != $self->_default_port();
     $url .= "/";
     $url;
 }
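With these hooks in place, a subclass only needs to override the two default methods and the inherited url() does the right thing. A minimal sketch (the package name and method bodies are illustrative, not the actual HTTP::Daemon::SSL source):

```perl
# Hypothetical subclass: override the two hook methods; the inherited
# url() then builds URLs with the right scheme and default port.
package My::SSLDaemon;
use base 'HTTP::Daemon';

sub _default_scheme { "https" }
sub _default_port   { 443 }

# $d->url now yields "https://host/" and only appends ":$port"
# when the socket is not listening on port 443.
```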
Re: HTTP::Response::base fails if the response has no request
Harald Joerg [EMAIL PROTECTED] writes:

  Once more I'd like to suggest a patch for HTTP::Response. When working
  with my homegrown responses I found that the base method fails fatally
  if the response doesn't have a request inside:

    Can't call method "uri" on an undefined value at
    /usr/lib/perl5/site_perl/5.8.5/HTTP/Response.pm line 78.

  I can work around this by defining a fake request for my responses,
  but I'd prefer if HTTP::Response::base would simply return undef if it
  finds neither a base-defining header nor an embedded request.

Seems fine. I tweaked your patch into this one before I applied it.
Thanks!

Regards, Gisle

Index: lib/HTTP/Response.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Response.pm,v
retrieving revision 1.50
retrieving revision 1.51
diff -u -p -r1.50 -r1.51
--- lib/HTTP/Response.pm	30 Nov 2004 12:00:22 -0000	1.50
+++ lib/HTTP/Response.pm	11 Dec 2004 14:30:00 -0000	1.51
@@ -75,9 +75,20 @@ sub base
     my $base = $self->header('Content-Base')     ||  # used to be HTTP/1.1
                $self->header('Content-Location') ||  # HTTP/1.1
                $self->header('Base');                # HTTP/1.0
-    return $HTTP::URI_CLASS->new_abs($base, $self->request->uri);
-    # So yes, if $base is undef, the return value is effectively
-    # just a copy of $self->request->uri.
+    if ($base && $base =~ /^$URI::scheme_re:/o) {
+	# already absolute
+	return $HTTP::URI_CLASS->new($base);
+    }
+
+    my $req = $self->request;
+    if ($req) {
+	# if $base is undef here, the return value is effectively
+	# just a copy of $self->request->uri.
+	return $HTTP::URI_CLASS->new_abs($base, $req->uri);
+    }
+
+    # can't find an absolute base
+    return undef;
 }
 
@@ -366,6 +377,9 @@ received some redirect responses first.
 
 =back
 
+If neither of these sources provide an absolute URI, undef is
+returned.
+
 When the LWP protocol modules produce the HTTP::Response object, then
 any base URI embedded in the document (step 1) will already have
 initialized the Content-Base: header. This means that this method
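With the patched method, a response that carries neither a base-defining header nor an embedded request no longer dies. A quick sketch of the new behaviour:

```perl
use HTTP::Response;

my $res = HTTP::Response->new(200, "OK");
# No Content-Base/Content-Location/Base header and no request object
# attached, so there is nothing to resolve a base against:
my $base = $res->base;    # undef after the patch, instead of a fatal error

$res->header("Content-Base" => "http://www.example.com/dir/");
$base = $res->base;       # an absolute header value is now used directly
```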
Re: HTTP::Response inconsistency
Harald Joerg [EMAIL PROTECTED] writes:

  Gisle Aas writes:

    Harald Joerg [EMAIL PROTECTED] writes:

      As a fallback, HTTP::Response::parse could set the protocol to
      undef if it turns out to be a three-digit number, assigning this
      value to the code (after assigning to the message what was parsed
      as the code).

    This is my preferred fix. Just make HTTP::Response::parse deal with
    what as_string spits out. I would just make it look at the string
    before splitting it. If it starts with /\d/, split in 2 instead of 3.

  Patch is attached.

Thanks. Applied.

Regards, Gisle

--- Response.pm.1.50	2004-12-02 21:36:42.43750 +0100
+++ Response.pm	2004-12-03 22:10:27.421875000 +0100
@@ -35,5 +35,11 @@
     my $self = $class->SUPER::parse($str);
-    my($protocol, $code, $message) = split(' ', $status_line, 3);
+    my($protocol, $code, $message);
+    if ($status_line =~ /^\d{3} /) {
+	# Looks like a response created by HTTP::Response->new
+	($code, $message) = split(' ', $status_line, 2);
+    } else {
+	($protocol, $code, $message) = split(' ', $status_line, 3);
+    }
     $self->protocol($protocol) if $protocol;
     $self->code($code) if defined($code);
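The point of the patch is to make parse() a proper inverse of as_string() for responses created directly with new(), whose status line carries no protocol. A small sketch of the round trip:

```perl
use HTTP::Response;

# A response built with new() stringifies with a "200 OK" status line
# (no "HTTP/1.x" protocol in front of it).
my $r1 = HTTP::Response->new(200, "OK",
                             ["Content-Type" => "text/plain"], "hi");
my $r2 = HTTP::Response->parse($r1->as_string);

# With the two-way split, "200" ends up in code() and "OK" in message(),
# instead of "200" being mistaken for the protocol.
print $r2->code, " ", $r2->message, "\n";   # 200 OK
```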
Re: How can I PUT a large file?
Rodrigo Ruiz [EMAIL PROTECTED] writes:

  I need to perform a PUT operation and send a very large file (several
  hundred MBytes). I have been using the following code to do this:

    ...
    my $header = HTTP::Headers->new;
    $header->content_type('application/octet-stream');
    $header->content_length($fileSize);
    $header->authorization_basic($usr, $pwd);

    my $readFunc = sub {
        read(FH, my $buf, 65536);
        return $buf;
    };

    my $req = HTTP::Request->new(PUT, $url, $header, $readFunc);
    ...

Seems sane.

  But after updating to the 5.802 version of LWP this code has stopped
  working. When I execute my script, it prints a warning telling me that
  the Content-Length header has been fixed, and the file in the
  destination server is corrupted. Looking at the code of the library, I
  have found these lines:

    # Set (or override) Content-Length header
    my $clen = $request_headers->header('Content-Length');
    if (defined($$content_ref) && length($$content_ref)) {
        $has_content++;
        if (!defined($clen) || $clen ne length($$content_ref)) {
            if (defined $clen) {
                warn "Content-Length header value was wrong, fixed";
                hlist_remove(\@h, 'Content-Length');
            }
            push(@h, 'Content-Length' => length($$content_ref));
        }
    }
    elsif ($clen) {
        warn "Content-Length set when there is no content, fixed";
        hlist_remove(\@h, 'Content-Length');
    }

  I think these lines prevent the use of a function as the content
  reference. Is this a bug, or has the support for function references
  been removed?

No, this is supposed to work. This code block should not be entered, as
there is a test for code reference content just above it. Can you figure
out why the

    if (ref($content_ref) eq 'CODE') {

test fails? What is $content_ref in this case?

Regards, Gisle
Re: How can I PUT a large file?
Gisle Aas [EMAIL PROTECTED] writes:

  No, this is supposed to work.

I've now verified that using code content for the request like this does
indeed work for me when posting to my own server. Unless you can debug
this problem directly with your app, please try to create a complete
(short) example program that demonstrates this problem and send it to
this list.

Regards, Gisle
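Such a self-contained test case might look like the following sketch; the URL and file name are placeholders, and returning the empty string from the callback is how LWP's documented dynamic-content interface signals end of data:

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request;

# Stream a file as the PUT body via a code reference, 64k at a time.
open(my $fh, "<", "bigfile.dat") or die "open: $!";
binmode($fh);

my $req = HTTP::Request->new(PUT => "http://server.example.com/upload");
$req->content_length(-s "bigfile.dat");
$req->content(sub {
    my $n = read($fh, my $buf, 65536);
    return $n ? $buf : "";    # empty string ends the content stream
});

my $res = LWP::UserAgent->new->request($req);
print $res->status_line, "\n";
```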
Re: Bug in HTML::Form label support
Dan Kubb [EMAIL PROTECTED] writes:

  Hi Gisle,

  > Are there other form elements than input that might take labels?

  Yes, all the normal form elements can take labels. I'm just not sure
  how you would use them without adding to or changing the interface in
  HTML::Form. For input tags that are radio buttons or checkboxes it's
  easy: just set the value_name attribute with the label name and the
  existing interface will use it. I can do that for other elements, but
  some of them inherit a noop value_names() method -- I didn't want to
  change this method's behaviour because it says in the docs that the
  values it returns correspond 1 to 1 with the return values from
  possible_values(). Still, it would be nice to set the value of a text
  input like this:

    $form->value('First Name');

  rather than:

    $form->value('contact.name.first');

  I wasn't going to propose any interface changes in my patch without
  checking with you first.

Seems like it might be a good idea to introduce a 'label' attribute for
inputs, but perhaps that creates the wrong expectation for radio and
checkbox entries. Got to ponder that some more.

  > Indentation is not consistent with the rest of the code.

  What's your indenting style for patches? I'm a two-space indenter
  myself. The patch you received had tabs inserted manually just as I
  was finishing up. I tried to find a pattern in HTML::Form, but the
  style wasn't consistent enough for me to pick one up -- I figured
  there must be a lot of different maintainers ;)

It seems consistent to me. Perhaps you have tweaked your tab-stop to not
be the standard 8.

  > +    1 while $attr->{value_name} =~ s/\s\z//;
  >
  > why not '$attr->{value_name} =~ s/\s+\z//;'

  Just finished a project with some large file processing: the "1 while"
  version is faster (strangely enough); there were some benchmarks on
  Perlmonks, I believe.

You learn something new every day. I guess the + is too much for the RE
optimizer here then. Of course it makes no difference with such small
strings; I put it in more out of habit than anything.

  > +    $attr->{value_name} =~ s/\s+/ /;
  >
  > There can't really be multispace anywhere since get_phrase will trim
  > the text. This would always be a noop.

  You're right. I eliminated the need for regexes in a new patch which
  I've attached to this email. I think I've got the formatting right
  this time.

The new patch has now been applied. Thanks.

Regards, Gisle
Re: How can I PUT a large file?
Rodrigo Ruiz [EMAIL PROTECTED] writes:

  The error appeared on the 5.8 version of LWP. My current version is
  5.802, and it has the error fixed. Is this the first version where the
  bug is fixed? Is it enough to do an == comparison, or should I use
  something like:

    my $ref = ($LWP::VERSION >= 5.8 && $LWP::VERSION < 5.802)
        ? \$readFunc : $readFunc;

This bug was only present in one version; libwww-perl-5.800. If you
really still need this workaround I would make it:

    my $readFunc = sub { ... };
    $readFunc = \$readFunc
        if $LWP::VERSION eq "5.800";  # workaround for buggy LWP version

Regards, Gisle
Re: Libhtml parser 3.43 ??
The Saltydog [EMAIL PROTECTED] writes:

  I am experiencing a strange behaviour with libhtml-parser-perl v3.43.
  The strange behaviour is ONLY on this web page:
  http://communicator.virgilio.it

HTML::Parser got confused about how quoted strings nest when parsing one
of the script tags. This made it assume large parts of the document to
be the script element. This buggy behaviour was introduced in v3.40
(v3.39_91). The following patch fixes this problem and will be present
in v3.44 when ready. I expect that to happen soonish.

Regards, Gisle

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.118
retrieving revision 2.119
diff -u -p -u -r2.118 -r2.119
--- hparser.c	2 Dec 2004 11:52:32 -0000	2.118
+++ hparser.c	28 Dec 2004 13:47:44 -0000	2.119
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.118 2004/12/02 11:52:32 gisle Exp $
+/* $Id: hparser.c,v 2.119 2004/12/28 13:47:44 gisle Exp $
  *
  * Copyright 1999-2004, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -1522,7 +1522,7 @@ parse_buf(pTHX_ PSTATE* p_state, char *b
 		    inside_quote = 0;
 		else if (*s == '\r' || *s == '\n')
 		    inside_quote = 0;
-		else if (*s == '"' || *s == '\'')
+		else if (!inside_quote && (*s == '"' || *s == '\''))
 		    inside_quote = *s;
 	    }
 	}
Re: Downloading a page compressed
Andy Lester [EMAIL PROTECTED] writes:

  On Dec 29, 2004, at 6:02 PM, Bjoern Hoehrmann wrote:

    Note that LWP does not automatically remove the gzip compression in
    this case

  WWW::Mechanize does, however.

And LWP does it if you ask for $response->decoded_content instead of
$response->content. The decoded_content method was introduced in
LWP-5.802.

Regards, Gisle
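A minimal sketch of the difference, assuming a server that honours the Accept-Encoding header:

```perl
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get("http://www.example.com/",
                   "Accept-Encoding" => "gzip");

my $raw  = $res->content;           # possibly still gzip-compressed bytes
my $html = $res->decoded_content;   # Content-Encoding removed and the
                                    # charset decoded (LWP >= 5.802)
```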
Re: Data::Dump is missing ? t/local/httpsub.t fails
Gabor Szabo [EMAIL PROTECTED] writes:

  I just noticed this test file only exists in the CVS but is not
  distributed. Still, I guess it should be fixed somehow (probably by
  skipping the test if the module is not there).

It's an unfinished test that I lost interest in completing :-( If
completed, then the Data::Dump reference should clearly go.

Regards, Gisle
Re: Statistics in mech?
Peter Stevens [EMAIL PROTECTED] writes:

  I am using mech to scrape data from various websites. I wanted to
  collect data about the bytes sent and received by my scraper (I need
  this for sizing purposes). I looked through Mech and LWP, but did not
  see any methods which give me that information. Is there a way to do
  this?

Not directly, but you can replace the protocol handler with your own
that counts bytes passed by. This is an example that will count the
bytes sent over http:

  #!/usr/bin/perl -w

  use LWP::UserAgent;
  use LWP::Protocol;
  LWP::Protocol::implementor('http', 'MyHTTP');

  my $bytes_in = 0;
  my $bytes_out = 0;

  my $ua = LWP::UserAgent->new(keep_alive => 1);
  for (1..3) {
      my $res = $ua->get("http://www.example.com");
      print "$_: ", $res->status_line, "\n";
  }

  print "received $bytes_in bytes, sent $bytes_out bytes\n";

  # Overridden protocol handler that counts the bytes transferred
  package MyHTTP;
  use base 'LWP::Protocol::http';

  package MyHTTP::Socket;
  use base 'LWP::Protocol::http::Socket';

  sub sysread {
      my $self = shift;
      my $n = $self->SUPER::sysread(@_);
      $bytes_in += $n if defined($n) && $n > 0;
      return $n;
  }

  sub syswrite {
      my $self = shift;
      my $n = $self->SUPER::syswrite(@_);
      $bytes_out += $n if defined($n) && $n > 0;
      return $n;
  }

  __END__

Regards, Gisle
Re: Avoiding Alarm Clocks While Spidering
Justin Tang [EMAIL PROTECTED] writes:

  I am currently running a spider program derived from an open source
  search engine called SWISH-E. The spider.pl file that I am using uses
  the LWP::RobotUA class. The way I have it set up is that a program
  preps the spider with a list of sites to spider, then calls the spider
  using backticks (``). From there on, the spider becomes a zombie,
  outputting results to a local textfile for me to review later. The
  problem I'm running into is that the LWP class seems to have a timeout
  function that puts the process to sleep after a period of time with a
  message saying "Alarm clock". What is happening is that, since my
  process is a zombie, when it is put to sleep the system kills the
  process. Is there any way around this situation? Is there a command or
  flag in LWP::RobotUA that I can set so it would not be put to sleep?

There is the 'use_sleep' attribute that you might set to a FALSE value.

Regards, Gisle
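A small sketch of what that looks like; the agent name and e-mail address are placeholders, and LWP::RobotUA requires both:

```perl
use LWP::RobotUA;

my $ua = LWP::RobotUA->new("my-spider/0.1", 'spider-admin@example.com');
$ua->use_sleep(0);   # don't sleep() between requests; no more "Alarm clock"
$ua->delay(0);       # optional: also drop the inter-request delay (minutes)
```

Note that disabling the sleep removes the politeness throttling that robots.txt-aware crawlers are expected to have, so use it with care.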
Re: Internal Server Error when GETing with WWW::Mechanize?
James Turnbull [EMAIL PROTECTED] writes:

  The error I get is...

    Error GETing http://www.parcelforce.com:80/portal/pw/track:
    Internal Server Error at track.pl line 5

The server is confused by something in the request that LWP sends. This
is a trace I get with
'lwp-request http://www.parcelforce.com:80/portal/pw/track':

  GET /portal/pw/track HTTP/1.1
  TE: deflate,gzip;q=0.3
  Connection: TE, close
  Host: www.parcelforce.com:80
  User-Agent: lwp-request/2.06

  HTTP/1.1 500 Internal Server Error
  Content-language: en-US
  Content-length: 0
  Content-type: text/html; charset=ISO-8859-1
  Date: Tue, 18 Jan 2005 12:37:25 GMT
  Server: Netscape-Enterprise/6.0
  Set-Cookie: FGNCLIID=42b0olsqf5khpzen020dycwtbh27;expires=Thu, 18 Jan 2007 12:37:26 GMT;path=/
  Connection: Close

--Gisle
Re: Internal Server Error when GETing with WWW::Mechanize?
Gisle Aas [EMAIL PROTECTED] writes:

  James Turnbull [EMAIL PROTECTED] writes:

    The error I get is...

      Error GETing http://www.parcelforce.com:80/portal/pw/track:
      Internal Server Error at track.pl line 5

  The server is confused by something in the request that LWP sends.

This is a buggy server that crashes unless the request sent has an
Accept header. It does not appear to matter what you put in it, as
demonstrated by running:

  $ lwp-request -H Accept:foo http://www.parcelforce.com:80/portal/pw/track

In your app you can work around this problem by telling LWP to always
send an Accept header using code like:

  $agent->default_header(Accept => "text/*");

(The default_header method was introduced in LWP-5.800.)

Regards, Gisle
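In context, the workaround is just a default header set once on the user agent; a minimal sketch (the variable name $ua stands in for whatever agent the application uses):

```perl
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->default_header(Accept => "text/*");   # added to every request

# The header now goes out with each request this agent makes:
my $res = $ua->get("http://www.parcelforce.com/portal/pw/track");
print $res->status_line, "\n";
```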
Re: [PMX:VIRUS] HTML::Parser and entities
Steve Sapovits [EMAIL PROTECTED] writes:

  Is there a way to get HTML::Parser to leave entities in text alone?

Just use the 'text' argspec and you get the text exactly as it is.

  There is the attr_encoded() method, but that only appears to affect
  attributes. Basically I have code that wants to selectively remove
  some tags but leave others and entities intact.

The hstrip example does exactly this:

  http://search.cpan.org/src/GAAS/HTML-Parser-3.45/eg/hstrip

Regards, Gisle
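The 'text' argspec hands the handler the source text of each event verbatim, so entities pass through undecoded; a minimal sketch using the identity-transform idiom from the HTML::Parser documentation:

```perl
use HTML::Parser;

my $out = "";
my $p = HTML::Parser->new(
    api_version => 3,
    # default_h fires for every event; the "text" argspec passes the
    # raw source text, so nothing is entity-decoded on the way through.
    default_h => [ sub { $out .= shift }, "text" ],
);
$p->parse('<p>fish &amp; chips &copy; 2005</p>');
$p->eof;
# $out is byte-for-byte the original markup, entities intact.
```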
Re: URI module problems
  Please provide the output of these commands:

    perl -MStorable\ 99

  and in the unpacked URI directory run:

    perl Makefile.PL
    make
    perl -Mblib t/storable.t

  t/storable..FAILED tests 1-3
      Failed 3/3 tests, 0.00% okay
  t/urn-isbn..skipped: Needs the Business::ISBN module installed
  t/urn-oid...ok
  Failed Test   Status Wstat  Total  Fail  Failed  List of Failed
  t/storable.t                    3     3 100.00%  1-3
  1 test and 2 subtests skipped.
  Failed 1/31 test scripts, 96.77% okay. 3/466 subtests failed, 99.36% okay.
  *** Error code 29
  make: Fatal error: Command failed for target `test_dynamic'

  The Storable module path was in the PERL5LIB environment variable when
  I tried to compile URI. Is the path.al file dependent on URI finding
  and using the Storable files?

I have no idea what path.al is here. You can probably also just ignore
this test error and then just run 'make install' for URI to get going.
The failure just means that something prevents URI objects from being
stored and retrieved with Storable. This might not matter if the code
you run does not do this.

Regards, Gisle
Re: status_line
The Saltydog [EMAIL PROTECTED] writes:

  This is my simple script:

  ==========================================
  require LWP::UserAgent;
  my $ua = LWP::UserAgent->new;
  $ua->timeout(10);
  $ua->env_proxy;
  my $response = $ua->get('http://search.cpan.org/');
  if ($response->is_success) {
      print $response->content;  # or whatever
  }
  else {
      die $response->status_line;
  }
  ==========================================

  If I type a wrong url instead of www.cpan.org, the script doesn't
  return a status_line... This is the program output:

    HTTP::Response=HASH(0x845d084)->status_line

  Where am I wrong?

I bet your script has quotes around the $response->status_line
expression. The program above does not produce the output you claim.

Regards, Gisle
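The pitfall being hinted at is that method calls do not interpolate inside double-quoted strings; only the variable itself does. A minimal illustration, using the $response from the quoted script:

```perl
# Inside double quotes only $response interpolates (to its stringified
# reference); the "->status_line" part stays literal text:
die "$response->status_line";   # HTTP::Response=HASH(0x...)->status_line

# Call the method outside of quotes instead:
die $response->status_line;     # e.g. "500 Can't connect to ..."
```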
Re: HTML::Parser: how can I reset report_tags to report all tags?
Norbert Kiesel [EMAIL PROTECTED] writes:

  I tried to use ->ignore_tags(()) and ->ignore_tags(qw(none)), but it
  seems that after calling ->report_tags() once it always uses a
  positive tag filter.

Calling ->report_tags() without any arguments should reset the filter.

Regards, Gisle
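A short sketch of setting and then clearing the filter, per the behaviour described in the reply:

```perl
use HTML::Parser;

my $p = HTML::Parser->new(api_version => 3);
$p->report_tags(qw(a img));   # positive filter: only <a> and <img> events
# ... parse something with the filter active ...
$p->report_tags();            # no arguments: reset, report all tags again
```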
Re: parsing bug in HTTP::Message::parse()
Brian Hirt [EMAIL PROTECTED] writes: Any news on this? It's a pretty major bug. I don't see anything wrong when running your test program. What version of LWP are you using? Regards, Gisle
Re: printing the redirections responses
Octavian Rasnita [EMAIL PROTECTED] writes:

  my $response = $ua->request($request);
  print $response->as_string();

  [...]

  The response is the final page, even though there is a redirection
  until this page is returned. Is it possible to get and print that
  redirect HTTP header?

Just use $ua->simple_request() instead of $ua->request() to dispatch the
request.

--Gisle
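A quick sketch of the difference, using the $ua and $request from the quoted script; simple_request() dispatches exactly one request, so a redirect response comes back as-is:

```perl
my $response = $ua->simple_request($request);   # no automatic redirects
if ($response->is_redirect) {
    print $response->status_line, "\n";               # e.g. "302 Found"
    print "Location: ", $response->header("Location"), "\n";
}
print $response->as_string;
```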