Re: 2 Gb limitation

2002-01-11 Thread Ian Abbott

On 10 Jan 2002 at 17:09, Matt Butt wrote:

 I've just tried to download a 3Gb+ file (over a network using HTTP) with
 WGet and it died at exactly 2Gb.  Can this limitation be removed?

In principle, changes could be made to allow wget to be configured
for large file support, by using the appropriate data types (i.e.
'off_t' instead of 'long').

The logging code would be more complicated, as there is no portable
way to handle that data type in a printf-style function; the values
would have to be converted to strings by a bespoke routine and the
converted strings passed to the printf-style function. This would
also slow down the operation of wget a little.
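
As a rough illustration of such a bespoke routine, here is a minimal
sketch, assuming a non-negative off_t value and a caller-supplied
buffer; the function name and interface are hypothetical, not wget's
actual code:

  #include <sys/types.h>  /* off_t */
  #include <stddef.h>     /* size_t */

  /* Hypothetical helper: render a non-negative off_t, such as a byte
     count, as a decimal string so it can be logged through a
     printf-style function with a plain "%s" format.
     Returns a pointer to the first digit inside buf. */
  static const char *
  offt_to_string (off_t value, char *buf, size_t bufsize)
  {
    char *p = buf + bufsize;

    *--p = '\0';
    do
      {
        *--p = (char) ('0' + value % 10);  /* emit least significant digit */
        value /= 10;
      }
    while (value > 0 && p > buf);

    return p;
  }

A logging call would then pass the returned string through the usual
printf-style interface, e.g. something like
logprintf (LOG_VERBOSE, "%s bytes", offt_to_string (len, buf, sizeof buf));
that extra conversion step is the small slowdown mentioned above.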

A version of wget configured for large file support would also be
slower in general than a version not configured for large file
support - at least on a 32-bit machine.

Large file support should probably be added to the TODO list at
least. Quite a few people use wget to download .iso images of
CD-ROMs at the moment; in the future, those same people are
likely to want to use wget to download DVD-ROM images!




Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?

2002-01-11 Thread Picot Chappell

Thanks for your response.  I tried the same command, using your URL, and it
worked fine.  So I took a look at the site I was retrieving for the failed
test.

It's an SSL site (didn't think about it before) and I noticed two things.  The
frame source pages were not downloaded (they were for www.mev.co.uk) and the
links were converted to full URLs,
i.e. <FRAME src="menulayer.cgi"> became
<FRAME src="https://www.someframed.page/menulayer.cgi"> ...

So the content was still reachable, but not really local (this is the original
problem).  I tried it without --convert-links, and the frame source
remained defined as "menulayer.cgi", but menulayer.cgi was not downloaded.

Do you think this might be an issue with framesets and SSL sites?  Or an issue
with framesets and CGI source files?

Thanks again, and will try --no-http-keep-alive at some point.
Picot

Ian Abbott wrote:

 On 10 Jan 2002 at 12:39, Picot Chappell wrote:

  Has anyone solved this issue?  I am downloading a single html page,
  without recursion, and not getting the 'one hop further' that should
  occur for framesets.
 
  I'm using wget 1.8.1, on Solaris 8.  According to the documentation,
  options -p and -k should work to download everything, and from previous
  postings I see mention that -p should go at least one more hop (also
  confirmed in the News items on GNU Wget news).

 Well it seems to work as advertised on my employer's web-site
 (www.mev.co.uk), at least on my machine. Can you provide an example
 which fails on your machine?

  Below is the gist of my call:
 
./wget --ignore-length --html-extension --tries=3 --timeout=60
--cookies=off --page-requisites --convert-links -- www.someframed.page

 That looks okay. I substituted in www.mev.co.uk and got the index
 frameset page, two frames and the images on those frames as
 expected.

 The '--ignore-length' switch slows things down rather a lot though,
 due to keep-alive connections. Adding '--no-http-keep-alive' to the
 above will speed it up.




wget does not parse .netrc properly

2002-01-11 Thread Alexey Aphanasyev

Hello everyone,

I'm using wget compiled from the latest CVS sources (GNU Wget
1.8.1+cvs). I use it to mirror several FTP sites. I keep FTP accounts in
a .netrc file which looks like this:

<quote file=".netrc">
# My ftp accounts

machine host1
login user1
password pwd1

machine host2
login user2
password pwd2
macdef init
#   quote site dirstyle
prompt
binary
cd database

machine host3
login user3
password pwd3
macdef init
prompt
binary
cd download
</quote>

The problem is that when I try to get data from machine host3, wget tries to
log in as anonymous. It looks like it doesn't find host3 in .netrc, while
it works fine with host1 and host2.

If I put the machine host3 entry in the first position in .netrc, then it
works, but in that case it works with neither host1 nor host2.

I guess the macdef directive confuses wget.

The trouble is that Wget 1.6.1 used to work with this .netrc.
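
For context, a macro defined with macdef in .netrc runs until the first
blank line, so a parser has to skip the whole macro body before looking
for the next machine keyword. A minimal sketch of that skipping step,
assuming a simple line-based reader (the function is hypothetical and
not wget's actual parser):

  #include <stdio.h>

  /* Hypothetical fragment of a line-based .netrc reader: once a "macdef"
     token has been seen, consume lines up to the blank line that
     terminates the macro body, so the "machine" entries that follow
     (host3 above) are still recognised. */
  static void
  skip_macdef_body (FILE *fp)
  {
    char line[1024];

    while (fgets (line, sizeof line, fp))
      {
        /* A macdef body ends at the first empty line. */
        if (line[0] == '\n' || (line[0] == '\r' && line[1] == '\n'))
          break;
      }
  }

If the macro body is instead fed through the ordinary token scanner, its
lines can be mistaken for keywords or swallow the entries that follow,
which would match the behaviour described above.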

Best regards,
Alexis



Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?

2002-01-11 Thread Thomas Reinke

 
 Do you think this might be an issue with framesets and ssl sites?  or an issue
 with framesets and cgi source files?

This is not a problem with frames - it IS a problem with SSL.
wget, while it appears to have SSL support, didn't quite
get it right. The internal scheme checks don't treat
https: as an HTTP protocol, and thus don't recurse down
into sub-pages. (wget specifically avoids recursing into
unknown protocols, and https was treated as one of these.)

A patch for this was posted previously.  The patch
is as follows:

--- src/recur.c Wed Dec 19 09:27:29 2001
+++ ../wget-1.8.1.esoft/src/recur.c Sat Dec 29 16:17:40 2001
@@ -437,7 +437,7 @@
      the list.  */
 
   /* 1. Schemes other than HTTP are normally not recursed into. */
-  if (u->scheme != SCHEME_HTTP
+  if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS
       && !(u->scheme == SCHEME_FTP && opt.follow_ftp))
     {
       DEBUGP (("Not following non-HTTP schemes.\n"));
@@ -446,7 +446,7 @@
 
   /* 2. If it is an absolute link and they are not followed, throw it
      out.  */
-  if (u->scheme == SCHEME_HTTP)
+  if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)
     if (opt.relative_only && !upos->link_relative_p)
       {
        DEBUGP (("It doesn't really look like a relative link.\n"));
@@ -534,7 +534,7 @@
       }
 
   /* 8. */
-  if (opt.use_robots && u->scheme == SCHEME_HTTP)
+  if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))
     {
       struct robot_specs *specs = res_get_specs (u->host, u->port);
       if (!specs)


OR, alternatively, simply edit recur.c according to the following
instructions:

  Line 440: change to
     if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS

  Line 449: change to
     if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)

  Line 537: change to
     if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))
 
and that should work better.

Thomas

 
 Thanks again, and will try --no-http-keep-alive at some point.
 Picot
 
 Ian Abbott wrote:
 
  On 10 Jan 2002 at 12:39, Picot Chappell wrote:
 
   Has anyone solved this issue?  I am downloading a single html page,
   without recursion, and not getting the 'one hop further' that should
   occur for framesets.
  
   I'm using wget 1.8.1, on Solaris 8.  According to the documentation,
   options -p and -k should work to download everything, and from previous
   postings I see mention that -p should go at least one more hop (also
   confirmed in the News items on GNU Wget news).
 
  Well it seems to work as advertised on my employer's web-site
  (www.mev.co.uk), at least on my machine. Can you provide an example
  which fails on your machine?
 
   Below is the gist of my call:
  
 ./wget --ignore-length --html-extension --tries=3 --timeout=60
 --cookies=off --page-requisites --convert-links -- www.someframed.page
 
  That looks okay. I substituted in www.mev.co.uk and got the index
  frameset page, two frames and the images on those frames as
  expected.

  The '--ignore-length' switch slows things down rather a lot though,
  due to keep-alive connections. Adding '--no-http-keep-alive' to the
  above will speed it up.

-- 

E-Soft Inc. http://www.e-softinc.com
Publishers of SecuritySpace http://www.securityspace.com
Tel: 1-905-331-2260  Fax: 1-905-331-2504   
Tollfree in North America: 1-800-799-4831



Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?

2002-01-11 Thread Ian Abbott

On 11 Jan 2002 at 10:51, Picot Chappell wrote:

 Thanks for your response.  I tried the same command, using your URL, and it
 worked fine.  So I took a look at the site I was retrieving for the failed
 test.
 
 It's a ssl site (didn't think about it before) and I noticed 2 things.  The
 Frame source pages were not downloaded (they were for www.mev.co.uk) and the
 links were converted to full URLs,
 i.e. <FRAME src="menulayer.cgi"> became
 <FRAME src="https://www.someframed.page/menulayer.cgi"> ...
 
 So the content was still reachable, but not really local (this is the original
 problem).  I tried it without the --convert-links, and the frame source
 remained defined as menulayer.cgi  but menulayer.cgi was not downloaded.
 
 Do you think this might be an issue with framesets and ssl sites?  or an issue
 with framesets and cgi source files?

Do you have SSL support compiled in?

Also it is possible that the .cgi script on the server is checking
HTTP request headers and cookies, doesn't like what it sees and is
returning an error. It is sometimes useful to lie to the server 
about the HTTP user agent using the -U option, e.g.:

-U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)"

or include something similar in the wgetrc file:

useragent = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)

Some log entries would be useful, particularly with the -d option.
You can mask any sensitive bits of the log if you want.




-H suggestion

2002-01-11 Thread Fred Holmes

WGET suggestion

The -H switch/option sets host-spanning.  Please provide a way to specify a 
different limit on recursion levels for files retrieved from foreign hosts.

-r -l0 -H2

for example would allow unlimited recursion levels on the target host, but 
only 2 [additional] levels when a file is being retrieved from a foreign host.
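
As a rough sketch of what this could mean internally, assuming a
queue-based recursive crawl; the structure and names below are
hypothetical and not part of wget:

  /* Hypothetical queue element for recursive retrieval, carrying a
     separate depth counter that only starts counting once the crawl
     has left the start host. */
  struct queue_element {
    const char *url;
    int depth;          /* overall recursion depth, limited by -l */
    int foreign_depth;  /* hops taken since leaving the start host */
  };

  /* Decide whether a child link may be enqueued, given a hypothetical
     foreign-host limit (the 2 in the suggested "-H2" above).
     depth_limit < 0 stands in for unlimited recursion (the -l0 case). */
  static int
  may_enqueue (const struct queue_element *parent, int same_host,
               int depth_limit, int foreign_limit)
  {
    int child_depth = parent->depth + 1;
    int child_foreign = same_host ? parent->foreign_depth
                                  : parent->foreign_depth + 1;

    if (depth_limit >= 0 && child_depth > depth_limit)
      return 0;                       /* past the -l limit */
    if (child_foreign > foreign_limit)
      return 0;                       /* past the suggested foreign limit */
    return 1;
  }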

Second suggestion:

The -i switch provides for a file listing the URLs to be downloaded.

Please provide for a list file for URLs to be avoided when -H is enabled.

Thanks for listening.

And thanks for a marvelous product.

Fred Holmes  [EMAIL PROTECTED]




Suggestion on job size

2002-01-11 Thread Fred Holmes

It would be nice to have some way to limit the total size of any job, and 
have it exit gracefully upon reaching that size, completing the -k -K 
process upon termination so that what one has downloaded is useful.  A 
switch that sets the total size of all downloads, e.g. --total-size=600MB, 
would terminate the run when the total bytes downloaded reached 600 MB, and 
then process the -k -K.  What one had already downloaded would then be properly 
linked for viewing.

Probably more difficult would be a way of terminating the run manually 
(Ctrl-break??), but then being able to run the -k -K process on the 
already-downloaded files.
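
For illustration, a rough sketch of the kind of check this would imply,
under the assumption of a simple sequential retrieval loop; all names
here (download_one, convert_downloaded_links, and so on) are
hypothetical stand-ins, not wget's internals:

  #include <stdio.h>

  /* Stub standing in for an actual HTTP/FTP retrieval; returns bytes fetched. */
  static long long
  download_one (const char *url)
  {
    printf ("downloading %s\n", url);
    return 1000000LL;  /* pretend each file is about 1 MB */
  }

  /* Stub standing in for the -k/-K link conversion pass. */
  static void
  convert_downloaded_links (void)
  {
    printf ("converting links in downloaded files\n");
  }

  /* Sketch of a size-limited retrieval loop: stop starting new downloads
     once the byte budget is spent, then run link conversion anyway so
     that whatever was fetched is still browsable locally. */
  static void
  retrieve_all (const char **urls, int n_urls, long long size_limit)
  {
    long long total_bytes = 0;
    int i;

    for (i = 0; i < n_urls; i++)
      {
        if (size_limit > 0 && total_bytes >= size_limit)
          {
            printf ("size limit reached; stopping gracefully\n");
            break;
          }
        total_bytes += download_one (urls[i]);
      }

    convert_downloaded_links ();
  }

  int
  main (void)
  {
    const char *urls[] = { "http://example.com/a", "http://example.com/b" };
    retrieve_all (urls, 2, 600LL * 1024 * 1024);  /* e.g. a 600 MB budget */
    return 0;
  }

As the reply that follows points out, wget's existing -Q (quota) option
already does much of this; the extra piece in the suggestion is making
sure the conversion pass still runs on an early exit.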

Fred Holmes




Re: Suggestion on job size

2002-01-11 Thread Jens Rösner

Hi Fred!

First, I think this belongs on the normal wget list rather than here, 
as I cannot see a bug.
Apologies to the bug tracers; I am posting to the normal wget list and
cc-ing Fred, 
hope that is ok.

To your first request: -Q (Quota) should do precisely what you want.
I used it with -k and it worked very well.
Or am I missing your point here?

Your second wish is AFAIK not possible now.
Maybe in the future wget could write the record 
of downloaded files to the appropriate directory.
After exiting wget, this file could then be used 
to process all the files mentioned in it.
Just an idea; I would not normally think that 
this is an often-requested option.
HOWEVER: 
-K works (if I understand it correctly) on the fly: as it runs, it
decides whether the server file is newer, whether a previously converted
file exists, and what to do.
So only -k would work after the download, right?

CU
Jens

http://www.JensRoesner.de/wgetgui/

 It would be nice to have some way to limit the total size of any job, and
 have it exit gracefully upon reaching that size, by completing the -k -K
 process upon termination, so that what one has downloaded is useful.  A
 switch that would set the total size of all downloads --total-size=600MB
 would terminate the run when the total bytes downloaded reached 600 MB, and
 process the -k -K.  What one had already downloaded would then be properly
 linked for viewing.
 
 Probably more difficult would be a way of terminating the run manually
 (Ctrl-break??), but then being able to run the -k -K process on the
 already-downloaded files.
 
 Fred Holmes