--base does not consider references to root directory
Consider this example, which is how I noticed the problem:

    wget http://www.mxpx.com/ -r --base=.

Here, I want the entire site to be downloaded with each link pointing to the local file. This works for some links, but it does not take references to the root directory into account, such as this one:

    <a href="/index.php">Home</a>

Here, wget just ignores the --base parameter and leaves the link as /index.php.

I realise that this may seem like a sticky situation, but consider this solution. Let's say that I have a photo album on my personal homepage with the following directory scheme:

    /
    /photos/
    /photos/hawaii
    /photos/concerts

In /photos/concerts/index.html, I have a link to /index.html. When wget parses the HTML, the link could then become ../../index.html. All we need to know is how many directories deep we are.

Would this be an acceptable solution? If so, I'd be glad to write a patch.
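The depth-counting idea in this message can be sketched in a few lines of Python. This is only an illustration of the proposal, not wget code: the function name `localize_root_link` and its parameters are hypothetical, and the sketch assumes the page path is known as a site-relative path.

```python
import posixpath


def localize_root_link(page_path: str, href: str) -> str:
    """Rewrite a root-relative href relative to the page that contains it.

    page_path is the site path of the page, e.g. "/photos/concerts/index.html".
    Both names are hypothetical; they do not come from wget.
    """
    # Leave relative links and protocol-relative links ("//host/x") alone.
    if not href.startswith("/") or href.startswith("//"):
        return href
    # Count the directories between the site root and the page.
    page_dir = posixpath.dirname(page_path)
    depth = len([part for part in page_dir.split("/") if part])
    # One "../" per directory, then the target path without its leading slash.
    return "../" * depth + href.lstrip("/")


print(localize_root_link("/photos/concerts/index.html", "/index.html"))
```

For the photo album layout above, the page sits two directories deep, so the link is prefixed with two `../` segments.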
Re: --base does not consider references to root directory
So you would suggest that when I run

    wget --base=/some/serverdir http://server/serverdir/

a link matching /.* is interpreted as /some/.*, so that a link like /serverdir/ goes back to /some/serverdir, right? I guess this would be OK.

Just one question: if there is a link back to /serverdir/ and the base is something like /my/dir/, shouldn't this also be fetched from inside /my/dir/ and not /my/serverdir/?

Greetings,
Matthias

Josh Williams wrote:
> I realise that this may seem like a sticky situation, but consider this solution. Let's say that I have a photo album on my personal homepage with the following directory scheme:
>
>     /
>     /photos/
>     /photos/hawaii
>     /photos/concerts
>
> In /photos/concerts/index.html, I have a link to /index.html. When wget parses the HTML, the link could then become ../../index.html. All we need to know is how many directories deep we are.
>
> Would this be an acceptable solution? If so, I'd be glad to write a patch.
Re: --base does not consider references to root directory
On 7/14/07, Matthias Vill wrote:
> So you would suggest that when I run
>
>     wget --base=/some/serverdir http://server/serverdir/
>
> a link matching /.* is interpreted as /some/.*, so that a link like /serverdir/ goes back to /some/serverdir, right?

Correct.

> I guess this would be OK. Just one question: if there is a link back to /serverdir/ and the base is something like /my/dir/, shouldn't this also be fetched from inside /my/dir/ and not /my/serverdir/?

Take a look at the directory structure:

    /my/dir
    /my/dir/www.foo.bar
    /my/dir/www.foo.bar/serverdir

Suppose we have a link in /my/dir/www.foo.bar/serverdir like this:

    <a href="/jobs.php">Jobs</a>

This link, if followed locally, would try to fetch a file from the root directory of the operating system, not of the website. It would probably get a 403 or a 404 error. What we want it to look like is this:

    <a href="../jobs.php">Jobs</a>

This method will work no matter what the --base parameter is.
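A small sketch of why the relative form does not depend on where the mirror lives on disk. It assumes local links are resolved the way POSIX paths are joined; `resolve_local` and the second mirror location are made up for the illustration.

```python
import posixpath


def resolve_local(mirror_root: str, page_dir: str, href: str) -> str:
    """Resolve href for a page stored in mirror_root/page_dir on disk.

    Hypothetical helper: posixpath.join restarts from "/" when href is
    absolute, which mimics a root-relative link escaping the mirror.
    """
    return posixpath.normpath(posixpath.join(mirror_root, page_dir, href))


for root in ("/my/dir/www.foo.bar", "/tmp/elsewhere/www.foo.bar"):
    # The relative link stays inside the mirror, wherever the mirror is.
    print(resolve_local(root, "serverdir", "../jobs.php"))
    # The root-relative link points at the filesystem root instead.
    print(resolve_local(root, "serverdir", "/jobs.php"))
```

With either mirror location, `../jobs.php` lands next to `serverdir` inside the host directory, while `/jobs.php` resolves to the filesystem root.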
Re: --base does not consider references to root directory
I think I got your point. All in all, this is still a matter of comparing the first URL against the current URL and counting the common directories from the left. Then you compare that number (a) to the depth of the first URL (b) and add b-a copies of ../ so that you get to the right position inside your base. That way, if you call

    wget -r --base=/somedir /server/otherdir/

a later reference to /server/otherdir/ is correctly recognised as a duplicate of the first one.

I first thought of a different solution, like appending initial-depth-times .. to the base. I admit this is silly.

Now I think this could result in different problems. For example, what should happen with

    wget -r --base=/home/matthias/tmp http://server/with/a/complicated/structure/and/to/many/dirs/a.php

If you now have a link to /index.html, you would try to access some file above /, or am I wrong?

Greetings,
Matthias
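One possible reading of the a/b counting described in this message, sketched in Python. The helper names `dirs` and `climbs_needed` are hypothetical, and the interpretation (b is the directory depth of the first URL, a is the number of leading directories it shares with the link target) is an assumption about what was meant.

```python
def dirs(path: str) -> list:
    """Directory components of a URL path; a trailing file name is dropped."""
    parts = path.split("/")
    return [p for p in parts[:-1] if p]


def climbs_needed(first_path: str, target_path: str) -> int:
    """Return b - a: how many "../" steps lead from the first URL's
    directory up to the deepest directory it shares with the target."""
    first, target = dirs(first_path), dirs(target_path)
    a = 0  # common leading directories
    for x, y in zip(first, target):
        if x != y:
            break
        a += 1
    b = len(first)  # depth of the first URL
    return b - a


print(climbs_needed("/server/otherdir/", "/server/otherdir/"))
print(climbs_needed("/with/a/complicated/structure/and/to/many/dirs/a.php", "/index.html"))
```

Under this reading, a link back to the starting directory needs no climbing at all, while the deep example needs more `../` steps than /home/matthias/tmp has parent directories, which is the concern raised above if the starting directory were mapped directly onto the base.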
Re: --base does not consider references to root directory
On 7/14/07, Matthias Vill wrote:
> I think I got your point:
>
> Now I think this could result in different problems. For example, what should happen with
>
>     wget -r --base=/home/matthias/tmp http://server/with/a/complicated/structure/and/to/many/dirs/a.php
>
> If you now have a link to /index.html, you would try to access some file above /, or am I wrong?

In the case of http://server/with/a/complicated/structure/and/to/many/dirs/a.php, a link to /index.php would look like this:

    <a href="../../../../../../../../index.php">Home</a>

(Assuming I counted it correctly.) It's just a matter of knowing how many directories deep we are, so that we know how many times to concatenate the ../ prefix.
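The count can be checked mechanically rather than by eye. The following sketch uses only the Python standard library (`urllib.parse.urlsplit` and `posixpath.relpath`) and treats the URL path like a POSIX path, which is an assumption made for the illustration rather than a description of wget's internals.

```python
import posixpath
from urllib.parse import urlsplit

page_url = "http://server/with/a/complicated/structure/and/to/many/dirs/a.php"

# Directory that contains the page, taken from the path part of the URL.
page_dir = posixpath.dirname(urlsplit(page_url).path)

# Path from that directory to /index.php at the site root.
relative = posixpath.relpath("/index.php", start=page_dir)

print(relative)
print(relative.count("../"))
```

The page sits eight directories below the site root, so the relative link carries eight `../` segments.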
Re: Two wget patches: min-size/max-size and nc options
Christian Roche wrote:
> Hi there,

Hi!

> please find attached two small patches that could be considered for wget (against revision 2276).
>
> patch-utils changes the file renaming mechanism when the -nc option is in effect. Instead of trying to rename a file to file.1, file.2 etc, it tries prefix-1.suffix, prefix-2.suffix etc, thus preserving the filename extension if any.

This seems reasonable.

> This is necessary to avoid a bug otherwise when the -A option is used: renamed files are rejected because they don't match the required suffix, although they should really be kept.

Regardless of whether this particular approach is taken, this needs to be addressed.

> patch-http provides two new options, --min-size (-s) and --max-size (-M), although the shortcuts could obviously be changed. Non-HTML files that don't fit these constraints (expressed in kB) will simply not be retrieved. This relies on the Content-Length HTTP header and will not work for FTP. This is quite useful when retrieving jpeg images from a site to avoid thumbnails for instance, as explained in the related documentation paragraph.

This seems reasonable as well. We should probably allow the size to be expressed in a variety of other units, though (bytes, megabytes). Also, I'm not keen on spending any of our few remaining small options on this.

Thanks for these!

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
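The renaming issue described for patch-utils can be illustrated with a short Python sketch. It is not the patch itself: `old_style_name` and `numbered_name` are hypothetical helpers, and the suffix check is a simplified stand-in for an accept list such as the one given with -A.

```python
import os.path


def old_style_name(filename: str, n: int) -> str:
    """file.1, file.2, ...: the counter is appended after the extension."""
    return f"{filename}.{n}"


def numbered_name(filename: str, n: int) -> str:
    """prefix-1.suffix, prefix-2.suffix, ...: the extension is preserved."""
    prefix, suffix = os.path.splitext(filename)
    return f"{prefix}-{n}{suffix}"


accept_suffix = ".jpg"  # simplified stand-in for a suffix accept list
for candidate in (old_style_name("photo.jpg", 1), numbered_name("photo.jpg", 1)):
    # Show whether the renamed file would still match the accepted suffix.
    print(candidate, candidate.endswith(accept_suffix))
```

The old-style name no longer ends in the accepted suffix, so a suffix-based filter would reject it; the extension-preserving name still matches.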
Re: --base does not consider references to root directory
Josh Williams wrote:
> Consider this example, which is how I noticed the problem:
>
>     wget http://www.mxpx.com/ -r --base=.
>
> Here, I want the entire site to be downloaded with each link pointing to the local file. This works for some links, but it does not take references to the root directory into account, such as this one:
>
>     <a href="/index.php">Home</a>
>
> Here, wget just ignores the --base parameter and leaves the link as /index.php.
>
> I realise that this may seem like a sticky situation, but consider this solution. Let's say that I have a photo album on my personal homepage with the following directory scheme:
>
>     /
>     /photos/
>     /photos/hawaii
>     /photos/concerts
>
> In /photos/concerts/index.html, I have a link to /index.html. When wget parses the HTML, the link could then become ../../index.html. All we need to know is how many directories deep we are.
>
> Would this be an acceptable solution? If so, I'd be glad to write a patch.

As I mentioned to Josh in IRC, the desired behavior is accomplished with the -k option. The --base option isn't meant to have any effect on the downloaded files; it's intended to be the equivalent of the HTML <base> element's href attribute, and I'd be very, very reluctant to change it away from that meaning.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
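To make the base-URL meaning concrete, here is a small sketch of how a base address is used to resolve links, in the spirit of the HTML base href. It uses Python's `urllib.parse.urljoin` purely as an illustration of that kind of resolution; the base URL and link values are examples, and no claim is made that wget resolves every edge case identically.

```python
from urllib.parse import urljoin

# Example base address, playing the role of a base href value.
base = "http://www.mxpx.com/photos/"

for href in ("concerts/index.html", "../index.php", "/index.php"):
    # Each link is resolved against the base to an absolute URL.
    print(href, "->", urljoin(base, href))
```

In this model the base only determines which absolute URL a link refers to; it says nothing about where files are stored locally or how links inside them are rewritten.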