--base does not consider references to root directory

2007-07-14 Thread Josh Williams

Consider this example, which happens to be how I realised this problem:

wget http://www.mxpx.com/ -r --base=.

Here, I want the entire site to be downloaded with each link pointing
to the local file. This works for some links, but it does not take
references to the root directory into account, such as this:

<a href=/index.php>Home</a>

Here, wget just ignores the --base parameter and leaves the link as
/index.php.

I realise that this may seem like a sticky situation, but consider
this solution: Let's say that I have a photo album on my personal
homepage with the following directory scheme:

/
/photos/
/photos/hawaii
/photos/concerts

In /photos/concerts/index.html, I have a link to /index.html. When
wget parses the html, it could then become: ../../index.html. All we
need to know is how many directories deep we are.
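
A minimal sketch of that depth-counting idea, in Python rather than wget's
own C. The function name rewrite_root_link is hypothetical and this is not
how wget is implemented; it assumes the page path is given relative to the
site root and that the link is a plain root-relative path (no scheme, no
query string).

```python
from posixpath import dirname


def rewrite_root_link(page_path: str, link: str) -> str:
    """Turn a root-relative link into one relative to page_path's directory."""
    if not link.startswith("/"):
        return link  # already relative; leave it untouched
    # Count the directories between the site root and the page.
    page_dir = dirname(page_path)
    depth = len([part for part in page_dir.split("/") if part])
    # Climb out of each directory, then descend from the site root.
    return "../" * depth + link.lstrip("/")


print(rewrite_root_link("/photos/concerts/index.html", "/index.html"))
# ../../index.html
```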

Would this be an acceptable solution? If so, I'd be glad to write a patch.


Re: --base does not consider references to root directory

2007-07-14 Thread Matthias Vill
So you would suggest handling it this way: when I use
wget --base=/some/serverdir http://server/serverdir/
then /.* will be interpreted as /some/.*, so a link like
/serverdir/ would go back to /some/serverdir, right?

I guess this would be OK. Just one question: if there is a link back to
/serverdir/ and base is something like /my/dir/, shouldn't this also be
fetched from inside /my/dir/ and not /my/serverdir/?

Greetings

Matthias

Josh Williams wrote:
> I realise that this may seem like a sticky situation, but consider
> this solution: Let's say that I have a photo album on my personal
> homepage with the following directory scheme:
>
> /
> /photos/
> /photos/hawaii
> /photos/concerts
>
> In /photos/concerts/index.html, I have a link to /index.html. When
> wget parses the html, it could then become: ../../index.html. All we
> need to know is how many directories deep we are.
>
> Would this be an acceptable solution? If so, I'd be glad to write a patch.
 


Re: --base does not consider references to root directory

2007-07-14 Thread Josh Williams

On 7/14/07, Matthias Vill [EMAIL PROTECTED] wrote:

> So you would suggest handling it this way: when I use
> wget --base=/some/serverdir http://server/serverdir/
> then /.* will be interpreted as /some/.*, so a link like
> /serverdir/ would go back to /some/serverdir, right?


Correct.


> I guess this would be OK. Just one question: if there is a link back to
> /serverdir/ and base is something like /my/dir/, shouldn't this also be
> fetched from inside /my/dir/ and not /my/serverdir/?


Take a look at the directory structure:

/my/dir
/my/dir/www.foo.bar
/my/dir/www.foo.bar/serverdir

Suppose we have a link in /my/dir/www.foo.bar/serverdir like this:

<a href=/jobs.php>Jobs</a>

This link (if called locally) would try to fetch a file on the root
directory of the operating system, not the website. It would probably
get a 403 or a 404 error. What we would want it to look like is this:

<a href=../jobs.php>Jobs</a>

This method will work no matter what the --base parameter is.


Re: --base does not consider references to root directory

2007-07-14 Thread Matthias Vill
I think I got your point:

All in all, this is still a matter of comparing the first URL against the
current URL and counting the directories they have in common from the left.
Then you compare that number (a) to the depth of the first URL (b) and add
b-a times ../ so you get to the right position inside your base.
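
One way to read this counting is as a common-prefix computation: drop the
leading directories two paths share, then climb out with one ../ per
directory that remains. As a rough illustration (not wget code), Python's
posixpath.relpath does this for absolute paths. The helper name
relative_link is hypothetical, and both paths are assumed to be given
relative to the site root.

```python
import posixpath


def relative_link(current_page: str, target: str) -> str:
    """Express target relative to the directory that holds current_page."""
    start = posixpath.dirname(current_page)
    # relpath strips the shared leading directories and adds one ".."
    # for each directory left over on the start side.
    return posixpath.relpath(target, start=start)


print(relative_link("/serverdir/index.html", "/jobs.php"))
# ../jobs.php
print(relative_link("/photos/concerts/index.html", "/photos/hawaii/index.html"))
# ../hawaii/index.html
```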

That way, if you call wget -r --base=/somedir /server/otherdir/, a
later reference to /server/otherdir/ is correctly found as a duplicate of
the first one.

I first thought of a different solution, like appending
initial-depth-times .. to base. I admit this is silly.

Now I think this could result in different problems, like what should
happen with wget -r --base=/home/matthias/tmp
http://server/with/a/complicated/structure/and/to/many/dirs/a.php

If you now have a link to /index.html, you would try to access some
file above /, or am I wrong?

Greetings

Matthias


Re: --base does not consider references to root directory

2007-07-14 Thread Josh Williams

On 7/14/07, Matthias Vill [EMAIL PROTECTED] wrote:

> I think I got your point:
> Now I think this could result in different problems, like what should
> happen with wget -r --base=/home/matthias/tmp
> http://server/with/a/complicated/structure/and/to/many/dirs/a.php
>
> If you now have a link to /index.html, you would try to access some
> file above /, or am I wrong?


In the case of 
http://server/with/a/complicated/structure/and/to/many/dirs/a.php,
a link to /index.php would look like this:

<a href=../../../../../../../../index.php>Home</a>

(Assuming I counted it correctly.) It's just a matter of knowing how
many directories deep we are, so we know how many times to concatenate
the ../ prefix.


Re: Two wget patches: min-size/max-size and nc options

2007-07-14 Thread Micah Cowan

Christian Roche wrote:
> Hi there,

Hi!

> please find attached two small patches that could be
> considered for wget (against revision 2276).
>
> patch-utils changes the file renaming mechanism when
> the -nc option is in effect.  Instead of trying to
> rename a file to file.1, file.2 etc, it tries
> prefix-1.suffix, prefix-2.suffix etc, thus preserving
> the filename extension if any.

This seems reasonable.

> This is necessary to
> avoid a bug otherwise when the -A option is used:
> renamed files are rejected because they don't match
> the required suffix, although they should really be
> kept.

Regardless of whether this particular approach is taken, this needs to
be addressed.

> patch-http provides two new options, --min-size (-s)
> and --max-size (-M), although the shortcuts could
> obviously be changed.  Non-HTML files that don't fit
> these constraints (expressed in kB) will simply not be
> retrieved.  This relies on the Content-Length HTTP
> header and will not work for FTP. This is quite useful
> when retrieving jpeg images from a site to avoid
> thumbnails for instance, as explained in the related
> documentation paragraph.

This seems reasonable as well. We should probably allow for it to be
expressed in a variety of other units, though (bytes, megabytes). Also,
I'm not keen on spending any of our few remaining small options on this.

Thanks for these!

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: --base does not consider references to root directory

2007-07-14 Thread Micah Cowan

Josh Williams wrote:
> Consider this example, which happens to be how I realised this problem:
>
> wget http://www.mxpx.com/ -r --base=.
>
> Here, I want the entire site to be downloaded with each link pointing
> to the local file. This works for some links, but it does not take
> references to the root directory into account, such as this:
>
> <a href=/index.php>Home</a>
>
> Here, wget just ignores the --base parameter and leaves the link as
> /index.php.
>
> I realise that this may seem like a sticky situation, but consider
> this solution: Let's say that I have a photo album on my personal
> homepage with the following directory scheme:
>
> /
> /photos/
> /photos/hawaii
> /photos/concerts
>
> In /photos/concerts/index.html, I have a link to /index.html. When
> wget parses the html, it could then become: ../../index.html. All we
> need to know is how many directories deep we are.
>
> Would this be an acceptable solution? If so, I'd be glad to write a patch.

As I mentioned to Josh in IRC, the desired behavior is accomplished with
the -k (--convert-links) option.

The --base option isn't meant to have any effect on the downloaded files
or anything; it's intended to be the equivalent of the HTML <base/>
element's href attribute, and I'd be very, very reluctant to change it
away from that meaning.
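
To illustrate what "equivalent of the <base/> href" means in practice,
here is a small sketch using Python's urllib.parse.urljoin, which follows
the same standard URL resolution rules a browser applies with a base URL.
It is only an analogy for how a base URL is used to resolve links, not
wget's code, and the server name and file names are made up for the
example.

```python
from urllib.parse import urljoin

# Hypothetical base URL, playing the role of <base href="..."> or --base.
base = "http://server/serverdir/"

# A relative link is resolved against the base directory.
print(urljoin(base, "photo.html"))
# http://server/serverdir/photo.html

# A root-relative link keeps only the scheme and host of the base.
print(urljoin(base, "/index.php"))
# http://server/index.php
```

In both cases the base only decides which remote URL a link refers to; it
says nothing about where files end up on disk.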

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
