Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Hrvoje Niksic
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote:
>> More precisely, it doesn't use the file name advertised by the
>> Content-Disposition header.  That is because Wget decides on the file
>> name it will use based on the URL used, *before* the headers are
>> downloaded.  This unfortunate design decision is the cause of all
>> these problems, and will take some work to be undone.
>
> Implementing the "Content-Disposition" header is on the TODO list :
>
> * Honor `Content-Disposition: XXX; filename="FILE"' when creating the
>   file name.  If possible, try not to break `-nc' and friends when
>   doing that.

It is, indeed -- I wrote that entry.  :-)  The problem is that
implementing this is not as easy or straightforward as it sounds.
This is shared by most TODO list items.


How come I get spammed by wget@sunsite.dk ?

2005-11-09 Thread Jean-Marc MOLINA
Hello,

Since I began posting here I have been getting some spam from [EMAIL PROTECTED].
Mostly it sends replies to my posts, although I never subscribed to any
"mailing list". Will unsubscribing from the list stop it?

Thanks, and sorry: I'm not accustomed to mailing lists and don't understand
how I got subscribed,
JM.





Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Jean-Marc MOLINA
Tony Lewis wrote:
> The --convert-links option changes the website path to a local file
> system path. That is, it changes the directory, not the file name.

Thanks, I didn't understand it that way.

> IMO, your suggestion has merit, but it would require wget to maintain
> a list of MIME types and corresponding renaming rules.

Well, it seems implementing the "Content-Type" header has been planned for a
long time, and there are two items about it in the "TODO" document of the wget
distribution.

Maintaining a list of MIME types is not an issue as there are already lists
around :
* "File suffixes and MIME types" at Duke University :
http://www.duke.edu/websrv/file-extensions.html
* "MIME Types" category at Google :
http://www.google.com/Top/Computers/Data_Formats/MIME_Types
* ...

Just a word about how HTTrack handles MIME types and extensions. It has a
powerful "--assume" option that allows users to assign a MIME type to
extensions. For example : "All .php files are PNG images". Everything is
explained on the "Option panel : MIME Types" page at
http://www.httrack.com/html/step9_opt11.html. I think wget could use such an
option.

JM.





Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Jean-Marc MOLINA
Hrvoje Niksic wrote:
> More precisely, it doesn't use the file name advertised by the
> Content-Disposition header.  That is because Wget decides on the file
> name it will use based on the URL used, *before* the headers are
> downloaded.  This unfortunate design decision is the cause of all
> these problems, and will take some work to be undone.

Implementing the "Content-Disposition" header is on the TODO list :

* Honor `Content-Disposition: XXX; filename="FILE"' when creating the
  file name.  If possible, try not to break `-nc' and friends when
  doing that.

JM.





RE: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Tony Lewis
Jean-Marc MOLINA wrote:

> For example if a PNG image is generated using a "gen_png_image.php" PHP
> script, I think wget should be able to download it if the option
> "--page-requisites" is used, because it's part of the page and it's not
> an external resource, get its MIME type, "image/png", and using the
> option "--convert-links" should also rename the script-image to
> "gen_png_image.png".

The --convert-links option changes the website path to a local file system
path. That is, it changes the directory, not the file name. IMO, your
suggestion has merit, but it would require wget to maintain a list of MIME
types and corresponding renaming rules.
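
To make the idea of "a list of MIME types and corresponding renaming rules"
concrete, here is a minimal sketch in Python. It is not Wget code; the table
RENAME_RULES and the function rename_for_type are hypothetical names chosen
for illustration, and the rules listed are only a small sample.

```python
import posixpath

# Hypothetical renaming rules (an assumption, not a Wget feature):
# MIME type -> preferred file suffix.
RENAME_RULES = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "text/html": ".html",
}


def rename_for_type(file_name, content_type):
    """Return file_name with its suffix replaced to match content_type."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    suffix = RENAME_RULES.get(mime)
    if suffix is None:
        return file_name  # unknown type: keep the name unchanged
    stem, old_suffix = posixpath.splitext(file_name)
    if old_suffix.lower() == suffix:
        return file_name  # suffix already matches the type
    return stem + suffix
```

With such a table, rename_for_type("gen_png_image.php", "image/png") gives
"gen_png_image.png", while a name whose type is not in the table is left alone.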

Tony




Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Hrvoje Niksic
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes:

> As I don't know anything about wget sources, I can't tell how it
> works internally, but I guess it doesn't check the MIME types of
> resources linked from the "src" attribute of "img" elements. And that
> would be a bug... And I think some kind of RFC or spec should confirm it.

More precisely, it doesn't use the file name advertised by the
Content-Disposition header.  That is because Wget decides on the file
name it will use based on the URL used, *before* the headers are
downloaded.  This unfortunate design decision is the cause of all
these problems, and will take some work to be undone.
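
The ordering problem can be illustrated with a small Python sketch. This is
not how Wget is written (Wget is a C program); name_from_url and
name_from_headers are hypothetical helpers, and the example only assumes that
the Content-Disposition value is available as a string once the headers have
been read.

```python
import posixpath
from email.message import Message
from urllib.parse import urlsplit


def name_from_url(url):
    """File name derived from the URL alone, i.e. before any headers arrive."""
    return posixpath.basename(urlsplit(url).path) or "index.html"


def name_from_headers(url, content_disposition=None):
    """Prefer the name advertised by Content-Disposition, else use the URL."""
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        advertised = msg.get_filename()
        if advertised:
            # Keep only the last path component so the server cannot
            # choose a directory on the local file system.
            advertised = posixpath.basename(advertised.replace("\\", "/"))
            if advertised:
                return advertised
    return name_from_url(url)
```

The first function needs nothing but the URL, which is why the name can be
fixed early; the second can only run after the response headers are known,
which is the step that would have to move.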


Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Jean-Marc MOLINA
Gavin Sherlock wrote:
> i.e. the image is generated on the fly from a script, which then
> essentially prints the image back to the browser with the correct
> mime type.  While this is a non-standard way to include an image on a
> page, the --page-requisites are not fulfilled when retrieving this
> web page.

I don't think you can consider this a "non-standard way". I'm sure there's a
whole paragraph in an RFC (HTML 4.01 spec) about properly dealing with URIs,
linked resources and MIME types. For example if a PNG image is generated
using a "gen_png_image.php" PHP script, I think wget should be able to
download it if the option "--page-requisites" is used, because it's part of
the page and it's not an external resource, get its MIME type, "image/png",
and using the option "--convert-links" should also rename the script-image
to "gen_png_image.png".

I tried the "--page-requisites" option and got my test page, at
http://jmmolina.free.fr/t_39638/, perfectly archived. The original names are
kept and the page is 100% browsable offline. The script name is still
"gen_png_image.php". Then I used the "--convert-links" option to see if the
script was renamed to a PNG image; it wasn't.

To compare this behaviour with HTTrack, I tried to archive the same page
with it. By default it converted the PHP script to an HTML page. That's
logical because HTTrack has some default ext/MIME mappings. So I removed the
".php to text/html" mapping and got a nice PNG image instead. I don't really
know how to force it not to rename the script but it doesn't really matter.

As I don't know anything about wget sources, I can't tell how it works
internally, but I guess it doesn't check the MIME types of resources linked
from the "src" attribute of "img" elements. And that would be a bug... And I
think some kind of RFC or spec should confirm it.

JM.