"Andrzej " <[EMAIL PROTECTED]> writes:

> It's not the end of troubles though! 
> It works correctly *only* for the first time! 
> When I (or cron) run the same mirroring commands again over already 
> mirrored files to renew the mirror, then the correctly converted link of 
> the gif file (on the main mirror web page):
> http://mineraly.feedle.com/Gify/ChemFan.gif
> is exchanged to the incorrect one:
> http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif

The problem is that Wget is re-converting files that it decided not to
download again because of time-stamping.  For example:

1st time:
URL:  http://znik.wbc.lublin.pl/Mineraly/
link: <img src="http://znik.wbc.lublin.pl/ChemFan/Gify/ChemFan.gif">

Since the image is downloaded to Gify/ChemFan.gif, this is converted
to:
      <img src="Gify/ChemFan.gif">

2nd time:
URL:  http://znik.wbc.lublin.pl/Mineraly/      (using local copy of that URL)
link: <img src="Gify/ChemFan.gif">

Since no such image is downloaded, Wget converts the link back to an
absolute one.  Merging "http://znik.wbc.lublin.pl/Mineraly/" with
"Gify/ChemFan.gif" results in the totally bogus
"http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif" that you're
seeing.
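
To make the merge step concrete, here is a minimal sketch in C.  It is
not Wget's actual merging routine; merge_relative is a hypothetical
helper that only handles a base URL with a path component and a plain
relative link (no "../", no query string, no scheme).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for URL merging; NOT Wget's own code.
   Keeps the base up to and including its last '/', then appends the
   relative link.  Assumes the base has a path component.
   Returns 0 on success, -1 if the result does not fit in out. */
int merge_relative(const char *base, const char *link,
                   char *out, size_t outsz)
{
    assert(base != NULL && link != NULL && out != NULL);

    /* Directory part of the base: everything through the last '/'. */
    const char *slash = strrchr(base, '/');
    size_t dirlen = slash ? (size_t)(slash - base) + 1 : 0;

    if (dirlen + strlen(link) + 1 > outsz)
        return -1;

    memcpy(out, base, dirlen);
    strcpy(out + dirlen, link);
    return 0;
}
```

Called with the base "http://znik.wbc.lublin.pl/Mineraly/" and the
already-converted link "Gify/ChemFan.gif", this yields the same bogus
URL as in the 2nd run: the relative link is resolved against the page
it sits on, not against the location the image originally came from.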

That explains the mechanics of the bug, but not what to do about it.
There are two solutions:

1. If an HTML file is not re-downloaded because of time-stamping, it
   should not be re-converted because (since the file hasn't changed)
   there is no reason to do so.  I'm trying to think of a scenario
   where this would break things, but I can't come up with any.

2. If --backup-converted is in use (which it is in your case), link
   conversion could read the pristine ".orig" file and write it to the
   resulting HTML.  This is a bit more complex, but might help if
   solution #1 turns out to break some scenarios.

Here is a patch that implements #1.  (It applies to the CVS source,
but it's easy enough to manually apply it to the source of 1.9.1.)
With that patch the mirror seems correct in the 2nd run.  Please let
me know if it works for you.

Index: src/http.c
===================================================================
RCS file: /pack/anoncvs/wget/src/http.c,v
retrieving revision 1.173
diff -u -r1.173 http.c
--- src/http.c  2005/04/28 13:56:31     1.173
+++ src/http.c  2005/05/02 14:58:53
@@ -2318,6 +2318,11 @@
                             local_filename);
                  free_hstat (&hstat);
                  xfree_null (dummy);
+                 /* The file is the same; assume that the links have
+                    already been converted.  Otherwise we run the
+                    risk of converting links twice, which is
+                    wrong.  */
+                 *dt |= DT_DISABLE_CONVERSION;
                  return RETROK;
                }
              else if (tml >= tmr)
Index: src/retr.c
===================================================================
RCS file: /pack/anoncvs/wget/src/retr.c,v
retrieving revision 1.95
diff -u -r1.95 retr.c
--- src/retr.c  2005/04/16 20:12:43     1.95
+++ src/retr.c  2005/05/02 14:58:55
@@ -761,7 +761,7 @@
          register_download (u->url, local_file);
          if (redirection_count && 0 != strcmp (origurl, u->url))
            register_redirection (origurl, u->url);
-         if (*dt & TEXTHTML)
+         if ((*dt & TEXTHTML) && !(*dt & DT_DISABLE_CONVERSION))
            register_html (u->url, local_file);
        }
     }
Index: src/wget.h
===================================================================
RCS file: /pack/anoncvs/wget/src/wget.h,v
retrieving revision 1.57
diff -u -r1.57 wget.h
--- src/wget.h  2005/04/27 21:08:40     1.57
+++ src/wget.h  2005/05/02 14:58:55
@@ -233,7 +233,8 @@
   HEAD_ONLY            = 0x0004,       /* only send the HEAD request */
   SEND_NOCACHE         = 0x0008,       /* send Pragma: no-cache directive */
   ACCEPTRANGES         = 0x0010,       /* Accept-ranges header was found */
-  ADDED_HTML_EXTENSION = 0x0020         /* added ".html" extension due to -E */
+  ADDED_HTML_EXTENSION = 0x0020,       /* added ".html" extension due to -E */
+  DT_DISABLE_CONVERSION = 0x0040       /* disable link conversion */
 };
 
 /* Universal error type -- used almost everywhere.  Error reporting of
