Wget converts links correctly *only* for the first time [was Re: links conversion; non-existent index.html].

2005-05-02 Thread Andrzej
> Yup. So I assume that the problem you see is not that of wget mirroring, but
> a combination of saving to a custom dir (with --cut-dirs and the like) and
> conversion of the links. Obviously, the link to
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html which would be
> correct for a standard "wget -m URL" was carried over while the custom link
> to http://mineraly.feedle.com/Ftp/UpLoad/index.html was not created.
> My test with wget 1.5 just was a simple "wget15 -m -np URL" and it worked. 
> So maybe the convert/rename problem/bug was solved with 1.9.1
> This would also explain the "missing" gif file, I think.

It's not the end of troubles though! 
It works correctly *only* for the first time! 
When I (or cron) run the same mirroring commands again over the already 
mirrored files to refresh the mirror, the correctly converted link to 
the gif file (on the main mirror web page):
http://mineraly.feedle.com/Gify/ChemFan.gif
is replaced by the incorrect one:
http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif

You may see it in any mirror of the site.

Source:
http://znik.wbc.lublin.pl/Mineraly/

The two mirrors:
http://mineraly.feedle.com/
http://mineraly.pg.gda.pl/

So it seems that there is a bug in version 1.9.1 after all. :(

These are the commands I run under wget 1.9.1 to mirror the whole 
Mineraly site:

cd $HOME/web/mineraly
/usr/local/bin/wget -m -nv -k -K -E -nH --cut-dirs=1 -np -t 1000 \
  -D wbc.lublin.pl -o $HOME/logiwget/logmineraly -p \
  http://znik.wbc.lublin.pl/Mineraly/ && \
cd $HOME/web/mineraly/arch && \
/usr/local/bin/wget -m -nv -k -K -E -nH -np --cut-dirs=2 -t 1000 \
  -D lists.man.lodz.pl -o $HOME/logiwget/logmineralyarchive -p \
  http://lists.man.lodz.pl/pipermail/mineraly/index.html && \
cp $HOME/web/domirrora/mineraly/Archiwum/index.html \
  $HOME/web/mineraly/archiwum1/ && \
cd mineralyftp && \
/usr/local/bin/wget -m -nv -k -K -E -nH -np --cut-dirs=4 -t 1000 \
  -D ftp.man.lodz.pl --follow-ftp -o $HOME/logiwget/logmineralyarchiveftp -p \
  ftp://ftp.man.lodz.pl/pub/doc/LISTY-DYSKUSYJNE/MINERALY/ && \
cd .. && \
perl -pi -e 's{\Qftp://ftp.man.lodz.pl/pub/doc/LISTY-DYSKUSYJNE/MINERALY\E}{./mineralyftp/}g' index.html
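
To notice right after a cron run whether a refresh has turned local links back
into links to the source host, the mirrored pages can be scanned for absolute
URLs on that host. This is only a minimal sketch: the helper name
find_unconverted is made up for illustration, it assumes a grep that supports
-r, -o and --include (GNU and BSD grep do), and the dots in the host name are
treated as regex wildcards, which is good enough for a quick check.

```shell
#!/bin/sh
# find_unconverted MIRROR_DIR SOURCE_HOST
# Print href/src attributes in mirrored *.html files that still point at
# SOURCE_HOST, i.e. links that were not converted to local ones.
# Hypothetical helper, not part of wget. The *.orig backups written by -K
# are skipped because they do not match *.html.
find_unconverted() {
    mirror_dir=$1
    source_host=$2
    # -r: recurse, -i: ignore case, -o: print only the matching part
    grep -rioE --include='*.html' \
        "(href|src)=\"https?://${source_host}/[^\"]*\"" "$mirror_dir"
}
```

For example, find_unconverted $HOME/web/mineraly znik.wbc.lublin.pl would list
every remaining link to the source site; grep exits with a non-zero status
when there is none.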

There is yet another problem, with another link.
As you can see above, I have to mirror this site with 3 wget sessions 
in order to have a complete mirror, so I have to correct the links 
in two places. In one I do it by copying one file (with a manually 
edited link) over the other, and in the other I do it with a perl 
one-liner.
However, again, it works correctly only the first time.
On that mirror page:
http://mineraly.pg.gda.pl/arch/
(compare it with the mirrored source:
http://lists.man.lodz.pl/pipermail/mineraly/ )
there should be a link:
http://mineraly.pg.gda.pl/arch/mineralyftp/
(and it was after the first run of the above quoted commands)
but now instead of that link there is an incorrect one:
http://lists.man.lodz.pl/pipermail/mineraly/mineralyftp/

The same error is in the other mirror:
http://mineraly.feedle.com/arch/
there is an incorrect link:
http://lists.man.lodz.pl/pipermail/mineraly/mineralyftp/
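
One plausible explanation (an assumption, not confirmed anywhere in this
thread) is that on later runs -k treats the hand-edited ./mineralyftp/ link as
a link to a file it did not download in that session and rewrites it as an
absolute URL on the source host. If so, a possible workaround is to make the
fix-up step repair both forms of the link after every run. A minimal sketch;
the function name fix_ftp_link is hypothetical and the two URLs are the ones
quoted above:

```shell
#!/bin/sh
# fix_ftp_link FILE
# Point both the original FTP URL and the re-converted absolute URL back
# at the local ./mineralyftp/ directory. Running it repeatedly is harmless.
fix_ftp_link() {
    file=$1
    tmp=$(mktemp) || return 1
    # "/*" swallows any trailing slashes so the result never ends in "//".
    sed -e 's|ftp://ftp\.man\.lodz\.pl/pub/doc/LISTY-DYSKUSYJNE/MINERALY/*|./mineralyftp/|g' \
        -e 's|http://lists\.man\.lodz\.pl/pipermail/mineraly/mineralyftp/*|./mineralyftp/|g' \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

It would be called as fix_ftp_link index.html in place of the perl one-liner.
This only hides the symptom; it does not explain why the link changes.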

Can someone solve those two mysteries?

a. 


Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"
> The problem was that that link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> instead of being properly converted to:
> http://mineraly.feedle.com/Ftp/UpLoad/
Or, in fact, wget's default:
http://mineraly.feedle.com/Ftp/UpLoad/index.html

> was left like this on the main mirror page:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> and hence while clicking on it:
> "Not Found
> The requested URL /Mineraly/Ftp/UpLoad/index.html was not found on this 
> server."

Yup. So I assume that the problem you see is not that of wget mirroring, but
a combination of saving to a custom dir (with --cut-dirs and the like) and
conversion of the links. Obviously, the link to
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html which would be
correct for a standard "wget -m URL" was carried over while the custom link
to http://mineraly.feedle.com/Ftp/UpLoad/index.html was not created.
My test with wget 1.5 just was a simple "wget15 -m -np URL" and it worked. 
So maybe the convert/rename problem/bug was solved with 1.9.1
This would also explain the "missing" gif file, I think.

Jens





Re: links conversion; non-existent index.html

2005-05-01 Thread Hrvoje Niksic
"Jens Rösner" <[EMAIL PROTECTED]> writes:

>> Well, if wget "has to" put index.html in such situations then wget is not 
>> suitable for mirroring such sites, 
> What exactly do you mean? It seems to work for me, e.g. index.html looks
> like the apache-generated directory listing. When mirroring, index.html will
> be re-written if/when it has changed on the server since the last mirroring.

I agree.  Wget's handling of an empty trailing path element
(e.g. "http://server/dir1/dir2/") is not perfect, but in this case I
think it does the right thing.


Re: links conversion; non-existent index.html

2005-05-01 Thread Hrvoje Niksic
"Andrzej" <[EMAIL PROTECTED]> writes:

> Yes it works for me as well when I already mirrored it with 1.9.1
> version. Only then the two problems I described before
> disappeared. So it was the fault of the old 1.8.1 version.

Many bugs have been fixed from 1.8.1 to 1.9.1.  It is always a good
idea to try the latest (stable) version first, it saves a lot of time
for both you and the people who respond to bug reports.
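
A quick way to check which version a cron job actually runs is to compare the
reported version against a minimum. The sketch below is an illustration only:
version_ge is a made-up helper, it relies on "sort -V" (version sort, as found
in GNU coreutils), and it assumes the first line of "wget --version" looks
like "GNU Wget 1.9.1".

```shell
#!/bin/sh
# version_ge A B: succeed if version A is greater than or equal to B.
# A plain string comparison would rank 1.10 below 1.9.1, so the smaller
# of the two is picked with a version sort instead.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: warn when the wget found in PATH is older than 1.9.1.
if command -v wget >/dev/null 2>&1; then
    installed=$(wget --version 2>/dev/null | head -n 1 | awk '{print $3}')
    version_ge "$installed" 1.9.1 ||
        echo "wget '$installed' looks older than 1.9.1" >&2
fi
```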

It is really regrettable that the still-in-use Debian "stable" ships
with such an ancient Wget.


Re: links conversion; non-existent index.html

2005-05-01 Thread Andrzej
> Which link? The non-working one on your incorrect mirror or the working one
> on my correct mirror on my HDD?

The non-working one on my mirror.

> No need to get snappy, Andrzej.

You're right, I am *really* sorry!

> > The problem is 
> > solved though by running the 1.9.1 wget version.
> I still am wondering, because even wget 1.5 correctly generates the
> index.html from the server output, when called on my local box.
> I really do not know what is happening on your remote machine, but my wget
> 1.5 is able to mirror the site. It creates the
> Mineraly/Ftp/UpLoad/index.html file and the correct link to it. 
> I understand that it is not what you want (having an index.html), but wget
> 1.5 creates a working mirror - as it is supposed to do.

No, I don't mind having index.html as long as there is a correct content 
in it. :)
The problem was that that link:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
instead of being properly converted to:
http://mineraly.feedle.com/Ftp/UpLoad/
was left like this on the main mirror page:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
and hence while clicking on it:
"Not Found
The requested URL /Mineraly/Ftp/UpLoad/index.html was not found on this 
server."

a.

PS. Still not everything is perfect. I think I will finally delete 
everything and start from scratch. :/


Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"
> > Wget saves a mirror to your harddisk. Therefore, it cannot rely on an apache
> > server generating a directory listing. Thus, it created an index.html as
> 
> Apparently you have not tried to open that link, 
Which link? The non-working one on your incorrect mirror or the working one
on my correct mirror on my HDD?

> got it now?
No need to get snappy, Andrzej.

From your other mail:
> No, you did not understand. I run wget on remote machines. 
Ah! Sorry, missed that.

> The problem is 
> solved though by running the 1.9.1 wget version.
I still am wondering, because even wget 1.5 correctly generates the
index.html from the server output, when called on my local box.
I really do not know what is happening on your remote machine, but my wget
1.5 is able to mirror the site. It creates the
Mineraly/Ftp/UpLoad/index.html file and the correct link to it. 
I understand that it is not what you want (having an index.html), but wget
1.5 creates a working mirror - as it is supposed to do.

CU
Jens







Re: links conversion; non-existent index.html

2005-05-01 Thread Andrzej
> Wget saves a mirror to your harddisk. Therefore, it cannot rely on an apache
> server generating a directory listing. Thus, it created an index.html as

Apparently you have not tried to open that link, otherwise you would have 
noticed that there was an error and that was why I was complaining, got 
it now?

> Tony Lewis explained. Now, _you_ uploaded (If I understood correctly) the
> copy from your HDD but did not save the index.html. Otherwise it would be
> there and it would work.

Not when downloading with version 1.8.1 or older.

a.


RE: links conversion; non-existent index.html

2005-05-01 Thread Post, Mark K
Probably because you're the only one that thinks it is a problem, instead of 
the way it needs to function?  Nah, that couldn't be it.


Mark Post

-----Original Message-----
From: Andrzej Kasperowicz [mailto:[EMAIL PROTECTED] 
Sent: Sunday, May 01, 2005 2:54 PM
To: Jens Rösner; wget@sunsite.dk
Subject: Re: links conversion; non-existent index.html


-snip-
> You "expect"??

Yes, of course. Why are you so surprised?

a.


Re: links conversion; non-existent index.html

2005-05-01 Thread Andrzej
> IMO, this is not correct. index.html will include the info the directory
> listing contains at the point of download.
> This works for me with znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ as well -
> what seemed to be  problem according to your other post.

Yes it works for me as well when I already mirrored it with 1.9.1 
version. Only then the two problems I described before disappeared. So it 
was the fault of the old 1.8.1 version.  

> What exactly do you mean? It seems to work for me, e.g. index.html looks
> like the apache-generated directory listing. When mirroring, index.html will
> be re-written if/when it has changed on the server since the last mirroring.

Try it with ver. 1.8.1 or older and it would not work.
 
> > and I expect that problem to be 
> > corrected in future wget versions.
> You "expect"??

Yes, of course. Why are you so surprised?

a.


Re: links conversion; non-existent index.html

2005-05-01 Thread Ulf Harnhammar
The following Unix command removes all files called "index.html"
in the current directory and all directories below it:

find . -type f -name 'index.html' -print -exec rm -f {} \;

That might be one way to solve your problem.
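
Because the command above deletes files immediately, it may be worth listing
the matches first. A minimal sketch, with made-up function names and an
optional directory argument that defaults to the current directory. Keep in
mind that in a mirror made by wget the index.html files hold the saved
directory listings, so removing them can break converted links that point to
them.

```shell
#!/bin/sh
# Dry run: only print the index.html files under a directory.
list_index_files() {
    find "${1:-.}" -type f -name 'index.html' -print
}

# Delete them. "-exec rm -f {} +" passes many file names to each rm
# invocation instead of starting one rm per file.
remove_index_files() {
    find "${1:-.}" -type f -name 'index.html' -exec rm -f {} +
}
```

Running list_index_files first and reading the output before calling
remove_index_files gives a chance to spot files that should be kept.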

// Ulf Härnhammar



Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"
Do I understand correctly that the mirror at feedle is created by you and
wget?

> > Yes, because this is in the HTML file itself:
> > "http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html"
> > It does not work in a browser, so why should it work in wget?
> It works in the browser:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> There is no index.html and the content of the directory is displayed.
I assume I was confused by the different sites you wrote about. I was sure
that both included the same link to ...index.html and the same gif-address.

> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> The link was not converted properly, it should be:
> http://mineraly.feedle.com/Ftp/UpLoad/
> and it should be without any index.html, because there is none in the 
> original.
Wget saves a mirror to your harddisk. Therefore, it cannot rely on an apache
server generating a directory listing. Thus, it created an index.html as
Tony Lewis explained. Now, _you_ uploaded (If I understood correctly) the
copy from your HDD but did not save the index.html. Otherwise it would be
there and it would work.

Jens



Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"

> I know! But that is intentionally left without index.html. It should 
> display the content of the directory, and I want wget to mirror it 
> correctly.
> A similar situation is here:
> http://chemfan.pl.feedle.com/arch/chemfanftp/
> it is left intentionally without index.html so that people could download 
> these archives. 
Is something wrong with my browser?
This does not look like a simple directory listing; this file has formatting and
even a background image. http://chemfan.pl.feedle.com/arch/chemfanftp/ looks
the same as http://chemfan.pl.feedle.com/arch/chemfanftp/index.html in my
Mozilla and wget downloads it correctly.

> If wget puts an index.html here in the mirror of such a site 
> then there will be no access to these files.
IMO, this is not correct. index.html will include the info the directory
listing contains at the point of download.
This works for me with znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ as well,
which seemed to be the problem according to your other post.

> Well, if wget "has to" put index.html in such situations then wget is not 
> suitable for mirroring such sites, 
What exactly do you mean? It seems to work for me, e.g. index.html looks
like the apache-generated directory listing. When mirroring, index.html will
be re-written if/when it has changed on the server since the last mirroring.

> and I expect that problem to be 
> corrected in future wget versions.
You "expect"??

Jens



Re: links conversion; non-existent index.html

2005-05-01 Thread Andrzej
> When you specify a directory, it is up to the web server to determine 
> what resource gets returned. Some web servers will return a directory 
> listing, some will return some file (such as index.html), and others will 
> return an error.  

I know! But that is intentionally left without index.html. It should 
display the content of the directory, and I want wget to mirror it 
correctly.
A similar situation is here:
http://chemfan.pl.feedle.com/arch/chemfanftp/
it is left intentionally without index.html so that people can download 
these archives. If wget puts an index.html here in the mirror of such a site 
then there will be no access to these files.
 
> If the web server returns any information, wget has to save the 
> information that is returned in *some* local file. It chooses to name 
> that local file "index.html" since it has no way of knowing where the 
> information might have actually been stored on the server.  

Well, if wget "has to" put index.html in such situations then wget is not 
suitable for mirroring such sites, and I expect that problem to be 
corrected in future wget versions.

a.


RE: links conversion; non-existent index.html

2005-05-01 Thread Tony Lewis
Andrzej wrote:

> Two problems:
>
> There is no index.html under this link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
[snip]
> it creates a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html

When you specify a directory, it is up to the web server to determine what
resource gets returned. Some web servers will return a directory listing,
some will return some file (such as index.html), and others will return an
error.

For example, Apache might return (in this order): index.html, index.htm, a
directory listing (or a 403 Forbidden response if the configuration
disallows directory listings). The actual list of files that Apache will
search for and the order in which they are selected is determined by the
configuration.

If the web server returns any information, wget has to save the information
that is returned in *some* local file. It chooses to name that local file
"index.html" since it has no way of knowing where the information might have
actually been stored on the server.
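
The naming rule described above, combined with the -nH and --cut-dirs options
used elsewhere in this thread, can be sketched as a small shell function. This
is only an illustration of the mapping as discussed here, not wget's actual
implementation: the name local_path is hypothetical, and query strings, -E and
all other options are ignored.

```shell
#!/bin/sh
# local_path URL CUT_DIRS
# Approximate the local file name used with -nH --cut-dirs=CUT_DIRS.
local_path() {
    url=$1
    cut=$2
    # Drop the scheme and the host name (-nH keeps the host out of the path).
    path=${url#*://}
    case $path in
        */*) path=${path#*/} ;;
        *)   path= ;;
    esac
    # A URL ending in "/" (or with no path at all) names a directory;
    # whatever the server returns for it is saved as index.html.
    case $path in
        ''|*/) path=${path}index.html ;;
    esac
    # Drop CUT_DIRS leading directory components, but never the file name.
    while [ "$cut" -gt 0 ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   break ;;
        esac
        cut=$((cut - 1))
    done
    printf '%s\n' "$path"
}

local_path http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ 1   # Ftp/UpLoad/index.html
local_path http://znik.wbc.lublin.pl/Mineraly/ 1              # index.html
```

The first call shows why the converted link ends in Ftp/UpLoad/index.html even
though no file of that name exists on the source server.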

Hope that helps,

Tony





Re: links conversion; non-existent index.html

2005-05-01 Thread Andrzej
> Yes, because this is in the HTML file itself:
> "http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html"
> It does not work in a browser, so why should it work in wget?

It works in the browser:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
There is no index.html and the content of the directory is displayed.

In the mirror mineraly.feedle.com it is different, as you can see.

> > > as you can see in the mirror:
> > > http://mineraly.feedle.com/
> Both the original site you are starting from
> (http://znik.wbc.lublin.pl/Mineraly/ ) and this mirror seem identical to me
> in this aspect?!

No, they are not! In the original it is:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
and in the mirror
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
The link was not converted properly, it should be:
http://mineraly.feedle.com/Ftp/UpLoad/
and it should be without any index.html, because there is none in the 
original.

Hrvoje mentioned that it works correctly under the newest Wget version 
though. So perhaps it is just a bug in the older version.

> > it changed the address to non-existent:
> > http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
> > and the picture is not displayed.
> > Why is it doing that?
> Probably because this is the way it is written in the HTML of the page:
> "http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif"
> Also, the image does not display in a browser, 
> so why should wget find it?

It displays!
That is the address of the image in the source:
http://znik.wbc.lublin.pl/ChemFan/Gify/ChemFan.gif
(source page is: http://znik.wbc.lublin.pl/Mineraly/ )

and that is in the mirror:
http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif

while it rather should be
http://mineraly.feedle.com/Gify/ChemFan.gif

> I hope I am not misunderstanding you, but I can see no fault in wget's
> behaviour. 

Hope you see it now. That is, however, in version 1.8.1.

a.


Re: links conversion; non-existent index.html

2005-05-01 Thread Hrvoje Niksic
"Andrzej" <[EMAIL PROTECTED]> writes:

> [Wget] creates a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
[...]
> it also created (when the above wget command was run for the first 
> time) from the original link to the gif file:
> http://znik.wbc.lublin.pl/ChemFan/Gify/ChemFan.gif
> a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
[...]
> That was under 1.8.1 version.

I've now tried this with Wget 1.9.1 and with the 1.10 alpha, and neither
problem is there.  I recommend an upgrade.

> (I tried and I could not install newer version in my dir)

What problem did you have?  Maybe someone can help.


Re: links conversion; non-existent index.html

2005-05-01 Thread Andrzej
> There is no index.html under this link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> 
> but when I mirror the whole http://znik.wbc.lublin.pl/Mineraly/ web site 
> with command:
> cd $HOME/web/mineraly
> wget -m -nv -k -K -E -nH --cut-dirs=1 -np -t 1000 -D wbc.lublin.pl -o 
> $HOME/logiwget/logmineraly -p http://znik.wbc.lublin.pl/Mineraly/ 
> 
> it creates a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> 
> as you can see in the mirror:
> http://mineraly.feedle.com/

That problem still persists.
 
> Besides it also created (when the above wget command was run for the first 
> time) from the original link to the gif file:
> http://znik.wbc.lublin.pl/ChemFan/Gify/ChemFan.gif
> a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
> 
> However, next time it did it correctly:
> http://mineraly.feedle.com/Gify/ChemFan.gif
> 
> That was under 1.8.1 version.

Correction. Today it did it wrong again.
Have a look at the main page:
http://mineraly.feedle.com/
it changed the address to non-existent:
http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
and the picture is not displayed.
Why is it doing that?

a.