Wget converts links correctly *only* for the first time [was Re: links conversion; non-existent index.html].
> Yup. So I assume that the problem you see is not that of wget mirroring, but
> a combination of saving to a custom dir (with --cut-dirs and the like) and
> conversion of the links. Obviously, the link to
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html which would be
> correct for a standard "wget -m URL" was carried over while the custom link
> to http://mineraly.feedle.com/Ftp/UpLoad/index.html was not created.
> My test with wget 1.5 just was a simple "wget15 -m -np URL" and it worked.
> So maybe the convert/rename problem/bug was solved with 1.9.1
> This would also explain the "missing" gif file, I think.

It's not the end of the troubles though! It works correctly *only* the first time!

When I (or cron) run the same mirroring commands again over the already mirrored files to refresh the mirror, the correctly converted link to the gif file (on the main mirror web page):
http://mineraly.feedle.com/Gify/ChemFan.gif
is replaced by the incorrect one:
http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif

You can see it in any mirror of the site.
Source: http://znik.wbc.lublin.pl/Mineraly/
The two mirrors:
http://mineraly.feedle.com/
http://mineraly.pg.gda.pl/

So it seems that there is a bug in version 1.9.1 after all. :(

These are the commands I run under wget 1.9.1 to mirror the whole Mineraly site:

cd $HOME/web/mineraly
/usr/local/bin/wget -m -nv -k -K -E -nH --cut-dirs=1 -np -t 1000 -D wbc.lublin.pl -o $HOME/logiwget/logmineraly -p http://znik.wbc.lublin.pl/Mineraly/ && \
cd $HOME/web/mineraly/arch && \
/usr/local/bin/wget -m -nv -k -K -E -nH -np --cut-dirs=2 -t 1000 -D lists.man.lodz.pl -o $HOME/logiwget/logmineralyarchive -p http://lists.man.lodz.pl/pipermail/mineraly/index.html && \
cp $HOME/web/domirrora/mineraly/Archiwum/index.html $HOME/web/mineraly/archiwum1/ && \
cd mineralyftp && \
/usr/local/bin/wget -m -nv -k -K -E -nH -np --cut-dirs=4 -t 1000 -D ftp.man.lodz.pl --follow-ftp -o $HOME/logiwget/logmineralyarchiveftp -p ftp://ftp.man.lodz.pl/pub/doc/LISTY-DYSKUSYJNE/MINERALY/ && \
cd .. && \
perl -pi -e 's{\Qftp://ftp.man.lodz.pl/pub/doc/LISTY-DYSKUSYJNE/MINERALY\E}{./mineralyftp/}g' index.html

There is yet another problem, with another link. Because, as you see above, I have to mirror this site with 3 wget sessions in order to have a complete mirror, I have to correct the links in two places. In one place I do it just by copying one file (with a manually edited link) over the other, and in the other I do it with the perl script. However, again, it works correctly only the first time.

On this mirror page:
http://mineraly.pg.gda.pl/arch/
(compare it with the mirrored source: http://lists.man.lodz.pl/pipermail/mineraly/ )
there should be a link:
http://mineraly.pg.gda.pl/arch/mineralyftp/
(and there was, after the first run of the commands quoted above), but now instead of that link there is an incorrect one:
http://lists.man.lodz.pl/pipermail/mineraly/mineralyftp/

The same error is in the other mirror:
http://mineraly.feedle.com/arch/
where there is an incorrect link:
http://lists.man.lodz.pl/pipermail/mineraly/mineralyftp/

Can someone solve those two mysteries?

a.
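A small check like the following might help to see, after each run, which mirrored pages still point back at the source host. It is only a sketch: the function name find_unconverted_links is made up for illustration, and it assumes that the mirrored pages end in .html (as they do with -E) and that the backups written by -K end in .orig, so they are not scanned.

```shell
#!/bin/sh
# List mirrored HTML files that still contain absolute links to the origin.
# Usage: find_unconverted_links MIRROR_DIR ORIGIN_PREFIX
# (hypothetical helper, not part of wget)
find_unconverted_links() {
    mirror_dir=$1
    origin_prefix=$2
    # -F: treat the prefix as a fixed string, not a regular expression.
    # -l: print only the names of files that contain a match.
    # Only names ending in .html are scanned, so *.orig backups are skipped.
    find "$mirror_dir" -type f -name '*.html' \
        -exec grep -lF -- "$origin_prefix" {} +
}
```

For example, running find_unconverted_links "$HOME/web/mineraly" 'http://znik.wbc.lublin.pl/Mineraly/' after the first and after the second mirroring run, and comparing the two lists, would show which pages lost their converted links.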
Re: links conversion; non-existent index.html
> The problem was that that link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> instead of being properly converted to:
> http://mineraly.feedle.com/Ftp/UpLoad/

Or, in fact, wget's default:
http://mineraly.feedle.com/Ftp/UpLoad/index.html

> was left like this on the main mirror page:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> and hence while clicking on it:
> "Not Found
> The requested URL /Mineraly/Ftp/UpLoad/index.html was not found on this
> server."

Yup. So I assume that the problem you see is not that of wget mirroring, but a combination of saving to a custom dir (with --cut-dirs and the like) and conversion of the links. Obviously, the link to
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
which would be correct for a standard "wget -m URL" was carried over while the custom link to
http://mineraly.feedle.com/Ftp/UpLoad/index.html
was not created.

My test with wget 1.5 just was a simple "wget15 -m -np URL" and it worked. So maybe the convert/rename problem/bug was solved with 1.9.1. This would also explain the "missing" gif file, I think.

Jens
Re: links conversion; non-existent index.html
"Jens Rösner" <[EMAIL PROTECTED]> writes:

>> Well, if wget "has to" put index.html in such situations then wget is not
>> suitable for mirroring such sites,
> What exactly do you mean? It seems to work for me, e.g. index.html looks
> like the apache-generated directory listing. When mirroring, index.html will
> be re-written if/when it has changed on the server since the last mirroring.

I agree. Wget's handling of an empty trailing path element (e.g. "http://server/dir1/dir2/") is not perfect, but in this case I think it does the right thing.
Re: links conversion; non-existent index.html
"Andrzej" <[EMAIL PROTECTED]> writes:

> Yes it works for me as well when I already mirrored it with 1.9.1
> version. Only then the two problems I described before
> disappeared. So it was the fault of the old 1.8.1 version.

Many bugs have been fixed from 1.8.1 to 1.9.1. It is always a good idea to try the latest (stable) version first; it saves a lot of time for both you and the people who respond to bug reports.

It is really regrettable that the still-in-use Debian "stable" ships with such an ancient Wget.
Re: links conversion; non-existent index.html
> Which link? The non-working one on your incorrect mirror or the working one
> on my correct mirror on my HDD?

The non-working one on my mirror.

> No need to get snappy, Andrzej.

You're right, I am *really* sorry!

> > The problem is
> > solved though by running the 1.9.1 wget version.
> I still am wondering, because even wget 1.5 correctly generates the
> index.html from the server output, when called on my local box.
> I really do not know what is happening on your remote machine, but my wget
> 1.5 is able to mirror the site. It creates the
> Mineraly/Ftp/UpLoad/index.html file and the correct link to it.
> I understand that it is not what you want (having an index.html), but wget
> 1.5 creates a working mirror - as it is supposed to do.

No, I don't mind having index.html as long as there is correct content in it. :)

The problem was that that link:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
instead of being properly converted to:
http://mineraly.feedle.com/Ftp/UpLoad/
was left like this on the main mirror page:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
and hence while clicking on it:
"Not Found
The requested URL /Mineraly/Ftp/UpLoad/index.html was not found on this server."

a.

PS. Still not everything is perfect. I think I will finally delete everything and start from scratch. :/
Re: links conversion; non-existent index.html
> > Wget saves a mirror to your harddisk. Therefore, it cannot rely on an apache
> > server generating a directory listing. Thus, it created an index.html as
>
> Apparently you have not tried to open that link,

Which link? The non-working one on your incorrect mirror or the working one on my correct mirror on my HDD?

> got it now?

No need to get snappy, Andrzej.

From your other mail:
> No, you did not understand. I run wget on remote machines.

Ah! Sorry, missed that.

> The problem is
> solved though by running the 1.9.1 wget version.

I still am wondering, because even wget 1.5 correctly generates the index.html from the server output, when called on my local box. I really do not know what is happening on your remote machine, but my wget 1.5 is able to mirror the site. It creates the Mineraly/Ftp/UpLoad/index.html file and the correct link to it. I understand that it is not what you want (having an index.html), but wget 1.5 creates a working mirror - as it is supposed to do.

CU
Jens
Re: links conversion; non-existent index.html
> Wget saves a mirror to your harddisk. Therefore, it cannot rely on an apache
> server generating a directory listing. Thus, it created an index.html as

Apparently you have not tried to open that link, otherwise you would have noticed that there was an error and that was why I was complaining, got it now?

> Tony Lewis explained. Now, _you_ uploaded (If I understood correctly) the
> copy from your HDD but did not save the index.html. Otherwise it would be
> there and it would work.

Not when downloading under 1.8.1 version or older.

a.
RE: links conversion; non-existent index.html
Probably because you're the only one that thinks it is a problem, instead of the way it needs to function? Nah, that couldn't be it.

Mark Post

-----Original Message-----
From: Andrzej Kasperowicz [mailto:[EMAIL PROTECTED]
Sent: Sunday, May 01, 2005 2:54 PM
To: Jens Rösner; wget@sunsite.dk
Subject: Re: links conversion; non-existent index.html

-snip-
> You "expect"??

Yes, of course. Why are you so surprised?

a.
Re: links conversion; non-existent index.html
> IMO, this is not correct. index.html will include the info the directory
> listing contains at the point of download.
> This works for me with znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ as well -
> what seemed to be problem according to your other post.

Yes, it works for me as well now that I have mirrored it with version 1.9.1. Only then did the two problems I described before disappear. So it was the fault of the old 1.8.1 version.

> What exactly do you mean? It seems to work for me, e.g. index.html looks
> like the apache-generated directory listing. When mirroring, index.html will
> be re-written if/when it has changed on the server since the last mirroring.

Try it with ver. 1.8.1 or older and it will not work.

> > and I expect that problem to be
> > corrected in future wget versions.
> You "expect"??

Yes, of course. Why are you so surprised?

a.
Re: links conversion; non-existent index.html
The following Unix command removes all files called "index.html" in the current directory and below:

find . -type f -name 'index.html' -print -exec rm -f {} \;

That might be one way to solve your problem.

// Ulf Härnhammar
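Since that command also removes real index.html pages, a more cautious variant might first list only the files that look like server-generated directory listings. The sketch below assumes such pages carry a title beginning with "Index of", as Apache's default listings do; other servers may format them differently, and the function name list_generated_indexes is invented for this example. It only prints names and deletes nothing.

```shell
#!/bin/sh
# Print index.html files under DIR that look like auto-generated listings.
# Usage: list_generated_indexes DIR
# (hypothetical helper; the "Index of" title is an assumption about the server)
list_generated_indexes() {
    # -i: match <title> or <TITLE>; -l: print file names only.
    find "$1" -type f -name 'index.html' \
        -exec grep -il '<title>Index of ' {} +
}
```

The resulting list can be reviewed by hand before anything is passed to rm.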
Re: links conversion; non-existent index.html
Do I understand correctly that the mirror at feedle is created by you and wget?

> > Yes, because this is in the HTML file itself:
> > "http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html"
> > It does not work in a browser, so why should it work in wget?
> It works in the browser:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> There is no index.html and the content of the directory is displayed.

I assume I was confused by the different sites you wrote about. I was sure that both included the same link to ...index.html and the same gif-address.

> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> The link was not converted properly, it should be:
> http://mineraly.feedle.com/Ftp/UpLoad/
> and it should be without any index.html, because there is none in the
> original.

Wget saves a mirror to your harddisk. Therefore, it cannot rely on an apache server generating a directory listing. Thus, it created an index.html as Tony Lewis explained. Now, _you_ uploaded (If I understood correctly) the copy from your HDD but did not save the index.html. Otherwise it would be there and it would work.

Jens
Re: links conversion; non-existent index.html
> I know! But that is intentionally left without index.html. It should
> display content of the directory, and I want that wget mirror it
> correctly.
> Similar situation is here:
> http://chemfan.pl.feedle.com/arch/chemfanftp/
> it is left intentionally without index.html so that people could download
> these archives.

Is something wrong with my browser? This does not look like a simple directory listing; this file has formatting and even a background image.
http://chemfan.pl.feedle.com/arch/chemfanftp/
looks the same as
http://chemfan.pl.feedle.com/arch/chemfanftp/index.html
in my Mozilla and wget downloads it correctly.

> If wget put here index.html in the mirror of such site
> then there will be no access to these files.

IMO, this is not correct. index.html will include the info the directory listing contains at the point of download. This works for me with znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ as well - what seemed to be the problem according to your other post.

> Well, if wget "has to" put index.html in such situations then wget is not
> suitable for mirroring such sites,

What exactly do you mean? It seems to work for me, e.g. index.html looks like the apache-generated directory listing. When mirroring, index.html will be re-written if/when it has changed on the server since the last mirroring.

> and I expect that problem to be
> corrected in future wget versions.

You "expect"??

Jens
Re: links conversion; non-existent index.html
> When you specify a directory, it is up to the web server to determine
> what resource gets returned. Some web servers will return a directory
> listing, some will return some file (such as index.html), and others will
> return an error.

I know! But that directory is intentionally left without index.html. It should display the content of the directory, and I want wget to mirror it correctly.

A similar situation is here:
http://chemfan.pl.feedle.com/arch/chemfanftp/
It is left intentionally without index.html so that people can download these archives. If wget puts an index.html here in the mirror of such a site, then there will be no access to these files.

> If the web server returns any information, wget has to save the
> information that is returned in *some* local file. It chooses to name
> that local file "index.html" since it has no way of knowing where the
> information might have actually been stored on the server.

Well, if wget "has to" put index.html in such situations then wget is not suitable for mirroring such sites, and I expect that problem to be corrected in future wget versions.

a.
RE: links conversion; non-existent index.html
Andrzej wrote:

> Two problems:
>
> There is no index.html under this link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
[snip]
> it creates a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html

When you specify a directory, it is up to the web server to determine what resource gets returned. Some web servers will return a directory listing, some will return some file (such as index.html), and others will return an error.

For example, Apache might return (in this order): index.html, index.htm, a directory listing (or a 403 Forbidden response if the configuration disallows directory listings). The actual list of files that Apache will search for and the order in which they are selected is determined by the configuration.

If the web server returns any information, wget has to save the information that is returned in *some* local file. It chooses to name that local file "index.html" since it has no way of knowing where the information might have actually been stored on the server.

Hope that helps,

Tony
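To make the naming rule concrete, here is a rough sketch of how a URL ends up as a local file name when wget is run with -nH and --cut-dirs=N, as in the commands discussed in this thread. It is an illustration only, not wget's actual code: the function name local_name is made up, and query strings, -E renaming, and URLs with no path at all are ignored.

```shell
#!/bin/sh
# Approximate the local file name for URL under "-nH --cut-dirs=CUT".
# Usage: local_name URL CUT
# (hypothetical helper written for this explanation, not part of wget)
local_name() {
    url=$1
    cut=$2
    path=${url#*://}    # drop the scheme ("http://")
    path=${path#*/}     # drop the host name, as -nH does
    # A URL ending in "/" names a directory; whatever the server returns
    # for it is stored in a file called index.html.
    case $path in
        */|'') path=${path}index.html ;;
    esac
    # Drop up to CUT leading directory components, as --cut-dirs does.
    while [ "$cut" -gt 0 ]; do
        case $path in
            */*) path=${path#*/} ;;
            *) break ;;
        esac
        cut=$((cut - 1))
    done
    printf '%s\n' "$path"
}
```

With this sketch, local_name 'http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/' 1 prints Ftp/UpLoad/index.html, which is the file a converted link on the mirror has to point to.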
Re: links conversion; non-existent index.html
> Yes, because this is in the HTML file itself:
> "http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html"
> It does not work in a browser, so why should it work in wget?

It works in the browser:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
There is no index.html and the content of the directory is displayed. In the mirror mineraly.feedle.com it is different, as you can see.

> > > as you can see in the mirror:
> > > http://mineraly.feedle.com/
> Both the original site you are starting from
> (http://znik.wbc.lublin.pl/Mineraly/ ) and this mirror seem identical to me
> in this aspect?!

No, they are not! In the original it is:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
and in the mirror:
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
The link was not converted properly, it should be:
http://mineraly.feedle.com/Ftp/UpLoad/
and it should be without any index.html, because there is none in the original.

Hrvoje mentioned that it works correctly under the newest Wget version though. So perhaps it is just a bug in the older version.

> > it changed the address to non-existent:
> > http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
> > and the picture is not displayed.
> > Why is it doing that?
> Probably because this is the way it is written in the HTML of the page:
> "http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif"
> Also, the image does not display in a browser,
> so why should wget find it?

It displays! This is the address of the image in the source:
http://znik.wbc.lublin.pl/ChemFan/Gify/ChemFan.gif
(source page is: http://znik.wbc.lublin.pl/Mineraly/ )
and this is in the mirror:
http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
while it rather should be:
http://mineraly.feedle.com/Gify/ChemFan.gif

> I hope I am not misunderstanding you, but I can see no fault in wget's
> behaviour.

Hope you see it now. That is however in the 1.8.1 version.

a.
Re: links conversion; non-existent index.html
"Andrzej" <[EMAIL PROTECTED]> writes:

> [Wget] creates a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
[...]
> it also created (when the above wget command was run for the first
> time) from the original link to the gif file:
> http://znik.wbc.lublin.pl/ChemFan/Gify/ChemFan.gif
> a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
[...]
> That was under 1.8.1 version.

I've now tried this with Wget 1.9.1 and with the 1.10 alpha, and neither problem is there. I recommend an upgrade.

> (I tried and I could not install newer version in my dir)

What problem did you have? Maybe someone can help.
Re: links conversion; non-existent index.html
> There is no index.html under this link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
>
> but when I mirror the whole http://znik.wbc.lublin.pl/Mineraly/ web site
> with command:
> cd $HOME/web/mineraly
> wget -m -nv -k -K -E -nH --cut-dirs=1 -np -t 1000 -D wbc.lublin.pl -o
> $HOME/logiwget/logmineraly -p http://znik.wbc.lublin.pl/Mineraly/
>
> it creates a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
>
> as you can see in the mirror:
> http://mineraly.feedle.com/

That problem still persists.

> Besides it also created (when the above wget command was run for the first
> time) from the original link to the gif file:
> http://znik.wbc.lublin.pl/ChemFan/Gify/ChemFan.gif
> a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
>
> However, next time it did it correctly:
> http://mineraly.feedle.com/Gify/ChemFan.gif
>
> That was under 1.8.1 version.

Correction. Today it did it wrong again. Have a look at the main page:
http://mineraly.feedle.com/
It changed the address to the non-existent:
http://znik.wbc.lublin.pl/Mineraly/Gify/ChemFan.gif
and the picture is not displayed.

Why is it doing that?

a.