Re: missing files
On Tuesday 09 May 2006 06:18, you wrote:
> Hi all, I have a problem: I'm trying to download an entire directory from a
> site using the command wget -r -I directory_name site_name. It seems to
> work, but at a certain point it stops, and I'm sure some files are missing
> that I can download manually, files just like the others that wget can
> download. Any clue about that? Thanks

Have you checked to see if they have a robots.txt file that may restrict the download? If it does, you'll have to turn off robots with '-e robots=off' on the command line.

Curtis
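[Aside: a robots.txt file can disallow individual paths, which would explain why wget silently skips some files in a directory while fetching their siblings. A hypothetical rule set of this kind, with placeholder paths, might look like:]

```text
User-agent: *
Disallow: /directory_name/private/
Disallow: /directory_name/report.pdf
```

[With '-e robots=off', wget ignores these rules entirely and fetches whatever it finds links to.]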
Re: missing files
Even in this case, how is it possible to discriminate between files? I mean, why can I download some files but not others that have similar features? It sounds strange to me... Thanks anyway

G

On Wednesday 10 May 2006 15:15, Curtis Hatter wrote:
> Have you checked to see if they have a robots.txt file that may restrict
> the download? If it does you'll have to turn off robots, '-e robots=off'
> on the command line.
>
> Curtis

--
To me, a cultured man is not one who knows when Napoleon was born, but one who knows where to go to find that information in the only moment of his life when he needs it, and in two minutes. Umberto Eco
Re: missing files
[...]
> Any clue about that?

Not in your posting. You might say which Wget version you're using, on which sort of system, and which files are not getting fetched, and then show the links to those files in the HTML which Wget should have followed. Without some actual information about what's happening (clues), it's not possible to say much that might be useful.

Steven M. Schweda [EMAIL PROTECTED]
382 South Warwick Street (+1) 651-699-9818
Saint Paul MN 55105-2547
Re: missing files
On Wednesday 10 May 2006 09:28, you wrote:
> Even in this case, how is it possible to discriminate between files? I
> mean, why can I download some files but not others that have similar
> features? It sounds strange to me... Thanks anyway
> G

Check the link: http://www.robotstxt.org/wc/norobots-rfc.txt

It explains how one can craft a robots.txt file to keep programs like Wget or LWP from fetching specific documents. As Steven noted: what platform are you running on? What version of wget? Which links won't Wget download? What is the site? If the material is acceptable for a company to download, and it's not very large, I can try to download it and see if I can recreate your problem.

Curtis
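[Aside: Python's standard library can show the mechanism Curtis describes. The sketch below, using made-up host and path names, feeds a two-line robots.txt to `urllib.robotparser` and checks which URLs a crawler identifying as Wget would be allowed to fetch; one subdirectory is blocked while its siblings are not, which matches the "some files but not others" symptom.]

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one subdirectory but not its siblings.
rules = [
    "User-agent: *",
    "Disallow: /metiers/environnement/private/",
]

rp = RobotFileParser()
rp.parse(rules)

base = "http://www.example.com"  # placeholder host

# Sibling file outside the disallowed path: allowed.
print(rp.can_fetch("Wget", base + "/metiers/environnement/index.html"))      # True
# File inside the disallowed path: refused.
print(rp.can_fetch("Wget", base + "/metiers/environnement/private/a.html"))  # False
```

[Wget applies the same matching logic when robots processing is on, which is why '-e robots=off' changes what gets fetched.]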
Re: missing files
Thanks a lot, Curtis. Unfortunately, the material is very large. Anyway, I will check the link, and I'm already using your suggestion. I will keep you informed. Thanks again.

G

On Wednesday 10 May 2006 15:41, Curtis Hatter wrote:
> Check the link: http://www.robotstxt.org/wc/norobots-rfc.txt
>
> It explains how one can craft a robots.txt file to keep programs like Wget
> or LWP from fetching specific documents.
[...]
missing files
Hi all, I have a problem: I'm trying to download an entire directory from a site using the command wget -r -I directory_name site_name. It seems to work, but at a certain point it stops, and I'm sure some files are missing that I can download manually, files just like the others that wget can download. Any clue about that? Thanks
Re: MISSING FILES USING WGET
As far as I understand, the problem is that the missing files are not directly referenced in the page, but only via javascript, which wget cannot follow. However, in my case, I know where the missing files are located (they are in a subdirectory). So what I would need is another script that could download all the files contained in a given directory, down to a given level of subdirectory. I think that this, together with the use of wget, would make it possible to download everything required to mirror the site. Do you know if this is possible? Do you know any such script? (please answer to [EMAIL PROTECTED])

Thanks

Thierry Pichevin
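[Aside: if the server publishes an auto-generated index page for that subdirectory, the "script" Thierry asks for can be quite small. A minimal sketch, assuming such an index page exists: collect every href from the page with Python's `html.parser`, resolving relative links against the directory URL. Here a hand-written snippet of HTML stands in for the page a real run would fetch with `urllib.request`.]

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags in a directory index page."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip query-only and fragment-only links; resolve the rest.
                if name == "href" and value and not value.startswith(("?", "#")):
                    self.links.append(urljoin(self.base, value))

# Stand-in for a fetched index page; a real run would use
# urllib.request.urlopen(base).read().decode() instead.
page = '<a href="temoignage.html">t</a> <a href="sub/">sub</a> <a href="#top">top</a>'
c = LinkCollector("http://www.apec.asso.fr/metiers/environnement/")
c.feed(page)
print(c.links)
```

[Each collected URL could then be fetched directly (or handed to wget with -i); entries ending in '/' are subdirectories to recurse into, which gives the "down to a given level" behaviour.]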
Re: MISSING FILES USING WGET
> If no link points to the document you're interested in, then wget can't
> possibly know about its existence. Unless you tell it on the command line.

And how can this be achieved? Thanks!

T. Pichevin
MISSING FILES USING WGET
Dear everybody,

I am trying to use Wget to make a mirror of: http://www.apec.asso.fr/metiers/environnement

I used the command: wget -r -l6 -np -k http://www.apec.asso.fr/metiers/environnement

1. Small problem: it creates a directory tree www.apec.asso.fr/metiers/environnement, whereas I would have expected only the subdirectories of 'environnement' to come.

2. Big problem: many files don't come in: for example the file 'environnement/directeur_environnement/temoignage.html'. This file is normally reached from the main page by clicking 'directeur_environnement' (under the title "communication et mediation") and, on the next page, by clicking 'Délégué Régional de l'Ademe Haute-Normandie' (under the title 'temoignage', on the right). Note that other files in 'environnement/directeur_environnement/' do come in... The missing files seem to have a common feature: they are viewed via a popup window when clicking on the link. Is this the problem?

Please answer to [EMAIL PROTECTED]

Thanks

Thierry Pichevin
Re: MISSING FILES USING WGET
Quoting Thierry Pichevin ([EMAIL PROTECTED]):

> I used the command: wget -r -l6 -np -k http://www.apec.asso.fr/metiers/environnement
>
> 1. Small problem: it creates a directory tree
> www.apec.asso.fr/metiers/environnement, whereas I would have expected only
> the subdirectories of 'environnement' to come.

This is the general behaviour of wget. If you want to get just the subdirectories, you will need to use `--cut-dirs' and `--no-host-directories'.

> 2. Big problem: many files don't come in: for example the file
> 'environnement/directeur_environnement/temoignage.html'. [...] The missing
> files seem to have a common feature: they are viewed via a popup window
> when clicking on the link. Is this the problem?

These URLs are actually javascript calls. Wget ignores javascript as it cannot interpret it in any way. It would probably be possible to modify wget's internal HTML parser to try some heuristic to extract possible URLs from a `javascript:' URL, but no one has written the code yet.

-- jan

Jan Prikryl | vr|vis center for virtual reality and visualisation
[EMAIL PROTECTED] | http://www.vrvis.at
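[Aside: the heuristic Jan describes can be sketched outside of wget. A rough, assumption-laden example: the HTML below is invented to resemble the popup links on the site, and a regular expression pulls any quoted path ending in .html out of javascript: hrefs. The extracted paths could then be fed back to wget with -i.]

```python
import re

# Hypothetical anchors of the kind wget skips: the real target is buried
# inside a javascript: pseudo-URL, e.g. a window.open() call.
html = """
<a href="javascript:window.open('directeur_environnement/temoignage.html','pop')">temoignage</a>
<a href="autre.html">autre</a>
"""

# Crude heuristic: grab any single-quoted string that looks like an .html
# path inside a javascript: href. This is exactly the sort of guesswork a
# patched HTML parser would have to do.
js_links = re.findall(r"href=\"javascript:[^\"]*?'([^']+\.html)'", html)
print(js_links)  # ['directeur_environnement/temoignage.html']
```

[Being a heuristic, it will miss targets built by string concatenation or function calls; only a real javascript interpreter could handle those, which is presumably why no one has written the wget patch.]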