Re: how to follow incorrect links?
Hi Tomasz!

There are some websites with backslashes instead of slashes in links, for instance:

  a href=photos\julia.jpg   instead of   a href=photos/julia.jpg

Internet Explorer can repair such addresses. My own assumption is: it repairs them because Microsoft introduced that #censored# way of writing HTML. Anyway, this will not help you, I know. I think you should email the webmaster and tell him/her about the errors.

How to make wget follow such addresses? Directly, I think it is impossible. I can think of one workaround:

1. Start wget -nc -r -l0 -p URL
2. After it finishes, replace all \ with / in the downloaded htm(l) files. This will make the HTML files correct.
3. Start wget -nc -r -l0 -p URL again. wget will now parse the downloaded and corrected HTML files instead of the broken files on the net.
4. Continue this procedure until wget does not download any more files.

I do not know how handy you are in your OS, but this should be doable with one or two small batch files. Maybe one of the pros has a better idea. :)

CU
Jens (just another user)
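A minimal sketch of steps 1-4 as a Unix shell script (the loop, the example URL, and the helper name are mine, not from the original post; on Windows the same idea would go into a batch file):

```shell
#!/bin/sh
# Replace every backslash with a forward slash (reads stdin, writes stdout).
fix_backslashes() {
    sed 's|\\|/|g'
}

# The fetch/fix loop runs only when a URL is passed as the first argument,
# e.g.:  ./mirror-fix.sh http://example.com/
if [ -n "$1" ]; then
    URL=$1
    while :; do
        before=$(find . -type f | wc -l)      # file count before this pass
        wget -nc -r -l0 -p "$URL"             # -nc keeps the already-fixed files
        # correct the backslash links in every saved HTML file
        find . -name '*.htm*' | while read -r f; do
            fix_backslashes < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        done
        after=$(find . -type f | wc -l)       # file count after this pass
        [ "$after" -eq "$before" ] && break   # stop when nothing new arrived
    done
fi
```

The stop condition (comparing file counts) is just one way to detect "no more new files"; checking wget's log for new downloads would work as well.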
Re: -i problem: Invalid URL h: Unsupported scheme
Hi Mike!

> on the command line, but if I copy this to, e.g., test.txt and try
> wget -i test1.txt

Well, make sure that:
a) you actually have test.txt, not test1.txt (or vice versa);
b) your txt file contains only URLs like http://... - no options;
c) you save the txt file with a pure ASCII editor like Notepad, not Wordpad (it works with Wordpad, but Notepad is better);
d) you use -i and not -I (as you wrote in your first line - wget is case-sensitive).

> Does anyone else have this problem?

At least not me.

CU
Jens (just another user)
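For illustration, this is what such a list file should look like (the two URLs are placeholders; the wget call itself is shown as a comment):

```shell
# Create a plain-ASCII URL list: one URL per line, nothing else.
cat > test.txt <<'EOF'
http://www.gnu.org/
http://www.example.com/index.html
EOF

# Then feed it to wget with lowercase -i:
#   wget -i test.txt
```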
Re: Images in a CSS file
Hi Deryck!

As far as I know, wget cannot parse CSS code (and no JavaScript code either). It has been requested often, but so far no one has tackled this (probably rather huge) task.

CU
Jens (just another user)

> Hello, I can make wget copy the necessary CSS files referenced from a
> webpage, but is it possible to make it extract any images referenced
> from within the CSS file? Thanks
Re: Bug (wget 1.8.2): Wget downloads files rejected with -R.
Hi Jason!

If I understood you correctly, this quote from the manual should help you:

  Note that these two options [accept and reject based on filenames] do
  not affect the downloading of HTML files; Wget must load all the HTMLs
  to know where to go at all -- recursive retrieval would make no sense
  otherwise.

If you are seeing wget behaviour different from this, please a) update your wget and b) provide more details on where/how it happens.

CU and good luck!
Jens (just another user)

> When the -R option is specified to reject files by name in recursive
> mode, wget downloads them anyway and then deletes them after
> downloading. This is a problem when you are trying to be picky about
> the files you are downloading to save bandwidth. Since wget appears to
> know the name of the file it is downloading before it is downloaded
> (even if the specified URL is redirected to a different filename), it
> should not bother downloading the file at all if it is going to delete
> it immediately after downloading it.
> - Jason Cipriani
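A command sketch to see the documented behaviour for yourself (host and pattern are placeholders, not from the original post):

```
# HTML pages matched by -R are still fetched, because wget needs them to
# extract further links, and are deleted afterwards; non-HTML files
# matched by -R are never requested at all.
#   wget -r -l1 -R '*.html' http://example.com/
# The log then shows a "Removing ... since it should be rejected."
# message for each page that was fetched for link extraction only.
```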
Re: get link Internal Server Error
For me this link does NOT work in IE 6.0, the latest Mozilla, or the latest Opera either. So I tested a bit further. If you go to the site, reach http://www.interwetten.com/webclient/start.html and then use the URL you provided, it works. A quick check for stored cookies revealed that two cookies are stored. So you have to use wget with cookies. For info on how to do that, see the manual.

CU
Jens

> hi all: Some links open fine in IE, but wget fails to download them. I
> can't find a solution; I think it may be a bug. Example link:
> http://www.interwetten.com/webclient/betting/offer.aspx?type=1&kindofsportid=10&L=EN
> This link opens fine in IE, but with wget I get:
> Connecting to www.interwetten.com[213.185.178.21]:80... connected.
> HTTP request sent, awaiting response... 500 Internal Server Error
> 01:02:27 ERROR 500: Internal Server Error.
> henryluo
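A sketch of the cookie dance with wget's cookie options (option names as in recent wget manuals - check yours; cookies.txt is an arbitrary file name):

```
# 1. Visit the start page so the server can set its two cookies,
#    saving them to a file:
#      wget --save-cookies cookies.txt --keep-session-cookies \
#           http://www.interwetten.com/webclient/start.html
# 2. Then request the deep link with those cookies sent along:
#      wget --load-cookies cookies.txt \
#           'http://www.interwetten.com/webclient/betting/offer.aspx?type=1&kindofsportid=10&L=EN'
```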
wget 1.9 for windows: no debug support?
Hi!

While I was testing the new wget 1.9 binary for Windows from http://space.tin.it/computer/hherold/ I noticed that it was slow if a URL specified within -i list.txt did not exist: it would wait a long time before trying the next URL in the list. To find out what was happening, I specified -d for debug output. The message was:

  debug support not compiled in

and wget continued with normal downloading. Is this an oversight, or does it serve a purpose?

CU
Jens
Re: -A acceptlist (not working right)
Hi Samuel!

> You The Man!

Pssst! Don't tell everyone! :D

> Then I did
> wget -nc -x -r -l0 -p -np -t10 -k -nv -A.gif,.htm,.html http://URL
> This also worked. Then I began trying to figure out what the hell was
> wrong. I added .cfm to the list and got an empty folder. Tried .shtml
> and .phtml, got the same thing. Apparently it does not like those in
> the accept list. Try adding those, see what you get.

Hm, I don't have .cfm/.shtml files handy at the moment. So I just added the extensions to the list, and it worked the same as before.

> Then I tried putting html at the end of my list, putting the other
> stuff in front of it, and voila. Strange, very strange. Also, your -R
> example worked great, but if I add a .mov to the end of the list, it
> nullifies all the other reject commands. If I move the .mov to the
> front, it works?

Seems to make no difference for me. Maybe this is a bug that only shows up on your OS? But I cannot imagine how that would be possible.

> Questions: How can I direct the files into another directory other
> than /user/username?

Try -Pdir1/dir2 or (works at least in Windows) -P../dir1/dir2 - the ../ goes up one level. Currently -P does not allow a change of drive, but I think the coders are working on that (right?).

> How can I just blow it all into one directory without the folder
> structure? Filenames and structure are not a problem, I just need the
> data in the files.

Try -nd (for "no directories"). BTW, -nd is the default (so to say) for single files (wget http://host.com/page.html).

> and nights... Have a virtual drink on me.

Cheers! :)

CU
Jens
http://www.jensroesner.de/wgetgui/
Re: A strange bit of HTML
Hi there!

> <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a
> href=66B27885.htm msover1('Pic1','thumbnails/MO66B27885.jpg');
> onMouseOut=msout1('Pic1','thumbnails/66B27885.jpg');><img
> SRC=thumbnails/66B27885.jpg NAME=Pic1 BORDER=0></a></td>

BTW: it is valign=middle :P (And I detest ALLCAPS and property=value instead of property="value".)

> That sounds like they wanted onMouseOver=msover1(...)
> It's also likely that msover1 is a Javascript function :-(

Definitely, I would say.

> I can't call this a bug, but is Wget doing the right thing by ignoring
> the href altogether?
> Until there's an ESP package that can guess what the author intended,
> I doubt wget has any choice but to ignore the defective tag.

*g* Seriously, I think you guys are too strict. Similar discussions have spawned numerous times. If the HTML code says

  <a href=URL yaddayada my-Mother=Shopping%5 going>supermarket</a>

why can't wget just ignore everything after ...URL? Is there any instance where this would create unwanted behaviour for the user? It does not matter if there is a JavaScript call, a broken CSS, or whether the webmaster has bad breath. Now, if a mouseover picture is loaded, wget cannot retrieve it anyway, no matter whether the JavaScript is correct or malformed, right?

> In addition, wget should send an email to webmaster@offending domain,
> complaining about the invalid HTML :-)

/me signs this petition! In addition, mails should be written for bad (=unreadable) combos of font colour and background colour, animated GIFs and blink tags.

Kind regards
Jens
Re: A strange bit of HTML
Hi Hrvoje!

First, I did/do not mean to offend/attack you - just in case my suspicion that you were pi55ed because of my post is not totally unjustified.

> > If the HTML code says
> >   <a href=URL yaddayada my-Mother=Shopping%5 going>supermarket</a>
> > why can't wget just ignore everything after ...URL?
> Because, as he said, Wget can parse text, not read minds.

Ah *slapsforehead* /me stupid.

> For example, you must know where a tag ends to be able to look for the
> next one, or to find comments. It is not enough to look for '>' to
> determine the tag's ending -- something like <img alt="my > dog"
> src="foo"> is a perfectly legal tag.

Okok, granted: to dissolve

  <a href="foo.html" target="_top"><img src="pic.htm.jpg"
  name="index.html" alt="oops"><br>-fool.htm-</a>

for example, you'd really have a hard time, I suppose. I honestly did not think of people messing with quotes and '>'.

> As for us being strict, I can only respond with a mini-rant... Wget
> doesn't create web standards, but it tries to support them. Spanning
> the chasm between the standards as written and the actual crap
> generated by HTML generators feels a lot like shoveling shit.

[rant name=my rant]
Ah, tell me about it. Although I come from the other side (trying to write my sites - with a text editor - so that they look OK in different browsers and remain HTML compliant), I surely know how much 'fun' it can be to work with standards. Especially if they were set by a committee as intelligent and just (as in justice) as the W3C... BTW, as an engineering student I am fully aware of how much help good standards can be.
[/rant]

> Some amount of shoveling is necessary and is performed by all small
> programs to protect their users, but there has to be a point where you
> draw the line. There is only so much shit Wget can shovel.

Unfortunately, the amount of shit on the web will not decrease. I fear that the opposite may be true. No, wait, I am pretty sure...

> I'm not saying Ian's example is where the line has to be drawn.
> (Your example is equivalent to Ian's -- Wget would only choke on the
> last "going" part). But I'm sure that the line exists and that it is
> not far from those two examples.

OK, but do I understand you correctly that these two examples (mine was intended to be equivalent, but without JS) should be on the "parse and retrieve" side of this line, not on the "ignore and blame Frontpage" side?

CU
Jens
Re: -H suggestion
Hi!

Once again I think this has nothing to do on the bug list, but there you go:

> I've toyed with the idea of making a flag to allow `-p' to span hosts
> even when normal download doesn't.

Funny you mention this. When I first heard about -p (1.7?), I thought that this would be exactly its default behaviour. I think it would be really useful if the page requisites could be wherever they want. I mean, -p already ignores -np (since 1.8?), which I think is also very useful.

> > The -i switch provides for a file listing the URLs to be downloaded.
> > Please provide for a list file for URLs to be avoided when -H is
> > enabled.
> URLs to be avoided? Given that a URL can be named in more than one
> way, this might be hard to do.

Sorry, but does --exclude-domains (or similar, I don't have the docs here ATM) not do exactly this? I may well be missing the point here. But with disallowing hosts and dirs you should be able to do this. Or is the problem loading the lists from an external file? Then please ignore my comment, I have no experience with that.

CU
Jens
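For reference, host exclusion already works like this on the command line (a sketch; the domain names are placeholders):

```
# Span hosts, but never wander into the two listed domains:
#   wget -r -H --exclude-domains ads.example.com,counter.example.net \
#        http://www.example.org/
```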
What was that?
Hi guys!

Today I found something strange. First: I am using MS WinME (okok) and Netscape (okok). I was downloading a 3000 kB zip file from a page at audi.com via right-click. After (!) I had finished downloading, I thought "Hey, why not use wGet(GUI) for it." Smart, huh? One file, already downloaded... But there was another file I wanted to try in the first place. That did not work, because it was a streaming mov :( (I guess that would be very difficult to implement?)

Anyway, when wGet was downloading the zip I had already downloaded (to another directory on another drive), the average speed was 850 kB/sec! No, I am !not! sitting in the LAN of Audi.de or anything! I have fast ethernet LAN access through uni, but I am in New Zealand, so I highly doubt that such a speed is possible. Anyway, the file is there, it works, it is strange. Can anyone explain to a stupid Windows user what happened there? Does wget access the internal Netscape cache? (Nahh, can't be...) I cannot provide you with a debug or -v log file. :(

CU
Jens *still confused*
http://www.jensroesner.de/wgetgui/
Re: What was that? Proxy!
Hi guys!

Yes, you are all right: the proxy is the answer. I feel stupid now. /me goes to bed, maybe that helps! :| Thanks anyway! :) Until the next intelligent question :D

CU
Jens

Man, I really hate ads like the following:
--
GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
Re: bug?
Hi Tomas!

> I see, but then, how to exclude things from being downloaded on a
> per-file basis?

First, let me be a smartass: go to http://www.acronymfinder.com and look up RTFM. Then proceed to the wget docs. wget offers download restrictions by host, directory and file name. Search the docs for:

  -H  -D  --exclude-domains
  `-A ACCLIST'  `--accept ACCLIST'   `accept = ACCLIST'
  `-R REJLIST'  `--reject REJLIST'   `reject = REJLIST'
  `-I LIST'     `--include LIST'     `include_directories = LIST'
  `-X LIST'     `--exclude LIST'     `exclude_directories = LIST'

CU
Jens
http://www.jensroesner.de/wgetgui
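Two command sketches for the file-name variants (hosts and extensions are placeholders):

```
# Accept only images (plus the HTML wget must load for link extraction):
#   wget -r -l0 -A .gif,.jpg http://example.com/gallery/
# Reject movie files while mirroring everything else:
#   wget -r -l0 -R .mov,.avi http://example.com/
```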
Re: meta noindex
Hi Tomas!

> Thanks a lot, but unfortunately that didn't work... I just do a simple
> wget -r http://localhost
> and my unwanted file is included all the time... Hm :( Have you got it
> to work with 1.7, or are you using the CVS version?

CVS? No, I am just a stupid Windows user with some binaries from Heiko ;) Are you sure your wgetrc is recognized, i.e. that it is in the right directory? From my experience I also noted that some servers block wget because it is not a browser. That is why I made wGetGUI let the user choose to have wget ignore robots.txt and identify as a Mozilla browser. That normally should work. Maybe you have to try both at the same time for your problem? Right now I am a bit puzzled by what you meant with "I can't get wget 1.7 to react on the following" - I thought you wanted wget to ignore robots?! Correct?

Good luck
Jens
http://www.jensroesner.de/wgetgui

-----Original Message-----
From: Jens Roesner [mailto:[EMAIL PROTECTED]]
Sent: 29 October 2001 23:21
To: Tomas Hjelmberg
Subject: Re: meta noindex

Hi Tomas!
Put robots = off in your wgetrc. You cannot use it on the command line, if I am not mistaken. I think it was introduced in 1.7, so you should have no problems.
Good luck
Jens
http://www.jensroesner.de/wgetgui

Tomas Hjelmberg wrote:
> Hi, I can't get wget 1.7 to react on the following:
> <html><head><META name="robots" content="noindex, nofollow"></head>
> ... </html>
> Cheers /Tomas
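A minimal wgetrc sketch for the robots part ("robots = off" is the documented setting; the file location varies by platform, and identifying as a browser is shown on the command line since the -U option is the form I am sure of):

```
# ~/.wgetrc (Windows ports usually look for an ini file next to wget.exe)
robots = off
```

With that in place, something like `wget -U "Mozilla/4.0 (compatible)" -r http://localhost/` would also send a browser-like User-Agent; note that the robots setting covers both robots.txt and the META robots tag.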
Re: Recursive retrieval of page-requisites
Hi wGetters!

> > I want to download a subtree of HTML documents from a foreign site,
> > complete with page-requisites, for offline viewing. I.e. all HTML
> > pages from this point downwards, PLUS all the images (etc.) they
> > refer to -- no matter where they are in the directory tree.
> This is cheating,

What does "cheating" mean here? I know the normal meaning of cheating, but I do not understand it in this context. Could someone please elaborate a bit? To me that sounds like a logical combination of -r -np -p? Any correction appreciated.

Thanks
Jens
Re: Cookie support
Hi Andreas!

> [snip] Would it make sense to add basic cookie support to wget? [/snip]

AFAIK wGet already has cookie support - at least the 1.7 I use does. If this does not help you, then I did not understand your question. But I am sure there are smarter guys than me on the list! ;)

CU
Jens
http://www.JensRoesner.de/wGetGUI/
Re: referer question
Hi wgetters!

@André:
> Guys, you don't understand what the OP wants. He needs a dynamically
> generated referer, something like wget --referer 'http://%h/' where,
> for each URL downloaded, wget would replace %h by the hostname.

Well, I understood it this way. My problem was that I mainly use wGet and wGetGUI for downloads from !one! server. Therefore I did not think of the problem that arises when wget leaves the server for which wGetGUI puts in the --referer=starthost.

@Jan:
Sorry, the option in wGetGUI is called "Identify as browser"; what happens then is that wGetGUI sets !both! --referer and --user-agent! If I find time to write a user's manual, I will make this clear. Sorry for the confusion.

@Vladi:
OK, I know Windows sucks ;) But I am tooo lazy! BTW: I would like that --auto-referer, too ;) So go ahead! ;D

CU
Jens
Ever heard of that version?
Hi wGetters!

I just stumbled over http://bay4.de/FWget/ - are his changes incorporated into Wget 1.7? Any opinions on that software? I think with WinME *yuck* as OS this is out of the question for me, but...

CU
Jens
Re: Domain Acceptance question
Hi Mengmeng!

Thanks very much, I (obviously) was not aware of that! I'll see how I can incorporate that (-I/-X/-D/-H) into wGetGUI. Can I do something like

  wget -H -Dhome.nexgo.de -Ibmaj.roesner http://www.AudiStory.com

? I'll just give it a try. Thanks again!

Jens
More Domain/Directory Acceptance
Hi again!

I am trying to start from http://members.tripod.de/jroes/test.html (have a look). The first link goes to a site I do not want. The second link goes to a site that should be retrieved.

  wget -r -l0 -nh -d -o test.log -H -I/bmaj*/ http://members.tripod.de/jroes/test.html

does not work.

  wget -r -l0 -nh -d -o test.log -H -Dhome.nexgo.de -I/bmaj*/ http://members.tripod.de/jroes/test.html

does not either :( I also tried -Dhome.nexgo.de -I../bmaj.roesner/ with no success. The debug output (I know it is not a bug, but -d gives a lot of information) reads:

  parseurl ("http://home.nexgo.de/bmaj.roesner/")
  - host home.nexgo.de
  - opath bmaj.roesner/
  - dir bmaj.roesner
  - file
  - ndir bmaj.roesner
  http://home.nexgo.de:80/bmaj.roesner/ (bmaj.roesner) is excluded/not-included.
  http://home.nexgo.de:80/bmaj.roesner/ already in list, so we don't load.

How can it be done?

CU
Jens
http://www.jensroesner.de/wgetgui/
Re: More Domain/Directory Acceptance
Hi Ian, hi wgetters!

Thanks for your help!

> It didn't work for me either, but the following variation did:
> wget -r -l0 -nh -d -o test.log -H -I'bmaj*' http://members.tripod.de/jroes/test.html

Hm, did not for me :( Neither in 1.4.5 nor in the newest Windows binaries version I downloaded from Heiko. :(

> However, wget-1.7 dumps core with this, so you'll have to use the
> latest version from CVS.

Hm, what exactly do you mean by that? Is the version from Heiko young enough? Here is what the debug output reads with 1.7.1:

  DEBUG output created by Wget 1.7.1-pre1 on Windows.
  parseurl ("http://members.tripod.de/jroes/test.html")
  - host members.tripod.de
  - opath jroes/test.html
  - dir jroes
  - file test.html
  - ndir jroes
  newpath: /jroes/test.html
  --22:16:33-- http://members.tripod.de/jroes/test.html
    => `members.tripod.de/jroes/test.html'
  Connecting to members.tripod.de:80... Caching members.tripod.de -> 62.52.56.162
  Created fd 72.
  connected!
  ---request begin---
  GET /jroes/test.html HTTP/1.0
  User-Agent: Wget/1.7.1-pre1
  Host: members.tripod.de
  Accept: */*
  Connection: Keep-Alive
  ---request end---
  HTTP request sent, awaiting response... HTTP/1.1 200 OK
  Server: Apache/1.2.7-dev
  Set-Cookie: CookieStatus=COOKIE_OK; path=/; domain=.tripod.com; expires=Sat, 06-Jul-2002 10:17:26 MET
  cdm: 1 2 3 4 5 6
  Attempt to fake the domain: .tripod.com, members.tripod.de
  Set-Cookie: MEMBER_PAGE=jroes/test.html; path=/; domain=members.tripod.de
  cdm: 1 2 3

No file was written. I'll try it on another PC with another OS this weekend... But if you can give any advice already, that would be great!

CU
Jens
Re: More Domain/Directory Acceptance
Hi Ian and wgetters!

> Well, if you're running it from a DOS-style shell, get rid of the
> single quotes I put in there, i.e. try -Ibmaj*

Oh, I guess that was rather stupid of me. However, the Windows version will only work with -I/bmaj or -Ibmaj.roesner, not with anything like -I/bmaj* or -I/bmaj?roesner :( Reasons? (I also tried -Iaj.r, which also did not work.) Once again, here is the command line for all who want to give it a try:

  wget -nc -r -l0 -nh -d -o test.log -H -I/bmaj http://members.tripod.de/jroes/test.html

works like a blast. :) BTW:

  wget -nc -r -l0 -nh -d -o test.log -H -Dhome.nexgo.de -I/bmaj http://members.tripod.de/jroes/test.html

works, too. So you can restrict hosts and dirs on those hosts. (Imagine a bmaj dir on the tripod server, for example.) And this setup suits me just fine. But having more options is always a good thing, so: are there wildcards like * and ? in the Win32 version of wget?

CU
Jens
http://www.jensroesner.de/wgetgui