Re: how to follow incorrect links?

2005-02-17 Thread jens . roesner
Hi Tomasz!

 There are some websites with backslashes instead of slashes in links.
 For instance : <a href="photos\julia.jpg">
 instead of   : <a href="photos/julia.jpg">
 Internet Explorer can repair such addresses.
My own assumption is: it repairs them because Microsoft
introduced that #censored# way of writing HTML.
Anyway, I know this will not help you.
I think you should email the webmaster and tell him/her 
about the errors.

 How to make wget follow such addresses?
I think it is impossible directly.

I can think of one workaround:
start wget -nc -r -l0 -p URL
after it finishes, replace all \ with / in the downloaded htm(l) files.
This will make the HTML files correct.
After that, start wget -nc -r -l0 -p URL again.
wget will now parse the downloaded and corrected HTML files instead of the
wrong files on the net.
Continue this procedure until wget does not download any more files.
I do not know how handy you are in your OS, but this should be doable with
one or two small batch files (a rough sketch follows below).
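
A rough sketch of that loop for a Unix-like shell (sed/find and the
directory name are my assumptions; on Windows the same idea fits into a
small batch file):

# 1st pass: mirror the site; -nc keeps already-downloaded files
wget -nc -r -l0 -p http://www.example.com/
# fix the links: turn every backslash into a forward slash in the local HTML
find www.example.com -name '*.htm*' -exec sed -i 's|\\|/|g' {} +
# 2nd pass: with -nc, wget re-parses the corrected local HTML files
# instead of fetching them again, and follows the now-valid links
wget -nc -r -l0 -p http://www.example.com/
# repeat the fix + wget steps until no new files arrive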

Maybe one of the pros has a better idea. :)

CU
Jens (just another user)



Re: -i problem: Invalid URL šh: Unsupported scheme

2005-02-13 Thread jens . roesner
Hi Mike!

 on the command line, but if I copy this to, e.g., test.txt and try
 wget -i test1.txt
Well, check
a) whether you really have test.txt or test1.txt
b) that your txt file contains only URLs starting with http:// - no options
c) that you save the txt file with a plain ASCII editor like Notepad - not
Wordpad (it may work with Wordpad, but Notepad is safer)
d) that you use -i and not -I (as you wrote in your first line - wget is
case-sensitive)
A small example follows below.
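
Something like this (URLs are placeholders) - test.txt contains nothing
but one URL per line:

http://www.example.com/index.html
http://www.example.com/pics/photo1.jpg

and is then called with:

wget -i test.txt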

 Does anyone else have this problem?
At least not me.

CU
Jens (just another user)



Re: Images in a CSS file

2005-01-31 Thread jens . roesner
Hi Deryck!

As far as I know, wget cannot parse CSS code
(nor JavaScript code).
It has been requested often, but so far no one
has tackled this (probably rather huge) task.

CU
Jens
(just another user)


 Hello,
 
 I can make wget copy the necessary CSS files referenced from a webpage
 but is it possible to  make it extract any images referenced from
 within the CSS file?
 
 Thanks
 



Re: Bug (wget 1.8.2): Wget downloads files rejected with -R.

2005-01-22 Thread jens . roesner
Hi Jason!

If I understood you correctly, this quote from the manual should help you:
***
Note that these two options [accept and reject based on filenames] do not
affect the downloading of HTML files; Wget must load all the HTMLs to know
where to go at all--recursive retrieval would make no sense otherwise.
***

If you are seeing wget behaviour different from this, please a) update your
wget and b) provide more details where/how it happens.
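
For illustration (host and extensions invented): with something like

wget -r -l1 -A.jpg,.gif http://www.example.com/gallery/

wget still fetches gallery/index.html and any other HTML pages it needs in
order to find the links, parses them, and deletes them afterwards because
they do not match the accept list; only the .jpg/.gif files are kept.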

CU & good luck!
Jens (just another user)



 When the -R option is specified to reject files by name in recursive mode,
 wget downloads them anyway then deletes them after downloading. This is a
 problem when you are trying to be picky about the files you are
downloading
 to save bandwidth. Since wget appears to know the name of the file it is
 downloading before it is downloaded (even if the specified URL is
redirected
 to a different filename), then it should not bother downloading the file
 at all if it is going to delete it immediately after downloading it.
 
 - Jason Cipriani
 



Re: get link Internal Server Error

2004-08-12 Thread jens . roesner
For me this link does NOT work in
IE 6.0
latest Mozilla
latest Opera

So I tested a bit further.
If you go to the site and reach 
http://www.interwetten.com/webclient/start.html
and then use the URL you provide, it works.
A quick check for stored cookies revealed that 
two cookies are stored.
So you have to use wget with cookies.
For info on how to do that, see the manual (a rough sketch follows below).
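
A rough sketch of what that could look like (the file name is an
assumption; check the manual of your wget version for the exact options):

# let the start page set its cookies and store them
wget --save-cookies cookies.txt \
  http://www.interwetten.com/webclient/start.html
# then request the offer page, sending those cookies back
wget --load-cookies cookies.txt \
  "http://www.interwetten.com/webclient/betting/offer.aspx?type=1&kindofsportid=10&L=EN"

If the site uses only session cookies, a wget version that supports
--keep-session-cookies may be needed on the first call.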

CU
Jens


 hi all:
 some links open fine in IE, but downloading them with wget goes
 wrong; I can't find a way around it, I think it may be a bug:
 example link:

http://www.interwetten.com/webclient/betting/offer.aspx?type=1&kindofsportid=10&L=EN
 this link opens fine in IE, but with wget I get this error:
 Connecting to www.interwetten.com[213.185.178.21]:80... connected.
 HTTP request sent, awaiting response... 500 Internal Server Error
 01:02:27 ERROR 500: Internal Server Error.
  henryluo
 




wget 1.9 for windows: no debug support?

2003-10-26 Thread jens . roesner
Hi!

While I was testing the new wget 1.9 binary for Windows from
http://space.tin.it/computer/hherold/
I noticed that it was slow if a URL specified within -i list.txt
did not exist: it would wait a long time before trying the next URL listed.

Well, to find out what was happening, I specified -d for debug output.
The message was:
debug support not compiled in
and wget continued with normal downloading.

Is this an oversight or does it serve a purpose?

CU
Jens






Re: -A acceptlist (not working right)

2002-01-24 Thread jens . roesner

Hi Samuel!

 You The Man!
Pssst! Don't tell everyone! :D

  Then I did
  wget -nc -x -r -l0 -p -np -t10 -k -nv -A.gif,.htm,.html http://URL
 This also worked. Then I began trying to figure out what the
 hell was wrong; I added .cfm to the list - it returned an
 empty folder... tried .shtml & phtml, got the same thing.
 Apparently it does not like those in the accept list.
 Try adding those, see what you get.
Hm, I don't have .cfm/.shtml files handy at the moment.
So I just added the extensions to the list and it worked the same as before.

 Then, I tried putting html at the end of my list, putting the
 other stuff in front of it, voila.
Strange, very strange.

 Also, your -R example worked great, but if I add a .mov to
 the end of the list, it nullifies all the other reject commands.
 If I move the .mov to the front, it works?
Seems to make no difference for me.
Maybe this is a bug that only shows up on your OS?
But I cannot imagine how this is possible.

 Questions:
 How can I direct the files into another directory other than
 /user/username
Try -Pdir1/dir2
or (works at least in Windows) -P../dir/dir2
../ goes up one level.
Currently -P does not allow a change of drive,
but I think the coders are working on that (right?)

 How can I just blow it all into one directory without the folder
 structure. filenames and structure are not a problem, I just need
 the data in the files.
try -nd (for no directories)
BTW, -nd is the default (so to speak) for single files
(wget http://host.com/page.html)
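
Rough examples of both (paths and URL are placeholders):

# put the whole download under ./mydownloads/audi instead of the current dir
wget -r -l0 -P mydownloads/audi http://www.example.com/
# dump every file into one flat directory, no host/path subfolders
wget -r -l0 -nd -P mydownloads/flat http://www.example.com/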

 and nights...  Have a virtual drink on me.
Cheers! :)

CU
Jens

http://www.jensroesner.de/wgetgui/





Re: A strange bit of HTML

2002-01-16 Thread jens . roesner

Hi there!

 <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a
 href=66B27885.htm msover1('Pic1','thumbnails/MO66B27885.jpg');
 onMouseOut=msout1('Pic1','thumbnails/66B27885.jpg');><img
 SRC=thumbnails/66B27885.jpg NAME=Pic1 BORDER=0></a></td>

BTW: it is valign=middle :P
(I detest AllCaps and property=value instead of property="value".)

 That sounds like they wanted onMouseOver=msover1(...)
 It's also likely that msover1 is a Javascript function :-(
Definitively, I would say.


 I can't call this a bug, but is Wget doing the right thing by
 ignoring the href altogether?
 Until there's an ESP package that can guess what the author intended,
 I doubt wget has any choice but to ignore the defective tag. 
*g*
Seriously, I think you guys are too strict.
Similar discussion have spawned numerous times.
If the HTML code says 
a href=URL yaddayada my-Mother=Shopping%5 goingsupermarket/a
Why can't wget just ignore everything after ...URL?
Is there any instance where this would create unwanted behaviour 
for the user? It does not matter if there is a javascript call, a broken
CSS file, or the webmaster has bad breath.
Now, if a mouseover picture is loaded, 
wget cannot retrieve it anyway, no matter if the javascript 
is correct or malformed, right?

 In addition,
 wget should send an email to webmaster@offending domain,
 complaining about the invalid HTML :-)
/me signs this petition!
In addition, mails should be written for bad (=unreadable) 
combos of font colour and background colour, 
animated gifs and blink tags.

Kind regards
Jens





Re: A strange bit of HTML

2002-01-16 Thread jens . roesner

Hi Hrvoje!

First, I did/do not mean to offend/attack you, 
just in case that my suspicion about you being 
pi55ed because of my post is not totally unjustified.

  If the HTML code says 
  <a href=URL yaddayada my-Mother=Shopping%5 going>supermarket</a>
  Why can't wget just ignore everything after ...URL?
 
 Because, as he said, Wget can parse text, not read minds.  
Ah *slapsforehead* /me stupid.

 For
 example, you must know where a tag ends to be able to look for the
 next one, or to find comments.  It is not enough to look for '>' to
 determine the tag's ending -- something like <img alt="my > dog"
 src="foo"> is a perfectly legal tag.
okok, granted, to dissolve
<a href=foo.html target=_top><img src=pic.htm.jpg name=index.html
alt="oops<br>->fool.htm<-"></a>
for example, you'd really have a hard time, I suppose.
I honestly did not think of people messing with < and >.

 As for us being strict, I can only respond with a mini-rant...
 Wget doesn't create web standards, but it tries to support them.
 Spanning the chasm between the standards as written and the actual
 crap generated by HTML generators feels a lot like shoveling shit.
[rant name=my rant]
Ah, tell me about it. Although I come from the other side 
(Trying to write my sites -with a text editor- so that they look ok on
different browsers and remain HTML compliant) I surely know how much 'fun'
it can be to work with standards.
Especially if they were set by a committee as intelligent and just (as in
justice) as the W3C...
BTW, as an engineering student I am fully aware how much 
help good standards can be.
[/rant]


 Some amount of shoveling is necessary and is performed by all small
 programs to protect their users, but there has to be a point where you
 draw the line.  There is only so much shit Wget can shovel.
Unfortunately, the amount of shit on the web will not decrease.
I fear that the opposite may be true.
no, wait, I am pretty sure...

 I'm not saying Ian's example is where the line has to be drawn.  (Your
 example is equivalent to Ian's -- Wget would only choke on the last
 "going" part).  But I'm sure that the line exists and that it is not
 far from those two examples.
Ok, but do I understand you correctly that these two examples (mine was
intended to be equivalent, but without JS) should be on the "parse and
retrieve" side of this line, not the "ignore and blame Frontpage" side?

CU
Jens





Re: -H suggestion

2002-01-15 Thread jens . roesner

Hi!

Once again I think this does not really belong on the bug list, but there
you go:

 I've toyed with the idea of making a flag to allow `-p' span hosts
 even when normal download doesn't.

Funny you mention this.
When I first heard about -p (1.7?) I expected exactly that behaviour by
default.
I think it would be really useful if the page requisites could come from
wherever they happen to live. I mean, -p already ignores -np (since 1.8?),
which I think is also very useful.

  The -i switch provides for a file listing the URLs to be downloaded.
  Please provide for a list file for URLs to be avoided when -H is
  enabled.
 
 URLs to be avoided?  Given that a URL can be named in more than one
 way, this might be hard to do.
 
Sorry, but does --reject-host (or similar, I don't have the docs here ATM)
not do exactly this? I may well be missing the point here.
But by disallowing hosts and dirs you should be able to do this.
Or is the problem loading the lists from an external file?
If so, please ignore my comment; I have no experience with that.

CU
Jens





What was that?

2001-11-23 Thread Jens Roesner

Hi guys!

Today I found something strange.
First, I am using MS WinME (okok) and Netscape (okok)
I was downloading a 3000kB zip file from a page on audi.com via
right-click.
After (!) I had finished downloading, I thought "Hey, why not use
wGet(GUI) for it."
Smart, huh? One file, already downloaded...
But there was another file I wanted to try in the first place.
That did not work, because it was streaming mov :(
(I guess that would be very difficult to implement?)
Anyway, when wGet was downloading the zip I had already downloaded (to
another directory on another drive), the average speed was 850kb/sec!
No, I am !not! sitting in the LAN of Audi.de or anything!
I have fast ethernet LAN access through uni, but I am in New Zealand, so
I highly doubt such a speed is possible.
Anyway, the file is there, it works, it is strange.
Can anyone explain to a stupid Windows user what happened there?
Does wget access Netscape's internal cache? (Nahh, can't be...)
I cannot provide you with a debug or -v log file. :(

CU
Jens  *still confused

http://www.jensroesner.de/wgetgui/



Re: What was that? Proxy!

2001-11-23 Thread jens . roesner

Hi guys!

Yes, you all are right.
Proxy is the answer. I feel stupid now.
/me goes to bed, maybe that helps! :|
Thanks anyway! :)

Until the next intelligent question :D
CU
Jens


Man, I really hate ads like the following:

-- 
GMX - Die Kommunikationsplattform im Internet.
http://www.gmx.net




Re: bug?

2001-11-22 Thread Jens Roesner

Hi Tomas!

 I see, but then, how do I exclude files from being downloaded on a per-file basis?
First, let me be a smartass:
Go to 
http://www.acronymfinder.com
and look up 
RTFM
Then, proceed to the docs of wget.
wget offers download restrictions on
host, directory and file name.
Search in the docs for the following (a combined example follows below the list):
-H
-D
--exclude-domains
`-A ACCLIST' `--accept ACCLIST' `accept = ACCLIST'
`-R REJLIST' `--reject REJLIST' `reject = REJLIST'
`-I LIST' `--include LIST' `include_directories = LIST'
`-X LIST' `--exclude LIST' `exclude_directories = LIST'
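
A rough combined example (host, directories and extensions invented):

# recurse, keep only JPEGs and HTML, skip the /cgi-bin/ and /banners/ trees
wget -r -l0 -A.jpg,.htm,.html -X /cgi-bin,/banners http://www.example.com/
# or the other way around: take everything except movies and archives
wget -r -l0 -R.mov,.zip http://www.example.com/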

CU
Jens

http://www.jensroesner.de/wgetgui



Re: meta noindex

2001-10-29 Thread Jens Roesner

Hi Tomas!

 Thanks a lot, but unfortunately that didn't work...
 I just do a simple:
 wget -r http://localhost
 And my unwanted file is included all the time...
Hm :(

 Have you had it to work with 1.7 or are you using the CVS-version?
CVS? No, I am just a stupid Windows user with some binaries from Heiko
;)
Are you sure your wgetrc is recognized, that it is in the right
directory?

From my experience I have also noticed that some servers block wGet
because it is not a browser.
That is why I added an option to wGetGUI that lets the user have wGet
ignore robots.txt and identify itself as a Mozilla browser.
That normally should work.
Maybe you also have to try both at the same time for your problem?
(A sketch of both follows below.)
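
Something along these lines (the agent string is only an example):

# in your wgetrc:
robots = off

# on the command line, identify as a browser:
wget -r -U "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)" http://localhost/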

Right now I am a bit puzzled about what you meant by
"I can't get wget 1.7 to react on the following:"
I thought you wanted wGet to ignore robots?!
Correct?

Good luck
Jens
http://www.jensroesner.de/wgetgui



 
 -Original Message-
 From: Jens Roesner [mailto:[EMAIL PROTECTED]]
 Sent: 29 October 2001 23:21
 To: Tomas Hjelmberg
 Subject: Re: meta noindex
 
 Hi Tomas!
 
 Put
 robots = off
 in your wgetrc
 You cannot use it in the command line if I am not mistaken.
 I think it was introduced in 1.7 so you should have no problems.
 
 Good luck
 Jens
 http://www.jensroesner.de/wgetgui
 
 Tomas Hjelmberg schrieb:
 
  Hi,
  I can't get wget 1.7 to react on the following:
  <html>
  <head>
  <META name="robots" content="noindex, nofollow">
  </head>
  ...
  </html>
 
  Cheers /Tomas



Re: Recursive retrieval of page-requisites

2001-10-09 Thread Jens Roesner

Hi wGetters!

I want to download a subtree of HTML documents from a foreign site
   complete with page-requisites for offline viewing.
I.e. all HTML pages from this point downwards, PLUS all the images(etc.)
   they refer to -- no matter where they are in the directory tree
  This is cheating, 

What does "cheating" mean here?
I know the meaning of the word, of course, but I do not understand it in
this context.
Could someone please elaborate a bit?
To me that sounds like a logical combination of -r -np -p?
Any correction appreciated.
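
For what it's worth, the combination I mean would look roughly like this
(URL invented, -k added only for offline link rewriting):

wget -r -np -p -k http://www.example.com/docs/subtree/index.html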

Thanks
Jens



Re: Cookie support

2001-09-20 Thread Jens Roesner

Hi Andreas!

AFAIK wGet has cookie support -
at least the 1.7 I use does.
If this does not help you, I did not understand your question.
But I am sure there are smarter guys than me on the list! ;)

CU
Jens

http://www.JensRoesner.de/wGetGUI/


[snip]
 Would it make sense to add basic cookie support to wget?
[/snip]



Re: referer question

2001-09-13 Thread Jens Roesner

Hi wgetters!

@André
 Guys, you don't understand what the OP wants. He needs a
 dynamically generated referer, something like
   wget --referer 'http://%h/'
 where, for each URL downloaded, wget would replace %h by the
 hostname.
Well, I understood it this way.
My problem was that I mainly use
wGet and wGetGUI for downloads from !one! server.
Therefore I did not think of the problem that arises when wget leaves the
server whose address wGetGUI puts into --referer=starthost.

@Jan:
Sorry, the wGetGUI option is called "Identify as browser"; what happens
then is that wGetGUI sets !both! --referer and --user-agent!
If I find time to write a user's manual, I will make this clear.
Sorry for the confusion.

@Vladi
Ok, I know Windows sucks ;) But I am tooo lazy!
BTW: I would like that --auto-referer, too ;)
So go ahead! ;D

CU
Jens



Ever heard of that version?

2001-07-16 Thread Jens Roesner

Hi wGetters!

I just stumbled over 
http://bay4.de/FWget/
Are his changes incorporated into Wget 1.7?
Any opinions on that software?
I think with WinME *yuck* as OS, this is out of the question for me, 
but...

CU
Jens



Re: Domain Acceptance question

2001-07-06 Thread Jens Roesner

Hi Mengmeng!

Thanks very much, I (obviously) was not aware of that!
I'll see how I can incorporate that (-I/-X/-D/-H) in wGetGUI.
Can I do something like -H -Dhome.nexgo.de -Ibmaj.roesner
http://www.AudiStory.com ?
I'll just give it a try.

Thanks again!
Jens



More Domain/Directory Acceptance

2001-07-06 Thread Jens Roesner

Hi again!

I am trying to start from
http://members.tripod.de/jroes/test.html
(have a look)
The first link goes to a site I do not want. 
The second link goes to a site that should be retrieved.

wget -r -l0 -nh -d -o test.log -H -I/bmaj*/
http://members.tripod.de/jroes/test.html

does not work.

wget -r -l0 -nh -d -o test.log -H -Dhome.nexgo.de -I/bmaj*/
http://members.tripod.de/jroes/test.html

does not work either :(
I also tried -Dhome.nexgo.de -I../bmaj.roesner/, with no success.

The debug output (I know it is not a bug, but the debug log gives a lot of
information) reads:

**
parseurl("http://home.nexgo.de/bmaj.roesner/") -> host home.nexgo.de ->
opath bmaj.roesner/ -> dir bmaj.roesner -> file  -> ndir bmaj.roesner
http://home.nexgo.de:80/bmaj.roesner/ (bmaj.roesner) is
excluded/not-included.
http://home.nexgo.de:80/bmaj.roesner/ already in list, so we don't load.
**

How can it be done?

CU
Jens

http://www.jensroesner.de/wgetgui/



Re: More Domain/Directory Acceptance

2001-07-06 Thread Jens Roesner

Hi Ian, hi wgetters!

Thanks for your help!

 It didn't work for me either, but the following variation did:
 wget -r -l0 -nh -d -o test.log -H -I'bmaj*' http://members.tripod.de/jroes/test.html
Hm, it did not work for me :( - neither in 1.4.5 nor in the newest Windows
binary version I downloaded from Heiko. :(

 However, wget-1.7 dumps core with this so you'll have to use the
 latest version from CVS.
Hm, what exactly do you mean by that? Is the version from Heiko recent
enough?

Here is what the debug output reads with 1.7.1:

*
DEBUG output created by Wget 1.7.1-pre1 on Windows.

parseurl ("http://members.tripod.de/jroes/test.html") -> host
members.tripod.de -> opath jroes/test.html -> dir jroes -> file
test.html -> ndir jroes
newpath: /jroes/test.html
--22:16:33--  http://members.tripod.de/jroes/test.html
   => `members.tripod.de/jroes/test.html'
Connecting to members.tripod.de:80... Caching members.tripod.de -
62.52.56.162
Created fd 72.
connected!
---request begin---
GET /jroes/test.html HTTP/1.0

User-Agent: Wget/1.7.1-pre1

Host: members.tripod.de

Accept: */*

Connection: Keep-Alive



---request end---
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Server: Apache/1.2.7-dev
Set-Cookie: CookieStatus=COOKIE_OK; path=/; domain=.tripod.com;
expires=Sat, 06-Jul-2002 10:17:26 MET
cdm: 1 2 3 4 5 6Attempt to fake the domain: .tripod.com,
members.tripod.de
Set-Cookie: MEMBER_PAGE=jroes/test.html; path=/;
domain=members.tripod.de
cdm: 1 2 3

**
No file was written.
I'll try it on another PC with another OS this weekend...
But if you can give any advice already, that would be great!

CU
Jens



Re: More Domain/Directory Acceptance

2001-07-06 Thread Jens Roesner

Hi Ian and wgetters!

 Well if you're running it from a DOS-style shell, get rid of the
 single quotes I put in there, i.e. try -Ibmaj*
Oh, I guess that was rather stupid of me.
However, the Windows version will only work with
-I/bmaj or -Ibmaj.roesner, not with anything like -I/bmaj* or
-I/bmaj?roesner
:(
Reasons?
(I also tried -Iaj.r, which did not work either...)

Once again here is the command line for all wanting to give it a try:
wget -nc -r -l0 -nh -d -o test.log -H -I/bmaj
http://members.tripod.de/jroes/test.html
works like a charm. :)
BTW:
wget -nc -r -l0 -nh -d -o test.log -H -Dhome.nexgo.de -I/bmaj
http://members.tripod.de/jroes/test.html
works, too. So you can restrict hosts and dirs on that host. (Imagine a
bmaj dir on the tripod server, for example.)
And this setup suits me just fine. But having more options is always a
good thing, so, are there wildcards like * and ? in the Win32 version of
wget?

CU
Jens

http://www.jensroesner.de/wgetgui