Re: [twsocket] HTTPcli: source path question

Zvone Wed, 08 Sep 2010 06:23:06 -0700

> Well, then I have a question: maybe you have some ideas of how to organize 
> recursive download: for example, if user started to download 
> www.example.com/path/index.html, we should also accept 
> www.example.com/path/logo.jpg and so on, but not www.example.com/index.php. 
> If user started www.example.com/path/foo, we should accept 
> www.example.com/path/foo/index.php but NOT www.example.com/path/bar.jpg.
> Applications like Wget do support this behavior but the question is how they 
> do it.


An HTTP reply consists of headers and a document body. In the headers
you can find useful info about the type of the document being served
(for example Content-Type).
Wget uses this info to determine the filename and as a hint for the
directory structure. It parses HTML, but not in order to build a
folder structure. Rather, it creates a browsable structure that you
can open in your web browser.
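
As a rough illustration of using the headers to pick a filename, here
is a minimal Python sketch. The function name local_path_for and the
rule of appending ".html" to text/html documents are my own
assumptions (similar in spirit to wget's --adjust-extension option),
not code taken from Wget or ICS.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def local_path_for(url, content_type):
    """Map a URL plus its Content-Type header to a relative local path.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    path = unquote(parts.path) or "/"
    # A URL ending in "/" names a directory; store the document as index.html.
    if path.endswith("/"):
        path += "index.html"
    # Drop parameters such as "; charset=utf-8" from the header value.
    mime = content_type.split(";", 1)[0].strip().lower()
    ext = posixpath.splitext(path)[1].lower()
    # HTML served under another extension (e.g. .php) gets ".html" appended
    # so that a browser opens the saved file as a page.
    if mime == "text/html" and ext not in (".html", ".htm"):
        path += ".html"
    return parts.netloc + path


print(local_path_for("http://www.example.com/path/foo/index.php",
                     "text/html; charset=utf-8"))
```

A real downloader would also have to reject ".." segments and
characters that are not valid in local filenames.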

Basically, for each document you receive you have to scan for <a
href="link"> links (and possibly also CSS-based links) and organize
them into a folder structure internally in your program. You also
need to check for a <base> element in the HTML head: if it exists,
relative links are resolved against its href rather than against the
URL of the document.
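
A minimal Python sketch of that scanning step, combined with the
"stay below the starting directory" rule from the question. All names
here (LinkCollector, scope_of, in_scope, extract_links) are
hypothetical. The rule that a last path segment without a dot is a
directory is an assumption chosen to match the examples in the
question; I am not claiming this is how Wget decides it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urldefrag


class LinkCollector(HTMLParser):
    """Collect raw href/src values and the first <base href> of a page."""

    def __init__(self):
        super().__init__()
        self.base = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href") and self.base is None:
            self.base = attrs["href"]
        elif tag in ("a", "link") and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("img", "script") and attrs.get("src"):
            self.links.append(attrs["src"])


def scope_of(start_url):
    """Return (host, directory prefix) that the download must stay inside."""
    parts = urlsplit(start_url)
    path = parts.path
    last = path.rsplit("/", 1)[-1]
    if "." in last:
        # Assumption: "index.html" is a file, so the scope is its directory.
        path = path[: len(path) - len(last)]
    elif not path.endswith("/"):
        # Assumption: "/path/foo" names a directory.
        path += "/"
    return parts.netloc.lower(), path


def in_scope(url, scope):
    host, prefix = scope
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.netloc.lower() != host:
        return False
    # The trailing "/" keeps "/path/foobar" from matching prefix "/path/foo/".
    return (parts.path.rstrip("/") + "/").startswith(prefix)


def extract_links(html, page_url, scope):
    """Return absolute, de-duplicated, in-scope URLs found in html."""
    collector = LinkCollector()
    collector.feed(html)
    # <base href> replaces the page URL as the reference for relative links.
    base = urljoin(page_url, collector.base) if collector.base else page_url
    found = []
    for href in collector.links:
        absolute = urldefrag(urljoin(base, href)).url
        if in_scope(absolute, scope) and absolute not in found:
            found.append(absolute)
    return found


start = "http://www.example.com/path/index.html"
page = '<a href="logo.jpg">logo</a> <a href="/index.php">home</a>'
print(extract_links(page, start, scope_of(start)))
```

Each accepted URL would then go into the download queue, and its own
HTML would be scanned the same way. Links inside CSS are not handled
by this sketch.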

To create a browsable structure, the <a href> links in downloaded
documents sometimes need to be rewritten as well, so that they point
to the local copies.
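
A small Python sketch of how such a rewrite could be computed. The
helpers to_local and relative_link are hypothetical, and the local
layout (host name as the top folder, index.html for directory URLs) is
an assumption. Links to documents that were not downloaded are turned
into absolute URLs so they still work from the local copy.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def to_local(url):
    """Assumed local layout: <host>/<path>, with index.html for directories."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


def relative_link(page_url, href, downloaded):
    """Return the value to write into href for the saved copy of page_url."""
    target = urljoin(page_url, href)
    target_url, _, fragment = target.partition("#")
    if target_url not in downloaded:
        # Not mirrored: point at the original site instead.
        return target
    # Leading "/" makes both paths absolute so relpath ignores the cwd.
    start = "/" + posixpath.dirname(to_local(page_url))
    rel = posixpath.relpath("/" + to_local(target_url), start)
    return rel + ("#" + fragment if fragment else "")


downloaded = {"http://www.example.com/path/logo.jpg"}
print(relative_link("http://www.example.com/path/foo/index.html",
                    "/path/logo.jpg", downloaded))
```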
