> Hi, I'm using httpcli to save a webpage html doc and I extract all of
> it's image locations to a text file by saving the '<IMG SRC=' tags.
> Afterward I want to download all of the images, but how can I determine
> the TRUE location of the images? For example, say the image tag is:
> '<IMG SRC='test.com/photo.jpg'' - for all I know, "test.com" could just
> be a directory on the server or it could be the website. Another
> example, say the image tag is: '<IMG SRC='/photo.jpg'' - so the image is
> in the root directory of the website, but who knows what the root
> directory is? It may simply be 'test.com', or if the html doc is located
> in a subdirectory, it may be something like 'test.com/users/me'.
>
> So, what is the appropriate way to determine the actual true location of
> these images from the 'IMG' tags?

If the image URL starts with "/", it is an absolute path on the server. Just prepend the website root URL and you have the image URL. If the image URL doesn't start with "/" (and isn't already a complete URL starting with "http://"), it is a relative URL. You must prepend the URL of the page where you found the image, excluding the document name itself.

Example:
Assuming you are getting a page from "http://www.mysite.com/docs/page.html":
If you find an image source URL "/photo.jpg", the complete URL is "http://www.mysite.com/photo.jpg".
If you find an image with URL "test.com/photo.jpg", the complete URL is "http://www.mysite.com/docs/test.com/photo.jpg".

> but who knows what the root directory is?

The root directory is always easy to find. It is the URL starting from "http:" up to, but not including, the first "/" after the host name. In my example above, the root is simply "http://www.mysite.com".

--
francois.pie...@overbyte.be
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be
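
As a postscript, the rules above can be sketched in a few lines. This is only an illustration in Python, not ICS/Delphi code. It uses the standard library's urljoin and urlsplit, and the helper names resolve_image_url and site_root are made up for this example. It assumes the page has no <BASE HREF> tag, which would change the URL that relative paths are resolved against.

```python
from urllib.parse import urljoin, urlsplit


def resolve_image_url(page_url, img_src):
    """Resolve an IMG SRC value against the URL of the page it came from.

    urljoin handles the three cases: a leading "/" (path from the site
    root), a relative path (resolved against the page's directory), and
    a complete URL (returned unchanged).
    """
    return urljoin(page_url, img_src)


def site_root(page_url):
    """Return the scheme and host part of a URL, without a trailing slash."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


if __name__ == "__main__":
    page = "http://www.mysite.com/docs/page.html"
    for src in ("/photo.jpg", "test.com/photo.jpg", "http://test.com/photo.jpg"):
        print(src, "->", resolve_image_url(page, src))
    print("root:", site_root(page))
```

Note that "test.com/photo.jpg" is treated as a directory named "test.com" below the page's directory. Only a source that includes the "http://" prefix refers to another web site.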