> Hi, I'm using httpcli to save a webpage html doc and I extract all of
> its image locations to a text file by saving the '<IMG SRC=' tags.
> Afterward I want to download all of the images, but how can I determine
> the TRUE location of the images? For example, say the image tag is:
> '<IMG SRC='test.com/photo.jpg'' - for all I know, "test.com" could just
> be a directory on the server or it could be the website. Another
> example, say the image tag is: '<IMG SRC='/photo.jpg'' - so the image is
> in the root directory of the website, but who knows what the root
> directory is? It may simply be 'test.com', or if the html doc is located
> in a subdirectory, it may be something like 'test.com/users/me'.
>
> So, what is the appropriate way to determine the actual true location of
> these images from the 'IMG' tags?

If the image URL starts with "/", it is an absolute path on the server. Just
prepend the website root URL and you have the image URL.
If the image URL doesn't start with "/" (and doesn't already begin with
"http://"), then it is a relative URL. You must prepend the URL of the page
where you found the image, excluding the document name itself.

Example: Assuming you are getting a page from
"http://www.mysite.com/docs/page.html".
If you find an image source URL as "/photo.jpg" then the complete URL is
"http://www.mysite.com/photo.jpg".
If you find an image with URL "test.com/photo.jpg" then the complete URL is
"http://www.mysite.com/docs/test.com/photo.jpg".
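
The two rules can be checked with a short sketch. This is Python rather than Delphi/ICS, and it only illustrates the resolution logic: it assumes the standard library's urllib.parse.urljoin, and the helper name resolve_image_url is made up for this example. It is not part of httpcli.

```python
from urllib.parse import urljoin

# Page URL taken from the example in this message.
PAGE_URL = "http://www.mysite.com/docs/page.html"


def resolve_image_url(page_url, src):
    """Resolve an IMG SRC value against the URL of the page it came from.

    Hypothetical helper for illustration only. urljoin handles a leading
    "/" (path relative to the site root), a plain relative path (relative
    to the page's directory), and a full URL (returned unchanged).
    """
    return urljoin(page_url, src)


if __name__ == "__main__":
    for src in ("/photo.jpg", "test.com/photo.jpg", "http://test.com/photo.jpg"):
        print(src, "->", resolve_image_url(PAGE_URL, src))
```

Note that a SRC such as "test.com/photo.jpg" is treated as a path relative to the page's directory, because it has no "http://" prefix.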


> but who knows what the root directory is?

The root directory is always easy to find. It is the URL from "http:" up to,
but not including, the first "/" after the host name. In my example above,
the root is simply "http://www.mysite.com".
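
As a rough illustration of extracting that root, here is a small Python sketch (again not Delphi/ICS). It assumes the standard library's urllib.parse.urlsplit; the function name site_root is invented for this example.

```python
from urllib.parse import urlsplit


def site_root(page_url):
    """Return the scheme and host part of a URL, without any path.

    Hypothetical helper for illustration only: it keeps everything from
    "http:" up to, but not including, the first "/" after the host name.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


if __name__ == "__main__":
    print(site_root("http://www.mysite.com/docs/page.html"))
```

An image SRC that starts with "/" can then be appended directly to the value this returns.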

--
francois.pie...@overbyte.be
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be
