Re: Feature Request: Directory URL's and Mime Content-Type Header
Thought a little more about this issue. The convert-links option is what appends index.html to every URL ending in a directory in the web pages I'm downloading. I need this option to strip the information about the original web server I'm creating the mirror from. However, if convert-links could somehow be made (via a command-line option, whatever) not to append that index.html file name onto the end of the URLs, that would be a better work-around.

I consider it a work-around because the file created on the file system would still be called index.html. But this wouldn't be as readily apparent to users of the mirror. And it would be fairly easy, after downloading your mirror and before uploading it, to run a script that renames the files you know should be renamed. That is my circumstance; I imagine some other people would have to investigate a little more after every download to find out which file names should be changed. I don't know if this is easier than creating a file name based on the Content-Type header in the wget sources.

Also, the convert-links option is documented as making it easy to view web pages on local filesystems, which isn't exactly how I'm using it, or discussing using it, in this message.

-- Levander's Yabbering! http://home.mindspring.com/~levander
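The post-processing step described above could look roughly like the following. This is only a sketch: the names `strip_index_html` and `rewrite_tree` are made up for illustration, the regular expression handles only quoted `href`/`src` attributes, and nothing here is part of wget.

```python
import re
from pathlib import Path

# Quoted href/src values that end in "index.html", optionally followed
# by a ?query or #fragment. Unquoted attributes are not handled.
_INDEX_LINK = re.compile(
    r'''(?P<attr>\b(?:href|src)\s*=\s*)(?P<quote>["'])'''
    r'''(?P<prefix>[^"'#?]*?)index\.html(?P<tail>[#?][^"']*)?(?P=quote)''',
    re.IGNORECASE,
)


def strip_index_html(html: str) -> str:
    """Drop a trailing index.html from links, leaving the directory URL."""

    def repl(match):
        prefix = match.group("prefix")
        # Only touch links where index.html is a whole path segment,
        # so "myindex.html" is left alone.
        if prefix and not prefix.endswith("/"):
            return match.group(0)
        target = prefix or "./"
        tail = match.group("tail") or ""
        quote = match.group("quote")
        return f'{match.group("attr")}{quote}{target}{tail}{quote}'

    return _INDEX_LINK.sub(repl, html)


def rewrite_tree(root: str) -> None:
    """Apply strip_index_html to every .html file below root (in place)."""
    for page in Path(root).rglob("*.html"):
        original = page.read_text(encoding="utf-8")
        rewritten = strip_index_html(original)
        if rewritten != original:
            page.write_text(rewritten, encoding="utf-8")
```

Links rewritten this way rely on a web server resolving the directory to its index file, so they suit an uploaded mirror rather than browsing the files straight from disk.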
Re: Feature Request: Directory URL's and Mime Content-Type Header
Okay, whatever y'all think is best. I've got a work-around working for me, but as mentioned previously, it could be more seamless.

It seems to me, though, that with the solution you recommend it would be necessary in most circumstances to change this file name "by URL". Meaning, the option to determine the default file name wouldn't be "global" to all URLs. Because:

1.) if you're only downloading one file (as in my work-around), you can already specify the file name to be created on the command line via the output-document option. And,

2.) if you're downloading recursively, in most cases it seems that the default file name (meaning the local file name created for URLs ending in a slash) would be index.html, as wget is currently written. And any exceptions to this rule shouldn't override what is the vast majority of the cases.

I base these opinions on the idea that as the web grows, there will be more and more document types processed by the client and not on the server. I actually mentioned this issue on the #wordpress channel on freenode, where several of the wordpress developers hang out. They seemed pretty emphatic that URLs should leave out any final index.html or index.xml, as it's superfluous information in the URL and makes the URL more difficult to read and remember. They even pointed me to a couple of docs on the web recommending this approach. The reasoning was similar to another recommendation I saw on those pages: not using tildes in URLs, because average users who have no exposure to UNIX may not recognize the character when copying URLs by hand.

Of course, wget could probably stand to wait for these files to become more prevalent before implementing this functionality. Wget is a great tool, and probably has a long future of use for web development. I just think implementing these features sooner rather than later would make transitioning to these features more seamless for its users. That said, I imagine this probably isn't a much demanded feature these days.

Also, you mention wanting to create the file name before downloading. I'm not sure why making the file name only after the headers arrive would be difficult. Maybe there's just a lot of code that a file name extension would have to be passed back through to get to the point in the code where a file name should be created? Creating a default file name sounds like a simple enough process. But I've never looked at the code myself.

-Levander

Hrvoje Niksic wrote:
> As currently written, Wget really prefers to determine the file name based on the URL, before the download starts (redirections are sort of an exception here). It would be easy to add an option to change "index.html" to "index.xml" or whatever you desire, but it would be much harder to determine the file name only after the headers arrive.

-- Levander's Yabbering! http://home.mindspring.com/~levander
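For the single-file case in point 1, a small wrapper around wget's `--output-document` option might look like the sketch below. The helper names, the `index.xml` default, and the assumption that the mirror root corresponds to the site root are all illustrative and not taken from the thread.

```python
import os
import subprocess
from urllib.parse import urlsplit


def feed_destination(feed_url, mirror_root, file_name="index.xml"):
    """Local path for the feed, assuming mirror_root maps to the site root."""
    url_path = urlsplit(feed_url).path.strip("/")
    return os.path.join(mirror_root, url_path, file_name)


def feed_command(feed_url, destination):
    """Argument list for a single-file wget download to a chosen file name."""
    return ["wget", "--output-document=" + destination, feed_url]


def fetch_feed(feed_url, mirror_root):
    """Create the target directory, then run wget for the feed alone."""
    destination = feed_destination(feed_url, mirror_root)
    # The target directory is created up front rather than relying on wget.
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    subprocess.run(feed_command(feed_url, destination), check=True)
    return destination
```

Calling `fetch_feed("http://example.com/feed/", "mirror")` would then place the feed at `mirror/feed/index.xml` on a POSIX system, provided wget is installed and the URL is reachable.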
Re: Feature Request: Directory URL's and Mime Content-Type Header
As currently written, Wget really prefers to determine the file name based on the URL, before the download starts (redirections are sort of an exception here). It would be easy to add an option to change "index.html" to "index.xml" or whatever you desire, but it would be much harder to determine the file name only after the headers arrive.
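To make the distinction concrete, here is a simplified sketch of naming a local file from the URL alone, with the default index name exposed as a parameter. It is not Wget's actual code: the function name `local_path_for`, the `default_index` parameter, and the host-directory layout are assumptions for illustration.

```python
from urllib.parse import urlsplit


def local_path_for(url: str, default_index: str = "index.html") -> str:
    """Derive a local file path from the URL only, before any download.

    URLs whose path ends in a slash (or is empty) get default_index
    appended, because a directory URL carries no file name of its own.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += default_index
    # Assumed layout: one directory per host, then the URL path.
    return parts.netloc + path
```

Everything needed is known before the request is sent, which is why a configurable default is the easy change. A name that depends on the Content-Type header can only be chosen once the response has started to arrive.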
Re: Feature Request: Directory URL's and Mime Content-Type Header
Jens Rösner wrote:
>> apache. Could wget, for url's that end in slashes, read the content-type header, and if it's text/xml, could wget create index.xml inside the directory wget creates?
>
> Don't you mean "create index.html"?

No, maybe I wasn't clear. The file wget is creating is called index.html. However, for my purposes, this is inappropriate. What I need created for my blog rss feed is an index.xml file. And if the wget convert-links option is going to add a file name onto the URLs in other pages on my site that link to the blog rss feed, then it would need to add the file name index.xml to these URLs.

Like I said, for all the computer programs (browsers, news aggregators), I'm guessing index.html would work fine. But when a human sees a link to an html file, it's going to confuse him: he may think he's not supposed to feed that link to a news aggregator, and that he's supposed to click on it and look at the page to find the rss feed.

-- Levander's Yabbering! http://home.mindspring.com/~levander
Re: Feature Request: Directory URL's and Mime Content-Type Header
Hi Levander!

I am not an expert by any means, "just another user", but what does the -E option do for you?
-E = --html-extension

> apache. Could wget, for url's that end in slashes, read the
> content-type header, and if it's text/xml, could wget create index.xml
> inside the directory wget creates?

Don't you mean "create index.html"?

CU
Jens
Feature Request: Directory URL's and Mime Content-Type Header
Created a static mirror of my blog using wget to download the files. There was one wget feature I had to work around. When wget is downloading recursively and finds a URL that ends in a slash, for example http://example.com/html/, wget will create a directory locally and save the contents of the given url to a file called index.html.

This worked fine for me in every case but one: the rss feed of my blog. I'm using Wordpress 1.5 to generate my blog, and by default Wordpress makes the rss feed url like http://example.com/feed/. And then they usually have apache configured to search for index.xml as well as index.html, so apache finds index.xml and sends it down to wget. Wget creates a directory and stores apache's resulting web page in that directory in a file called index.html. And note, I believe because I'm using the wget convert-links option, the page that linked to the rss feed now has a link to http://example.com/feed/index.html.

I think wget's current behavior would work fine, the only problem being my blog feed would now have an html suffix instead of an xml suffix. I *think* the news aggregators would be fine with this. However, it would confuse humans, as html is usually a human-viewable format, while xml suffixes are for computer programs.

This could possibly be changed by reading the Content-Type header that comes back from the apache server when requesting this xml file. I don't know if it's true in every circumstance, but I did check the headers that Wordpress generates, and a text/xml content type is returned from apache. Could wget, for url's that end in slashes, read the content-type header, and if it's text/xml, could wget create index.xml inside the directory wget creates?

I've already got a work-around that works 95% for me in my circumstances, and 100% for the public viewers of the static mirror of my blog. I basically change the link of the rss feed inside Wordpress so that it's the same as what the link should look like after the blog is already mirrored. Then, when wget finds that link to the rss feed in one of the web pages it has downloaded, it ignores the rss feed link because the mirror is on a different machine than the blog itself. I have to run wget a second time to put the rss feed xml file in the right place in the mirror I downloaded with wget. Then I upload all the mirror files via ftp to the public web server.

Only problem is I can't test things with the rss feed on my local machine. I have to go into wordpress and change the rss feed link every time I want to test something locally.

Please CC me with all responses, I'm not subscribed to this mailing list.

Thanks.

-- Levander's Yabbering! http://home.mindspring.com/~levander
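The behavior being requested can be sketched as a small lookup from the response's Content-Type to an index file name. This is a hypothetical illustration, not something wget does: the table `INDEX_NAME_BY_TYPE`, its entries beyond text/xml, and both function names are assumptions.

```python
from urllib.parse import urlsplit

# Assumed mapping; only text/xml -> index.xml comes from the request itself.
INDEX_NAME_BY_TYPE = {
    "text/html": "index.html",
    "text/xml": "index.xml",
    "application/xml": "index.xml",
    "application/rss+xml": "index.xml",
}


def index_name_for(content_type, fallback="index.html"):
    """Pick an index file name from a Content-Type header value."""
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return INDEX_NAME_BY_TYPE.get(media_type, fallback)


def local_path(url, content_type):
    """Local path relative to the mirror root for a fetched URL."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        # Directory URL: the name depends on what the server sent back.
        path += index_name_for(content_type)
    return path.lstrip("/")
```

With this mapping, http://example.com/feed/ served as text/xml would be stored as feed/index.xml, while a directory URL served as text/html, or with an unrecognized type, would still end up as index.html.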