Re: Feature Request: Directory URLs and MIME Content-Type Header

2005-03-25 Thread Levander
Thought a little more about this issue.
The convert-links option is what appends index.html to every URL ending 
in a directory in the web pages I'm downloading.  I need to use this 
option to strip the information about the original web server I'm 
creating the mirror from.

However, if the convert-links option could be made somehow (via 
command-line option, whatever) to not append that index.html file name 
onto the end of the URLs, that would be a better workaround.

I consider it a workaround because the file created on the file system 
would still be called index.html.  But, this wouldn't be as readily 
apparent to users of the mirror.

And, it would be fairly easy, after downloading your mirror and before 
uploading it, to run a script that renames the files you know should be 
renamed.  That is my circumstance; I imagine some other people would 
have to investigate a little more, every time they download, to find 
out which file names should be changed.

I don't know if this is easier than creating a file name based on the 
Content-type header in the wget sources.

Also, the convert-links option is documented as making it easy to view 
web pages on local filesystems, which isn't exactly how I'm using it, 
or discussing using it, in this message.

--
Levander's Yabbering!
http://home.mindspring.com/~levander


Re: Feature Request: Directory URLs and MIME Content-Type Header

2005-03-24 Thread Levander
Okay, whatever y'all think is best.  I've got a workaround working for 
me, but as mentioned previously, it could be more seamless.

With the solution you recommend, though, it seems to me that in most 
circumstances the file name would need to be changed per URL.  Meaning, 
the option that determines the default file name shouldn't be global to 
all URLs.  Because: 1.) if you're only downloading one file (as in my 
workaround), you can already specify the file name to be created on the 
command line via the output-document option.  And, 2.) if you're 
downloading recursively, in most cases the default file name (meaning 
the local file name created for URLs ending in a slash) would be 
index.html, as wget is currently written.  And, any exceptions to this 
rule shouldn't override what is the vast majority of the cases.

I base these opinions on the idea that as the web grows, there will be 
more and more document types processed by the client and not on the 
server.  I actually mentioned this issue on the #wordpress channel on 
freenode, where several of the wordpress developers hang out.  They 
seemed pretty emphatic that URLs should leave out any final index.html 
or index.xml, as it's superfluous information in the URL and makes the 
URL more difficult to read and remember.  They even pointed me to a 
couple of docs on the web recommending this approach.  The reasoning was 
similar to another recommendation I saw on those pages: don't use tildes 
in URLs, because average users who have no exposure to UNIX may not 
recognize the character when copying URLs by hand.

Of course, wget could probably stand to wait for these files to become 
more prevalent before implementing this functionality. Wget is a great 
tool, and probably has a long future of use for web development.  I just 
think implementing these features sooner rather than later would make 
transitioning to these features more seamless for its users.  That said, 
I imagine this probably isn't a much demanded feature these days.

Also, you mention wanting to create the file name before downloading.  
I'm not sure why making the file name only after the headers arrive 
would be difficult.  Maybe the file name extension would have to be 
passed back through a lot of code to get to the point where the file 
name is created?  Creating a default file name sounds like a simple 
enough process.  But, I've never looked at the code myself.

-Levander

Hrvoje Niksic wrote:
> As currently written, Wget really prefers to determine the file name
> based on the URL, before the download starts (redirections are sort of
> an exception here).  It would be easy to add an option to change
> "index.html" to "index.xml" or whatever you desire, but it would be
> much harder to determine the file name only after the headers arrive.

--
Levander's Yabbering!
http://home.mindspring.com/~levander


Re: Feature Request: Directory URLs and MIME Content-Type Header

2005-03-24 Thread Hrvoje Niksic
As currently written, Wget really prefers to determine the file name
based on the URL, before the download starts (redirections are sort of
an exception here).  It would be easy to add an option to change
"index.html" to "index.xml" or whatever you desire, but it would be
much harder to determine the file name only after the headers arrive.
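
To illustrate what choosing the name from the URL alone can look like, 
here is a rough Python sketch.  It is not Wget's actual code: the 
function local_path_for and its default_page parameter are hypothetical, 
standing in for the kind of option described above.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def local_path_for(url: str, default_page: str = "index.html") -> str:
    """Map a URL to a relative local path using only the URL itself.

    default_page is a hypothetical knob: the file name used when the URL
    ends in a slash and so names a directory rather than a file.
    """
    parts = urlsplit(url)
    path = unquote(parts.path) or "/"
    if path.endswith("/"):
        # Directory URL: nothing in the URL names a file, so fall back.
        path += default_page
    return posixpath.join(parts.netloc, path.lstrip("/"))


if __name__ == "__main__":
    print(local_path_for("http://example.com/feed/"))
    print(local_path_for("http://example.com/feed/", default_page="index.xml"))
```

Everything the function needs is known before any request is sent, which 
is why a global default is cheap to add, whereas a name that depends on 
the response headers is only known once the download has begun.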


Re: Feature Request: Directory URLs and MIME Content-Type Header

2005-03-22 Thread Levander
Jens Rösner wrote:
>> apache.  Could wget, for url's that end in slashes, read the 
>> content-type header, and if it's text/xml, could wget create index.xml 
>> inside the directory wget creates?
>
> Don't you mean "create index.html"?

No, maybe I wasn't clear.  The file wget is creating is called 
index.html.  However, for my purposes, this is inappropriate.  What I 
need created for my blog rss feed is an index.xml file.  And, if the 
wget convert-links option is going to add a file name onto the URLs in 
other pages on my site that link to the blog rss feed, then it would 
need to add the file name index.xml to those URLs.

Like I said, for all the computer programs (browsers, news aggregators), 
I'm guessing index.html would work fine.  But when a human sees a link 
to an html file, it's going to confuse him: maybe he's not supposed to 
feed that link to a news aggregator, maybe he's supposed to click on it 
and look at the page to find the rss feed.

--
Levander's Yabbering!
http://home.mindspring.com/~levander



Re: Feature Request: Directory URLs and MIME Content-Type Header

2005-03-21 Thread Jens Rösner
Hi Levander!

I am not an expert by any means, "just another user", 
but what does the -E option do for you?
-E = --html-extension 

> apache.  Could wget, for url's that end in slashes, read the 
> content-type header, and if it's text/xml, could wget create index.xml 
> inside the directory wget creates?

Don't you mean "create index.html"?

CU
Jens


Feature Request: Directory URLs and MIME Content-Type Header

2005-03-20 Thread Levander
Created a static mirror of my blog using wget to download the files.  
There was one wget feature I had to work around.  When wget is 
downloading recursively and finds a URL that ends in a slash, for 
example http://example.com/html/, wget will create a directory locally 
and save the contents of the given url to a file called index.html.  
This worked fine for me in every case but one: the rss feed of my blog.  
I'm using Wordpress 1.5 to generate my blog, and by default Wordpress 
makes the rss feed url like http://example.com/feed/.  Apache is then 
usually configured to search for index.xml as well as index.html; 
apache finds index.xml and sends it down to wget.  Wget creates a 
directory and stores apache's response in that directory in a file 
called index.html.

And note, I believe because I'm using the wget convert-links option, the 
page that linked to the rss feed now has a link to 
http://example.com/feed/index.html. 

I think wget's current behavior would work fine, the only problem being 
my blog feed would now have an html suffix instead of an xml suffix.  I 
*think* the news aggregators would be fine with this.  However, it 
would confuse humans, as html is usually a human-viewable format and an 
xml suffix suggests something meant for computer programs.

This could possibly be changed by reading the Content-Type header that 
comes back from the apache server when requesting this xml file.  I 
don't know if it's true in every circumstance, but I did check the 
headers that Wordpress generates, and a text/xml content type is 
returned from apache.  Could wget, for URLs that end in slashes, read 
the Content-Type header, and if it's text/xml, create index.xml inside 
the directory wget creates?
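
As a rough sketch of the idea (in Python, not wget's C sources), the 
choice of file name for a directory URL could look like the following.  
The function index_name_for and the suffix table are hypothetical; only 
the text/xml case comes from the headers I checked, the other entries 
are assumptions for illustration.

```python
from typing import Optional

# Assumed mapping for illustration only.  text/xml is the type observed
# from Wordpress; the remaining entries are guesses at sensible defaults.
_SUFFIX_BY_TYPE = {
    "text/html": ".html",
    "text/xml": ".xml",
    "application/xml": ".xml",
    "application/rss+xml": ".xml",
}


def index_name_for(content_type: Optional[str]) -> str:
    """Pick the local file name for a URL ending in a slash.

    content_type is the raw Content-Type header value, for example
    "text/xml; charset=UTF-8", or None when the header is missing.
    """
    # Drop parameters such as charset and normalise case before the lookup.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    # Unknown or missing types keep today's behavior: index.html.
    return "index" + _SUFFIX_BY_TYPE.get(media_type, ".html")


if __name__ == "__main__":
    print(index_name_for("text/xml; charset=UTF-8"))
    print(index_name_for("text/html"))
```

Falling back to index.html for anything unrecognised means the vast 
majority of directory URLs would be saved exactly as they are now.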

I've already got a workaround that works 95% for me in my 
circumstances, and 100% for the public viewers of the static mirror of 
my blog.  I basically change the link of the rss feed inside Wordpress 
so that it's the same as what the link should look like after the blog 
is already mirrored.  Then, when wget finds that link to the rss feed in 
one of the web pages it's downloaded, it ignores the rss feed link 
because the mirror is on a different machine than the blog itself.  I 
have to run wget a second time to put the rss feed xml file in the right 
place in the mirror I downloaded with wget.  Then, I upload all the 
mirror files via ftp to the public web server.  The only problem is I 
can't test things with the rss feed on my local machine.  I have to go 
into wordpress and change the rss feed link every time I want to test 
something locally.

Please CC me on all responses; I'm not subscribed to this mailing list.
Thanks.
--
Levander's Yabbering!
http://home.mindspring.com/~levander