how to parse a webpage to download links of certain type?

2008-03-09 Thread shirish
Hi all,
I'm sure wget has a way in which one can parse a webpage and
download certain filetypes. Say, something like looking for .odf files
in a given webpage?
If something like this has been asked before then a link would
be good enough.
-- 
  Regards,
  Shirish Agarwal
  This email is licensed under http://creativecommons.org/licenses/by-nc/3.0/

065C 6D79 A68C E7EA 52B3  8D70 950D 53FB 729A 8B17


Re: how to parse a webpage to download links of certain type?

2008-03-09 Thread Charles
On Sun, Mar 9, 2008 at 3:49 PM, shirish [EMAIL PROTECTED] wrote:
> Hi all,
> I'm sure wget has a way in which one can parse a webpage and
> download certain filetypes. Say, something like looking for .odf files
> in a given webpage?

CMIIW, but I guess the command for this would be:

wget -r -l 1 -A .odf http://site-url

---
Charles


Re: how to parse a webpage to download links of certain type?

2008-03-09 Thread shirish
Charles,
 Pretty cool, this works :)

wget -r -l 1 -A .odf http://site-url


Re: how to parse a webpage to download links of certain type?

2008-03-09 Thread shirish
Hi all,
  Charles, there is one thing though: it creates directories and
sub-directories. I want the files saved in the same directory where
I'm running wget, without any directories. Is that possible?


Re: how to parse a webpage to download links of certain type?

2008-03-09 Thread Steven M. Schweda
From: shirish

> [...] not directories [...]

alp $ wget -h
[...]
Directories:
  -nd, --no-directories   don't create directories.
[...]

   Sounds as if it may be worth a try.
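
Combined with the earlier suggestion, the command would presumably be "wget -r -l 1 -nd -A .odf http://site-url". As a rough illustration of what that combination does (this is a sketch in Python, not wget's own code), the snippet below collects the links on one page, keeps those whose path ends in a given suffix, and maps each to a flat local filename with no directories. The class and function names, the sample HTML and the URL are all made up for the example.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def matching_downloads(page_url, html, suffix):
    """Return (absolute_url, flat_local_name) pairs for links ending in suffix."""
    collector = LinkCollector()
    collector.feed(html)
    results = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL (one level deep, like -l 1).
        absolute = urljoin(page_url, href)
        path = urlparse(absolute).path
        # Roughly like -A: keep only paths with the wanted suffix.
        if path.endswith(suffix):
            # Roughly like -nd: keep the last path component, drop directories.
            results.append((absolute, basename(path)))
    return results


# Hypothetical page content and URL, for illustration only.
sample_html = """
<a href="docs/report.odf">report</a>
<a href="/files/2008/notes.odf">notes</a>
<a href="index.html">home</a>
"""
downloads = matching_downloads("http://site-url/page/", sample_html, ".odf")
for url, local_name in downloads:
    print(url, "->", local_name)
```

The sketch only models the selection and naming; it does not download anything and ignores details such as robots.txt handling or duplicate file names.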



   Steven M. Schweda   [EMAIL PROTECTED]
   382 South Warwick Street    (+1) 651-699-9818
   Saint Paul  MN  55105-2547


Re: Content-Disposition UTF-8 and filename problems

2008-03-09 Thread Todd Pattist

I'll answer my own question for the record.

It's the "Content-Disposition: attachment; filename*=UTF-8''filename.zip"
header that causes the problem.  I set wget to use the Privoxy proxy
(wgetrc line):


http_proxy = localhost:8118/

and then set Privoxy to modify incoming server headers with a Privoxy 
filter:


SERVER-HEADER-FILTER: contentdisp Server Header filter to change content-disposition
s/content-disposition: attachment; filename\*=UTF-8''(.*)/content-disposition: attachment; filename=$1/ig


The Privoxy filter changes the server header to this form:
Content-Disposition: attachment; filename=filename.zip

which wget can read and now all is well, with the filename being saved 
under the correct name. 
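
For anyone who wants to check the substitution outside Privoxy, here is a minimal Python sketch of the same rewrite. It is not part of wget or Privoxy; the function name rewrite_header is made up for the example, and re.IGNORECASE stands in for the filter's /i flag.

```python
import re

# Same pattern as the Privoxy filter; re.IGNORECASE plays the role of /i,
# and re.sub replaces every match, like /g.
EXTENDED_FILENAME = re.compile(
    r"content-disposition: attachment; filename\*=UTF-8''(.*)",
    re.IGNORECASE,
)


def rewrite_header(header_line):
    """Turn filename*=UTF-8''name into plain filename=name (hypothetical helper)."""
    return EXTENDED_FILENAME.sub(
        r"content-disposition: attachment; filename=\1", header_line
    )


original = "Content-Disposition: attachment; filename*=UTF-8''filename.zip"
# The header name comes out in lower case because the replacement text is
# lower case; HTTP header names are case-insensitive, so that is harmless.
print(rewrite_header(original))
```

Lines that do not match the pattern pass through unchanged, so other headers are left alone.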
BTW, when content_disposition=on the file seems to be saved in the root 
directory, not the correct directory.  With content_disposition=off, the 
wrong name is used, but it's in the right place.  I believe someone else 
has seen this problem too (from the email archives IIRC).


Thanks for a great program!



Todd Pattist wrote:
I'm having trouble with the filename after retrieving a PHP-generated 
file download.  It is retrieved with:

http://site.com/download/file.php?id=62651
The content disposition header says:
Content-Disposition: attachment; filename*=UTF-8''filename.zip

I want it to end up as filename.zip, but it ends up as 
[EMAIL PROTECTED]  Unfortunately, I'm dealing with hundreds of files 
of varying types.


I'm using these wgetrc options:
recursive = on
content_disposition = on
verbose = on
dir_prefix = folder
server_response = on

Saving the header in Firefox, I see:
content-disposition: attachment; filename*=UTF-8''filename.zip
Content-Type: application/octet-stream

I'm successfully saving other files from another site with the correct 
name; they have a header as follows:

content-disposition: attachment; filename=flower.zip
Content-Type: zip

Is my problem due to the difference between the two content-disposition: 
attachment; filename lines above?  Is it the UTF-8, or something else?
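
To make the difference between the two header forms concrete, here is a small Python sketch (not wget's parser) that reads both. The plain form is filename=value; the extended form is filename*=charset'language'value with percent-encoded bytes, as described in RFC 5987. The function name and the regular expressions are assumptions made for the example, and they cover only these simple cases.

```python
import re
from urllib.parse import unquote

# Plain form: filename=value or filename="value".
PLAIN = re.compile(r'filename=("?)([^";]+)\1', re.IGNORECASE)
# Extended form: filename*=charset'language'percent-encoded-value.
EXTENDED = re.compile(r"filename\*=([^']*)'([^']*)'([^;]+)", re.IGNORECASE)


def filename_from_header(value):
    """Extract a filename from a Content-Disposition value (hypothetical helper)."""
    extended = EXTENDED.search(value)
    if extended:
        charset, _language, encoded = extended.groups()
        # Percent-decode using the charset named in the header.
        return unquote(encoded.strip(), encoding=charset or "utf-8")
    plain = PLAIN.search(value)
    if plain:
        return plain.group(2).strip()
    return None


print(filename_from_header("attachment; filename=flower.zip"))
print(filename_from_header("attachment; filename*=UTF-8''filename.zip"))
```

A parser that only knows the plain form finds no filename= parameter in the second header, which is consistent with the wrong name being used.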


Any help or hints would be appreciated.

Here's a logfile of the relevant request/response header exchange that 
fails:

HTTP request sent, awaiting response...
 HTTP/1.1 200 OK
 Date: Sun, 09 Mar 2008 02:31:23 GMT
 Server: Apache
 Pragma: public
 Content-Disposition: attachment; filename*=UTF-8''filename.zip
 Vary: Accept-Encoding,User-Agent
 Keep-Alive: timeout=5, max=1999
 Connection: Keep-Alive
 Content-Type: application/octet-stream
Length: unspecified [application/octet-stream]
--2008-03-08 21:31:25--  http://site.com/download/file.php?id=62651
Connecting to site.com|70.87.3.196|:80... connected.
HTTP request sent, awaiting response...
 HTTP/1.1 200 OK
 Date: Sun, 09 Mar 2008 02:31:24 GMT
 Server: Apache
 Pragma: public
 Content-Disposition: attachment; filename*=UTF-8''filename.zip
 Content-Length: 125127
 Vary: Accept-Encoding,User-Agent
 Keep-Alive: timeout=5, max=2000
 Connection: Keep-Alive
 Content-Type: application/octet-stream
Length: 125127 (122K) [application/octet-stream]
Saving to: `foldername/site.com/download/[EMAIL PROTECTED]'