[Bug-wget] downloading all files on page (with identical filenames)
Dear list,

I'm using wget 1.12 on Ubuntu 10.04. I don't know if this is a bug or not. I'm using

  wget -U firefox -r -l1 -nd -e robots=off -A.pdf http://example.com

to download PDFs off a page. The dilemma is that a lot of the PDF links on the page have the same name (example.pdf). Wget is supposed to append .1, .2, etc. to those files. However, with the above command, only .1 is appended, and hence only one file ending in .1 is seen. If I set "-A.pdf,.pdf.1", then .1 and .2 get appended, but .2 gets repeated and only one .2 file is available at the end.

Are some of my arguments conflicting?

Thanks,
Vinh
[Bug-wget] downloading links in a dynamic site
Dear list,

My goal is to download some PDF files from a dynamic site (not sure of the terminology). For example, I would execute:

  wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=0

and would get my 10 PDF files. On the page I can click a "Next" link (to get more files), so I execute:

  wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=10

However, the downloaded files are identical to the previous set. I tried the cookies and referer settings:

  wget -U firefox --cookies=on --keep-session-cookies --save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=0
  wget -U firefox --referer='http://site.com/?sortorder=asc&p_o=0' --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=10

but the results again are identical. Any suggestions?

Thanks.
Vinh
Re: [Bug-wget] downloading all files on page (with identical filenames)
On Sun, Jul 25, 2010 at 2:10 AM, Micah Cowan wrote:
> On 07/24/2010 11:15 AM, Vinh Nguyen wrote:
>> Dear list,
>>
>> I'm using wget 1.12 on Ubuntu 10.04. I don't know if this is a bug or
>> not. I'm using
>>
>>   wget -U firefox -r -l1 -nd -e robots=off -A.pdf http://example.com
>>
>> to download PDFs off a page. The dilemma is that a lot of the PDF
>> links on the page have the same name (example.pdf). Wget is supposed
>> to append .1, .2, etc. to those files. However, with the above
>> command, only .1 is appended, and hence only one file ending in .1 is
>> seen. If I set "-A.pdf,.pdf.1", then .1 and .2 get appended, but .2
>> gets repeated and only one .2 file is available at the end.
>>
>> Are some of my arguments conflicting?
>
> Looks like that blasted delete-after logic again: it's because after the
> rename, the files no longer match -A.pdf, so they get deleted (not sure
> how you still have a .pdf.1 at all at the end, unless you're
> interrupting wget before it gets a chance to delete it). As a
> workaround, you should be able to use something like -A '*.pdf,*.pdf.*'

Thanks Micah, this works.

> --
> Micah J. Cowan
> http://micah.cowan.name/
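[List archive note: to make the workaround concrete, here is a rough shell sketch of the behaviour Micah describes. Wget matches each saved filename against the -A globs, and the renamed duplicates (example.pdf.1, example.pdf.2, ...) only survive when the pattern list covers them. The matches_accept helper below is hypothetical, an illustration of the glob check only, not actual wget code.]

```shell
# matches_accept FILE PATTERN...
# Hypothetical helper: succeed if FILE matches any of the shell globs,
# roughly what wget's -A acceptance test does to each filename.
matches_accept() {
  file=$1
  shift
  for pat in "$@"; do
    # An unquoted $pat in a case label is treated as a glob pattern.
    case $file in
      $pat) return 0 ;;
    esac
  done
  return 1
}

# With -A '*.pdf' alone, a renamed duplicate no longer matches,
# so wget deletes it after the download:
matches_accept example.pdf   '*.pdf' && echo "example.pdf kept"
matches_accept example.pdf.1 '*.pdf' || echo "example.pdf.1 deleted"

# With -A '*.pdf,*.pdf.*' both forms are accepted and kept:
matches_accept example.pdf   '*.pdf' '*.pdf.*' && echo "example.pdf kept"
matches_accept example.pdf.1 '*.pdf' '*.pdf.*' && echo "example.pdf.1 kept"
```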
Re: [Bug-wget] downloading links in a dynamic site
On Mon, Jul 26, 2010 at 11:18 AM, Keisial wrote:
> Vinh Nguyen wrote:
>> Dear list,
>>
>> My goal is to download some PDF files from a dynamic site (not sure of
>> the terminology). For example, I would execute:
>>
>>   wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=0
>>
>> and would get my 10 PDF files. On the page I can click a "Next" link
>> (to get more files), so I execute:
>>
>>   wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=10
>>
>> However, the downloaded files are identical to the previous set. I
>> tried the cookies and referer settings:
>>
>>   wget -U firefox --cookies=on --keep-session-cookies --save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=0
>>   wget -U firefox --referer='http://site.com/?sortorder=asc&p_o=0' --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*' http://site.com/?sortorder=asc&p_o=10
>>
>> but the results again are identical. Any suggestions?
>>
>> Thanks.
>> Vinh
>
> Look at the page source to see how they are generating the urls.
> Maybe they are using some ugly javascript, although that discards
> the benefit of paging...

Thanks for your response, Keisial. I looked at the source, and of course there is javascript. However, I couldn't tie it to anything that generates links. The pager that I click on:

  32 Chapters  First | 1-10 | 11-20 | 21-30 | 31-32 | Next

is displayed in the source. Also, when I try to manually enter the url, changing =10, =20, =30, I get the right page, so I don't think it's a javascript issue. What else could it be besides referer and cookies?

Vinh
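[List archive note: a quick way to follow Keisial's advice is to dump the page and pull out the hrefs, so you can see exactly which links wget would follow on each paging url. The HTML below is a made-up stand-in for the real page; against the live site you would pipe `wget -q -O - '<url>'` into the same grep instead.]

```shell
# Made-up sample of what the paged site's markup might look like:
html='<a href="ch01.pdf">1</a> <a href="ch02.pdf">2</a> <a href="?sortorder=asc&amp;p_o=10">Next</a>'

# Extract every href attribute; on the real page this shows whether the
# "Next" link is a plain url (wget can follow it) or javascript-driven.
printf '%s\n' "$html" | grep -o 'href="[^"]*"'
```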
Re: [Bug-wget] downloading links in a dynamic site
On Mon, Jul 26, 2010 at 1:51 PM, Vinh Nguyen wrote:
> That's displayed in the source. Also, when I try to manually enter
> the url, changing =10, =20, =30, I get the right page, so I don't think
> it's a javascript issue. What else could it be besides referer and
> cookies?

Confirmed that it also works in DIFFERENT browsers (conkeror and firefox). Hmm, what can be the difference between wget and these browsers?
Re: [Bug-wget] downloading links in a dynamic site
On Mon, Jul 26, 2010 at 2:02 PM, Vinh Nguyen wrote:
> On Mon, Jul 26, 2010 at 1:51 PM, Vinh Nguyen wrote:
>> That's displayed in the source. Also, when I try to manually enter
>> the url, changing =10, =20, =30, I get the right page, so I don't think
>> it's a javascript issue. What else could it be besides referer and
>> cookies?
>
> Confirmed that it also works in DIFFERENT browsers (conkeror and
> firefox). Hmm, what can be the difference between wget and these
> browsers?

This issue is RESOLVED: put quotes around the url. I thought I had done this the entire time. Thanks everyone.

Vinh
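[List archive note: for anyone hitting the same wall, the unquoted `&` in the url is what broke the paging. `&` is the shell's background operator, so the command line never reaches wget intact. A minimal sketch, using the example url from this thread:]

```shell
# Unquoted, the shell splits the command at '&':
#
#   wget http://site.com/?sortorder=asc&p_o=10
#
# runs `wget http://site.com/?sortorder=asc` in the background and then
# evaluates `p_o=10` as a plain variable assignment, so wget never sees
# the p_o parameter and every run fetches the same first page.

# Quoted, the '&' is literal and the whole url reaches wget intact:
url='http://site.com/?sortorder=asc&p_o=10'
printf '%s\n' "$url"    # the single argument wget would receive
```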