[Bug-wget] downloading all files on page (with identical filenames)

2010-07-25 Thread Vinh Nguyen
Dear list,

I'm using wget 1.12 on ubuntu 10.04.  I don't know if this is a bug or
not.  I'm using

wget -U firefox -r -l1 -nd -e robots=off -A.pdf http://example.com

to download PDFs off a page.  The dilemma is that a lot of the PDF
links on the page have the same name (example.pdf).  Wget is supposed
to append .1, .2, etc., to those files.  However, with the above
command, only .1 is appended, and hence only one file ending in .1 is
seen.  If I set "-A.pdf,.pdf.1", then .1 and .2 get appended, but .2
gets repeated and only one .2 file is available at the end.

Are some of my arguments conflicting?

Thanks
Vinh



[Bug-wget] downloading links in a dynamic site

2010-07-25 Thread Vinh Nguyen
Dear list,

My goal is to download some pdf files from a dynamic site (not sure on
the terminology).  For example, I would execute:

wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
http://site.com/?sortorder=asc&p_o=0

and would get my 10 pdf files.  On the page I can click a "Next" link
(to have more files), and I execute:

wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
http://site.com/?sortorder=asc&p_o=10

However, the downloaded files are identical to the previous ones.  I
tried the cookie and referer settings:

wget -U firefox --cookies=on --keep-session-cookies
--save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
http://site.com/?sortorder=asc&p_o=0
wget -U firefox --referer='http://site.com/?sortorder=asc&p_o=0'
--cookies=on --load-cookies=cookie.txt --keep-session-cookies
--save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
http://site.com/?sortorder=asc&p_o=10

but the results again are identical.  Any suggestions?

Thanks.
Vinh



Re: [Bug-wget] downloading all files on page (with identical filenames)

2010-07-25 Thread Vinh Nguyen
On Sun, Jul 25, 2010 at 2:10 AM, Micah Cowan  wrote:
> On 07/24/2010 11:15 AM, Vinh Nguyen wrote:
>> Dear list,
>>
>> I'm using wget 1.12 on ubuntu 10.04.  I don't know if this is a bug or
>> not.  I'm using
>>
>> wget -U firefox -r -l1 -nd -e robots=off -A.pdf http://example.com
>>
>> to download PDFs off a page.  The dilemma is that a lot of the PDF
>> links on the page have the same name (example.pdf).  Wget is supposed
>> to append .1, .2, etc., to those files.  However, with the above
>> command, only .1 is appended, and hence only one file ending in .1 is
>> seen.  If I set "-A.pdf,.pdf.1", then .1 and .2 get appended, but .2
>> gets repeated and only one .2 file is available at the end.
>>
>> Are some of my arguments conflicting?
>
> Looks like that blasted delete-after logic again: it's because after the
> rename, the files no longer match -A.pdf, so they get deleted (not sure
> how you still have a .pdf.1 at all at the end, unless you're
> interrupting wget before it gets a chance to delete it). As a
> workaround, you should be able to use something like -A '*.pdf,*.pdf.*'
>

Thanks Micah, this works.
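
For anyone following along: wget's -A accept list uses shell-style glob
patterns, which is why the widened list works.  The renamed duplicates
(example.pdf.1, example.pdf.2, ...) no longer match '*.pdf' but do match
'*.pdf.*', so they survive the post-download accept check.  A quick sketch
of the matching behavior (the filenames are just illustrative):

```shell
# wget's -A accept list uses shell-style glob patterns; check which
# pattern each (hypothetical) downloaded filename would match.
for name in example.pdf example.pdf.1 example.pdf.2; do
  case "$name" in
    *.pdf)   echo "$name matches *.pdf" ;;
    *.pdf.*) echo "$name matches *.pdf.*" ;;
  esac
done
```

Only the first file matches the narrow pattern; the renamed copies need
the second one, which is why '-A.pdf' alone triggers the deletions.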

> --
> Micah J. Cowan
> http://micah.cowan.name/
>



Re: [Bug-wget] downloading links in a dynamic site

2010-07-26 Thread Vinh Nguyen
On Mon, Jul 26, 2010 at 11:18 AM, Keisial  wrote:
>  Vinh Nguyen wrote:
>> Dear list,
>>
>> My goal is to download some pdf files from a dynamic site (not sure on
>> the terminology).  For example, I would execute:
>>
>> wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
>> http://site.com/?sortorder=asc&p_o=0
>>
>> and would get my 10 pdf files.  On the page I can click a "Next" link
>> (to have more files), and I execute:
>>
>> wget -U firefox -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
>> http://site.com/?sortorder=asc&p_o=10
>>
>> However, the downloaded files are identical to the previous ones.  I
>> tried the cookie and referer settings:
>>
>> wget -U firefox --cookies=on --keep-session-cookies
>> --save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
>> http://site.com/?sortorder=asc&p_o=0
>> wget -U firefox --referer='http://site.com/?sortorder=asc&p_o=0'
>> --cookies=on --load-cookies=cookie.txt --keep-session-cookies
>> --save-cookies=cookie.txt -r -l1 -nd -e robots=off -A '*.pdf,*.pdf.*'
>> http://site.com/?sortorder=asc&p_o=10
>>
>> but the results again are identical.  Any suggestions?
>>
>> Thanks.
>> Vinh
>
> Look at the page source how they are generating the urls.
> Maybe they are using some ugly javascript, although that discards
> the benefit of paging...


Thanks for your response, Keisial.  I looked at the source, and of
course there is JavaScript.  However, I couldn't tie it to anything
that generates links.  The links that I click on:

32 Chapters: First | 1-10 | 11-20 | 21-30 | 31-32 | Next

That's displayed in the source.  Also, when I try to manually enter
the URL, changing =10, =20, =30, I get the right page, so I don't
think it's a JavaScript issue.  What else could it be besides referer
and cookies?

Vinh



Re: [Bug-wget] downloading links in a dynamic site

2010-07-26 Thread Vinh Nguyen
On Mon, Jul 26, 2010 at 1:51 PM, Vinh Nguyen  wrote:
> That's displayed in the source.  Also, when I try to manually enter
> the URL, changing =10, =20, =30, I get the right page, so I don't
> think it's a JavaScript issue.  What else could it be besides referer
> and cookies?

Confirmed that it also works in DIFFERENT browsers (Conkeror and
Firefox).  Hmm, what could the difference be between wget and these
browsers?



Re: [Bug-wget] downloading links in a dynamic site

2010-07-26 Thread Vinh Nguyen
On Mon, Jul 26, 2010 at 2:02 PM, Vinh Nguyen  wrote:
> On Mon, Jul 26, 2010 at 1:51 PM, Vinh Nguyen  wrote:
>> That's displayed in the source.  Also, when I try to manually enter
>> the URL, changing =10, =20, =30, I get the right page, so I don't
>> think it's a JavaScript issue.  What else could it be besides referer
>> and cookies?
>
> Confirmed that it also works in DIFFERENT browsers (Conkeror and
> Firefox).  Hmm, what could the difference be between wget and these
> browsers?

This issue is RESOLVED: put quotes around the URL, so the shell
doesn't treat the & in the query string as a command separator.  I
thought I had done this the entire time.  Thanks everyone.
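
For the record, this is why the "Next page" downloads were identical: an
unquoted & is a shell control operator, so the shell backgrounds
everything before it and the p_o parameter never reaches wget.  A minimal
demonstration with echo standing in for wget:

```shell
# Unquoted: the shell treats & as "run in background", so the command
# only receives the URL up to the &, and "p_o=10" becomes a separate
# (no-op) variable assignment.
echo http://site.com/?sortorder=asc&p_o=10
# prints: http://site.com/?sortorder=asc

# Quoted: the full query string reaches the command as one argument.
echo 'http://site.com/?sortorder=asc&p_o=10'
# prints: http://site.com/?sortorder=asc&p_o=10
```

With the first form, every run fetched p_o=0 (the default), which is why
the results never changed no matter what offset was on the command line.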

Vinh