Re: [Bug-wget] download page-requisites with spanning hosts

2009-04-30 Thread Jake b
On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar petr.pi...@atlas.cz wrote:

 On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
  Instead of creating something like: 912.html or index.html it instead
  becomes: viewtopic@t=29807&postdays=0&postorder=asc&start=27330
 
 That's normal because the server doesn't provide any useful alternative name
 via HTTP headers, which can be obtained using wget's option
 --content-disposition.

I already know how to get the page number (my python script converts
27330 to 912 and back), but I'm not sure how to tell wget what the
output html file should be named.
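(From skimming the man page, I'm guessing -O is what I want for naming a
single page, e.g. something like:

wget -O 912.html "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"

though I'm not sure how that interacts with -p, since -O seems to dump
everything into the one file.)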

  How do I make wget download all images on the page? I don't want to
  recurse into other hosts, or even sijun, just download this page and
  all the images needed to display it.
 
 That's not an easy task, especially because all the big desktop images are
 stored on other servers. I think wget is not powerful enough to do it all
 on its own.

Are you saying that because some services show a thumbnail, you have to
click through to get the full image? I'm not worried about that, since
the majority are full size in the thread.

Would it be simpler to say something like: download page 912 with
recursion level 1 (or 2?), but ignore all non-image links, so it only
recurses into images, i.e. it still downloads randomguyshost.com/3.png?

But then the problem is that it does not span any hosts. Is there a way
I can achieve this if I do the same, except allow spanning to every
host, with recursion level 1, and only recurse into image links?
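Something like this is what I had in mind (guessing at the flags from
the docs, not tested yet):

wget -r -l 1 -H -p -w 15 -A jpg,jpeg,png,gif "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"

i.e. recursion depth 1, span all hosts, but only accept image files --
though I don't know whether -A would make wget throw away the HTML page
itself after parsing it.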

 I propose using other tools to extract the image URLs and then to
 download them using wget. E.g.:

I guess I could use wget to get the html and parse that for image
tags manually, but then I don't get the forum thread comments, which
aren't required but would be nice.

 wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' \
   | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -

OK, I will have to try it out. (I'm in Windows ATM, so I can't pipe.)

 Actually, I suppose you use some unix environment, where you have a
 powerful collection of external tools (grep, seq) available and amazing
 shell scripting abilities (like colons and loops).

 -- Petr

Using Python, and I have a dual boot if needed.

--
Jake




[Bug-wget] download page-requisites with spanning hosts

2009-04-29 Thread Jake b
I'm trying to download multiple pages from the sijun speedpaint thread
so I can use their images for my random desktop folder. I can download
each page by hand using firefox, but this becomes unwieldy, especially
since the prev button has a bit of a delay. (So I want to automate it,
with delays and/or speed caps to be friendly to the server.)

The wget command I am using:
wget.exe -p -k -w 15 "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"

It has 2 problems:

1) Rename file:

Instead of creating something like: 912.html or index.html it instead
becomes: viewtopic@t=29807&postdays=0&postorder=asc&start=27330

2) Images that span hosts are failing.

I have page-requisites on, but since some images are hosted on tinypic,
imageshack, etc. it is not downloading them, meaning it looks like
this:

sijun/page912.php
imageshack.com/1.png
tinypic.com/2.png
randomguyshost.com/3.png


Because of this, I cannot simply list all domains to span. I don't
know all the domains, since people have personal servers.

How do I make wget download all images on the page? I don't want to
recurse into other hosts, or even sijun, just download this page and
all the images needed to display it.
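From reading the manual, I'm guessing -p combined with -H (span hosts)
is roughly what I'm after, e.g. something like (untested):

wget.exe -p -k -H -w 15 "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"

but I don't know whether -H without -r stays limited to the requisites
of this single page, or starts pulling in more than I want.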




[ This one is a lower priority, but someone might already know how to
solve this ]
3) After this is done, I want to loop to download multiple pages. It
would be cool if I downloaded pages 900 to 912 and each page's next
link correctly pointed to the local versions.

I'm not sure if I can use wget's -k option, or if that won't work
because recursion on forums can be weird?
Either way, I have a simple script that can convert 900 to 912 into
the correct URLs, pausing in between each request.
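If I end up doing it from a shell instead, I think a loop like this
would cover it (assuming each page just bumps start= by 30, which
matches page 912 being start=27330):

for p in $(seq 900 912); do
  wget -p -k -w 15 "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=$(( (p - 1) * 30 ))"
  sleep 15
done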

Maybe I will have to manually modify the links using regexes, unless
you know a shortcut?



thanks!
--
Jake