Re: [Bug-wget] download page-requisites with spanning hosts

2009-04-30 Thread Petr Pisar
On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
 
 The wget command I am using:
 wget.exe -p -k -w 15
 "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"
 
 It has 2 problems:
 
 1) Rename file:
 
 Instead of creating something like 912.html or index.html, it becomes:
 viewtopic@t=29807&postdays=0&postorder=asc&start=27330

That's normal, because the server doesn't provide any useful alternative name
via HTTP headers; such a name could be picked up with wget's option
--content-disposition.

If you want to get the number of the gallery page, you need to parse the HTML
code yourself to obtain it (e.g. using grep).

However, I think a better naming convention is the value of the start URL
parameter (in your example, the number 27330).
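
If you nevertheless want the page number, the mapping is simple arithmetic.
A minimal sketch, assuming the forum shows 30 posts per page (which the pair
27330/912 suggests):

# page number from the "start" parameter (assumes 30 posts per page)
START=27330
PAGE=$(( START / 30 + 1 ))   # 27330 -> 912
echo "$PAGE"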

 2) images that span hosts are failing.
 
 I have page-requisites on, but since some pages are on tinypic, or
 imageshack, etc., it is not downloading them. Meaning it looks like
 this:
 
 sijun/page912.php
   imageshack.com/1.png
   tinypic.com/2.png
   randomguyshost.com/3.png
 
 
 Because of this, I cannot simply list all domains to span. I don't
 know all the domains, since people have personal servers.
 
 How do I make wget download all images on the page? I don't want to
 recurse other hosts, or even sijun, just download this page, and all
 images needed to display it.
 
That's not an easy task, especially because all the big desktop images are stored
on other servers. I don't think wget is powerful enough to do it all on its own.

I propose using other tools to extract the image URLs and then downloading them
with wget. E.g.:

wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' \
  | grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | wget -i -

This command downloads the HTML code, uses grep to find all image files
stored on other servers (deciding by file name extensions and absolute
addresses), and finally downloads those images.

There is one little problem: not all of the images still exist, and some servers
return a dummy page instead of a proper error code, so you may sometimes get
non-image files.
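
If the dummy pages bother you, here is a rough cleanup sketch (assuming the
file utility and its --mime-type option are available) that deletes anything
which is not really an image after the download:

# drop any downloaded file that is not actually an image
for f in *.jpg *.jpeg *.png; do
    [ -e "$f" ] || continue
    case $(file --brief --mime-type "$f") in
        image/*) ;;        # a real image, keep it
        *) rm -- "$f" ;;   # a dummy HTML page or similar, remove it
    esac
done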

 [ This one is a lower priority, but someone might already know how to
 solve this ]
 3) After this is done, I want to loop to download multiple pages. It
 would be cool if I downloaded pages 900 to 912, and each page's next
 link worked correctly, linking to the local versions.
 
[…]
 Either way, I have a simple script that can convert 900 to 912 into
 the correct URLs, and pause in between each request.
 
Wrap your script inside a counted for-loop:

for N in $(seq 900 912); do
    # the variable N holds the current page number here
    echo $N
done
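
Combining that loop with the extraction pipeline above could look roughly like
this. It is only a sketch: it assumes 30 posts per page (so page N starts at
post (N - 1) * 30) and reuses the -w 15 delay from your original command:

# rough sketch: fetch pages 900-912 and the off-site images they reference
# (assumes 30 posts per page, i.e. page N starts at post (N - 1) * 30)
for N in $(seq 900 912); do
    START=$(( (N - 1) * 30 ))
    URL="http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=$START"
    wget -O "$N.html" "$URL"
    grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' "$N.html" | sort -u | wget -w 15 -i -
    sleep 15   # pause between pages, to stay friendly to the server
done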

Actually, I assume you are using some Unix environment, where you have a
powerful collection of external tools (grep, seq) and shell scripting
facilities (like pipes and loops) available.

-- Petr




Re: [Bug-wget] download page-requisites with spanning hosts

2009-04-30 Thread Jake b
On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar petr.pi...@atlas.cz wrote:

 On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
  Instead of creating something like 912.html or index.html, it becomes:
  viewtopic@t=29807&postdays=0&postorder=asc&start=27330
 
 That's normal, because the server doesn't provide any useful alternative name
 via HTTP headers; such a name could be picked up with wget's option
 --content-disposition.

I already know how to get the page number (my python script converts
27330 to 912 and back), but I'm not sure how to tell wget what the
output html file should be named.

  How do I make wget download all images on the page? I don't want to
  recurse other hosts, or even sijun, just download this page, and all
  images needed to display it.
 
 That's not an easy task, especially because all the big desktop images are
 stored on other servers. I don't think wget is powerful enough to do it all
 on its own.

Are you saying that because some services show a thumbnail, which you then
click to get the full image? I'm not worried about that, since the majority
are full size in the thread.

Would it be simpler to say something like: download page 912, recursion
level 1 (or 2?), except for non-image links? (So it only allows recursion
on images, i.e. downloading randomguyshost.com/3.png.)

But the problem is that it does not span any hosts. Is there a way I can
achieve this if I do the same, except allow spanning to every host, recurse
to level 1, and only recurse into image links?

 I propose using other tools to extract the image URLs and then downloading
 them with wget. E.g.:

I guess I could use wget to get the html and parse that for image
tags manually, but then I don't get the forum thread comments. Which
isn't required, but would be nice.

 wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' \
   | grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | wget -i -

Ok, will have to try it out. (In Windows ATM, so I can't pipe.)

 Actually, I assume you are using some Unix environment, where you have a
 powerful collection of external tools (grep, seq) and shell scripting
 facilities (like pipes and loops) available.

 -- Petr

Using python, and I have dual boot if needed.

--
Jake




Re: [Bug-wget] download page-requisites with spanning hosts

2009-04-30 Thread Petr Pisar
On Thu, Apr 30, 2009 at 03:31:21AM -0500, Jake b wrote:
 On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar petr.pi...@atlas.cz wrote:
 
  On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
 but I'm not sure how to tell wget what the output html file should be named.
 
wget -O OUTPUT_FILE_NAME
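
So, since your script already knows that start=27330 is page 912, something
like this would give you the name you wanted:

wget -O 912.html 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330'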

   How do I make wget download all images on the page? I don't want to
   recurse other hosts, or even sijun, just download this page, and all
   images needed to display it.
  
  That's not an easy task, especially because all the big desktop images are
  stored on other servers. I don't think wget is powerful enough to do it all
  on its own.
 
 Are you saying that because some services show a thumbnail, which you then
 click to get the full image? 
[…]
 Would it be simpler to say something like: download page 912, recursion
 level 1 (or 2?), except for non-image links? (So it only allows recursion
 on images, i.e. downloading randomguyshost.com/3.png.)
 
You can limit downloads according to file name extensions (option -A), however
this will remove the main HTML file itself and so prevent recursion. And no,
there is no option to download only files pointed to from a particular HTML
element like IMG.

Without the -A option, you get a lot of useless files (regardless of spanning).

If you look at the locations of the files you are interested in, you will see
that they are all located outside the Sijun domain, and every page contains
only a small number of such files. Thus it's more efficient, and friendlier to
the servers, to extract these URLs first and then download only them.
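
For example, a small variation of the pipeline from my previous mail that
writes the extracted URLs to a file first, so you can inspect the list before
downloading anything (sort -u just removes duplicates):

wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' \
  | grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | sort -u > images.txt
wget -w 15 -i images.txt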

 But the problem is that it does not span any hosts. Is there a way I can
 achieve this if I do the same, except allow spanning to every host, recurse
 to level 1, and only recurse into image links?

There is the option -H for spanning. The following wget-only command does what
you want, but as I said, it produces a lot of useless requests and files.

wget -p -l 1 -H 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330'
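
If you do go this route, you could at least combine it with the link conversion
and the delay from your original command, roughly:

wget -p -l 1 -H -k -w 15 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330'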
 


  I propose using other tools to extract the image URLs and then downloading
  them with wget. E.g.:
 
 I guess I could use wget to get the html and parse that for image tags
 manually, but then I don't get the forum thread comments. Which isn't
 required, but would be nice.

You can do both: extract image URLs and extract comments.

 
  wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' \
    | grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | wget -i -
 
 Ok, will have to try it out. (In Windows ATM, so I can't pipe.)
 
AFAIK the Windows shells command.com and cmd.exe support pipes.

 Using python, and I have dual boot if needed.
 
Or you can execute programs connected through pipes in Python.

-- Petr




[Bug-wget] download page-requisites with spanning hosts

2009-04-29 Thread Jake b
I'm trying to download multiple pages from the Sijun speedpaint thread
so I can use their images for my random desktop folder. I can download
each page by hand using Firefox, but this becomes unwieldy,
especially since the prev button has a bit of a delay. (So I want to
automate it, with delays and/or speed caps to be friendly to the
server.)

The wget command I am using:
wget.exe -p -k -w 15
"http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"

It has 2 problems:

1) Rename file:

Instead of creating something like 912.html or index.html, it becomes:
viewtopic@t=29807&postdays=0&postorder=asc&start=27330

2) images that span hosts are failing.

I have page-requisites on, but since some pages are on tinypic, or
imageshack, etc., it is not downloading them. Meaning it looks like
this:

sijun/page912.php
imageshack.com/1.png
tinypic.com/2.png
randomguyshost.com/3.png


Because of this, I cannot simply list all domains to span. I don't
know all the domains, since people have personal servers.

How do I make wget download all images on the page? I don't want to
recurse other hosts, or even sijun, just download this page, and all
images needed to display it.




[ This one is a lower priority, but someone might already know how to
solve this ]
3) After this is done, I want to loop to download multiple pages. It
would be cool if I downloaded pages 900 to 912, and each page's next
link worked correctly, linking to the local versions.

I'm not sure if I can use wget's -k option, or if that won't work
because recursion on forums can be weird?
Either way, I have a simple script that can convert 900 to 912 into
the correct URLs, and pause in between each request.

Maybe I will have to manually modify the links using regexes, unless you
know a shortcut?



thanks!
--
Jake