Re: [Bug-wget] download page-requisites with spanning hosts
On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:

> The wget command I am using:
>
>     wget.exe -p -k -w 15 "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"
>
> It has 2 problems:
>
> 1) Rename file: Instead of creating something like 912.html or index.html,
> it instead becomes: viewtopic@t=29807&postdays=0&postorder=asc&start=27330

That's normal, because the server doesn't provide any useful alternative name
via HTTP headers (such a name could otherwise be used with wget's option
--content-disposition). If you want the page number of the gallery, you need
to parse the HTML code by hand to obtain it (e.g. using grep). However, I
guess a better naming convention is the value of the start URL parameter (in
your example, the number 27330).

> 2) Images that span hosts are failing. I have page-requisites on, but since
> some images are on tinypic, or imageshack, etc., it is not downloading them.
> Meaning it looks like this:
>
>     sijun/page912.php
>     imageshack.com/1.png
>     tinypic.com/2.png
>     randomguyshost.com/3.png
>
> Because of this, I cannot simply list all domains to span. I don't know all
> the domains, since people have personal servers. How do I make wget download
> all images on the page? I don't want to recurse other hosts, or even sijun;
> just download this page, and all images needed to display it.

That's not an easy task, especially because all the big desktop images are
stored on other servers. I think wget is not powerful enough to do it all on
its own. I propose using other tools to extract the image URLs and then
downloading them using wget. E.g.:

    wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' |
      grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | wget -i -

This command downloads the HTML code, uses grep to find all image files stored
on other servers (deciding by file name extension and absolute address), and
finally downloads those images.
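The extraction step of that pipeline can be tried offline. A sketch: the same grep pattern run against a small made-up HTML snippet (the image URLs here are invented for illustration) instead of the live page:

```shell
# Offline check of the URL-extraction step; the heredoc stands in for the
# HTML that wget -O - would print.
cat <<'HTML' | grep -o -E 'http://[^"]*\.(jpg|jpeg|png)'
<p>great painting!</p><img src="http://img99.imageshack.us/pic1.jpg">
<a href="http://i40.tinypic.com/pic2.png">full size</a>
HTML
# prints:
# http://img99.imageshack.us/pic1.jpg
# http://i40.tinypic.com/pic2.png
```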
There is a little problem: not all of the images still exist, and some servers
return a dummy page instead of a proper error code, so you can sometimes get
non-image files.

> [ This one is a lower priority, but someone might already know how to solve
> this ]
>
> 3) After this is done, I want to loop to download multiple pages. It would
> be cool if I downloaded pages 900 to 912, and each page's next link worked
> correctly to link to the local versions. […] Either way, I have a simple
> script that can convert 900 to 912 into the correct URLs, pausing in between
> each request.

Wrap your script inside a counted for-loop:

    for N in $(seq 900 912); do
        # variable N contains the right number here
        echo $N
    done

Actually, I suppose you use some Unix environment, where you have a powerful
collection of external tools (grep, seq) available and amazing shell
scripting abilities (like pipes and loops).

-- Petr
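That counted loop can be fleshed out into a page fetcher. A sketch, assuming 30 posts per page (consistent with start=27330 corresponding to page 912, since (912 - 1) * 30 = 27330) and keeping the 15-second delay from the original command; it only prints the wget commands, so remove the echo to actually download:

```shell
# One wget invocation per page, with the output file named by page number.
# Assumption: page N begins at post (N - 1) * 30.
for N in $(seq 900 912); do
    START=$(( (N - 1) * 30 ))
    URL="http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=${START}"
    echo wget -w 15 -O "${N}.html" "$URL"   # drop 'echo' to run for real
done
```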
Re: [Bug-wget] download page-requisites with spanning hosts
On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar <petr.pi...@atlas.cz> wrote:

> On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
>> Instead of creating something like 912.html or index.html, it instead
>> becomes: viewtopic@t=29807&postdays=0&postorder=asc&start=27330
>
> That's normal, because the server doesn't provide any useful alternative
> name via HTTP headers, which can be obtained using wget's option
> --content-disposition.

I already know how to get the page number (my python script converts 27330 to
912 and back), but I'm not sure how to tell wget what the output html file
should be named.

>> How do I make wget download all images on the page? I don't want to
>> recurse other hosts, or even sijun; just download this page, and all
>> images needed to display it.
>
> That's not an easy task, especially because all the big desktop images are
> stored on other servers. I think wget is not powerful enough to do it all
> on its own.

Are you saying because some services show a thumbnail, then click to get the
full image? I'm not worried about that, since the majority are full size in
the thread.

Would it be simpler to say something like: download page 912, recursion
level=1 (or 2?), except for non-image links? (So it only allows recursion on
images, i.e. downloading randomguyshost.com/3.png.) But the problem is that
it does not span any hosts? Is there a way I can achieve this if I do the
same, except allow spanning everybody, recurse level=1, and only recurse
images?

> I propose using other tools to extract the image URLs and then downloading
> them using wget. E.g.:
>
>     wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' |
>       grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | wget -i -

I guess I could use wget to get the html and parse that for image tags
manually, but then I don't get the forum thread comments. Which isn't
required, but would be nice.

Ok, will have to try it out. (In windows ATM so I can't pipe.)

> Actually, I suppose you use some Unix environment, where you have a
> powerful collection of external tools (grep, seq) available and amazing
> shell scripting abilities (like pipes and loops).
>
> -- Petr

Using python, and I have dual boot if needed.

-- Jake
Re: [Bug-wget] download page-requisites with spanning hosts
On Thu, Apr 30, 2009 at 03:31:21AM -0500, Jake b wrote:

> but i'm not sure how to tell wget what the output html file should be
> named.

    wget -O OUTPUT_FILE_NAME

>> How do I make wget download all images on the page? I don't want to
>> recurse other hosts, or even sijun; just download this page, and all
>> images needed to display it.
>
>> That's not an easy task, especially because all the big desktop images are
>> stored on other servers. I think wget is not powerful enough to do it all
>> on its own.
>
> Are you saying because some services show a thumbnail, then click to get
> the full image? […] Would it be simpler to say something like: download
> page 912, recursion level=1 (or 2?), except for non-image links? (So it
> only allows recursion on images, i.e. downloading randomguyshost.com/3.png.)

You can limit downloads according to file name extensions (option -A);
however, this will remove the sole main HTML file and prevent recursion. And
no, there is no option to download only files pointed to from a particular
HTML element like IMG.

Without the -A option, you get a lot of useless files (regardless of
spanning). If you look at the locations of the files you are interested in,
you will see that they are all located outside the Sijun domain, and every
page contains only a small number of them. Thus it is more efficient, and
friendlier to the servers, to first extract only these URLs and then download
only them.

> But the problem is that it does not span any hosts? Is there a way I can
> achieve this if I do the same, except allow spanning everybody, recurse
> level=1, and only recurse images?

There is the option -H for spanning. The following wget-only command does
what you want, but as I said, it produces a lot of useless requests and
files:

    wget -p -l 1 -H 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330'

>> I propose using other tools to extract the image URLs and then downloading
>> them using wget. E.g.:
>
> I guess I could use wget to get the html and parse that for image tags
> manually, but then I don't get the forum thread comments. Which isn't
> required, but would be nice.

You can do both: extract the image URLs and extract the comments.

>> wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' |
>>   grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | wget -i -
>
> Ok, will have to try it out. (In windows ATM so I can't pipe.)

AFAIK the Windows shells command.com and cmd.exe support pipes.

> Using python, and I have dual boot if needed.

Or you can execute programs connected through pipes in Python.

-- Petr
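The "do both" idea can be sketched as a two-step script: keep the page HTML on disk (so the thread comments survive) and extract the image URLs from the saved copy. Here a heredoc with made-up content stands in for the downloaded page, since the real first step needs the network:

```shell
# Step 1 (stand-in for illustration): in real use this would be
#   wget -O page.html 'http://forums.sijun.com/viewtopic.php?...'
cat > page.html <<'HTML'
<p>nice one!</p><img src="http://img99.imageshack.us/pic1.jpg">
<a href="http://i40.tinypic.com/pic2.png">full size</a>
HTML

# Step 2: extract the image URLs; page.html keeps the comments for reading.
grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' page.html > images.txt
cat images.txt   # these would then be fetched with: wget -w 15 -i images.txt
```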
[Bug-wget] download page-requisites with spanning hosts
I'm trying to download multiple pages from the sijun speedpaint thread so I
can use their images for my random desktop folder. I can download each page
by hand using firefox, but this becomes unwieldy, especially since the prev
button has a bit of a delay. (So I want to automate it, with delays and/or
speed caps to be friendly to the server.)

The wget command I am using:

    wget.exe -p -k -w 15 "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"

It has 2 problems:

1) Rename file: Instead of creating something like 912.html or index.html, it
instead becomes: viewtopic@t=29807&postdays=0&postorder=asc&start=27330

2) Images that span hosts are failing. I have page-requisites on, but since
some images are on tinypic, or imageshack, etc., it is not downloading them.
Meaning it looks like this:

    sijun/page912.php
    imageshack.com/1.png
    tinypic.com/2.png
    randomguyshost.com/3.png

Because of this, I cannot simply list all domains to span. I don't know all
the domains, since people have personal servers. How do I make wget download
all images on the page? I don't want to recurse other hosts, or even sijun;
just download this page, and all images needed to display it.

[ This one is a lower priority, but someone might already know how to solve
this ]

3) After this is done, I want to loop to download multiple pages. It would be
cool if I downloaded pages 900 to 912, and each page's next link worked
correctly to link to the local versions. I'm not sure if I can use wget's -k
option, or if that won't work because recursion on forums can be weird?
Either way, I have a simple script that can convert 900 to 912 into the
correct URLs, pausing in between each request. Maybe I will have to manually
modify the links using regexes, unless you know a shortcut?

thanks!

-- Jake