Re: Download all the necessary files and linked images

2006-05-17 Thread Jean-Marc Molina
Hrvoje Niksic wrote:
 I think you have a point there -- -A shouldn't so blatantly invalidate
 -p.  That would be IMHO the best fix to the problem you're
 encountering.

Frank mentioned that limitation in his first reply.





Re: Download all the necessary files and linked images

2006-03-24 Thread Hrvoje Niksic
thomas [EMAIL PROTECTED] writes:

 i tried adding '-r -l1 -A.pdf' but that removes the html page and all the
 '-p' files.

How about -r -l1 -R.html?  That would download the HTML and the linked
contents, but not other HTMLs.
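
Spelled out as a full command, that suggestion would look something like this
(the URL is only a placeholder):

  # recurse one level, rejecting further HTML pages, as suggested above
  wget -r -l1 -R.html http://example.com/pages/index.html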


Re: Download all the necessary files and linked images

2006-03-24 Thread thomas
well that doesn't work in most real case situations since .html's are a
minority nowadays, so you'd get all the dynamic pages (.php, urls with no
extension, etc).

i feel like the desired behavior is closer to -p than -r. it seems kind of
unnatural to me that --accept totally overrides -p, but on the other hand the
current -A behavior is important in the context of -r.

what may be needed is an --accept-also option similar to --accept but less
extremist. the code that handles the inlines for -p could then check if links
are in the --accept-also format and if so include them as well.

i could then do a 'wget -p -k --accept-also .pdf'

tc

On 3/24/06, Hrvoje Niksic [EMAIL PROTECTED] wrote:
 thomas [EMAIL PROTECTED] writes:
  i tried adding '-r -l1 -A.pdf' but that removes the html page and all
  the '-p' files.

 How about -r -l1 -R.html?  That would download the HTML and the linked
 contents, but not other HTMLs.



Re: Download all the necessary files and linked images

2006-03-24 Thread Hrvoje Niksic
thomas [EMAIL PROTECTED] writes:

 i feel like the desired behavior is closer to -p than -r. it seems
 kind of unnatural to me that --accept totally overrides -p but on
 the other hand the current -A behavior is important in the context
 of -r.

I think you have a point there -- -A shouldn't so blatantly invalidate
-p.  That would be IMHO the best fix to the problem you're
encountering.


Re: Download all the necessary files and linked images

2006-03-11 Thread Jean-Marc MOLINA
Tobias Tiederle wrote:
 I just set up my compile environment for WGet again.
 When I did regex support, I had the same problem with exclusion, so I
 introduced a new parameter --follow-excluded-html.
 (Which is of course the default) but you can turn it off with
 --no-follow-excluded-html...

 See attached patch for current trunk.

Nice! I was planning to build wget anyway. However, until the patch is
released in an official wget version, I think I will stick to the idea of
post-processing the archive using a PHP script.

Thanks!






Re: Download all the necessary files and linked images

2006-03-11 Thread Jean-Marc MOLINA
Mauro Tortonesi wrote:
 although i really dislike the name --no-follow-excluded-html, i
 certainly agree on the necessity to introduce such a feature into
 wget.

 can we come up with a better name (and reach consensus on that)
 before i include this feature in wget 1.11?

I agree the "no" prefix shouldn't be used; it would be better to stick to
wget's naming scheme: follow_html = on/off, like the existing follow_ftp or
follow_tags options. It's a boolean option, so there shouldn't be two options
to represent each state. What do you think?
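
For comparison, the existing boolean options already follow that single-name,
on/off pattern in .wgetrc; a minimal sketch, with the proposed name shown only
as a suggestion (follow_html does not exist in wget):

  # existing boolean option, toggled with on/off
  follow_ftp = off
  # the proposed equivalent for HTML, following the same naming scheme
  # follow_html = off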








Re: Download all the necessary files and linked images

2006-03-10 Thread Mauro Tortonesi

Tobias Tiederle wrote:

Hi,

I just set up my compile environment for WGet again.
When I did regex support, I had the same problem with exclusion, so I
introduced a new parameter --follow-excluded-html.
(Which is of course the default) but you can turn it off with
--no-follow-excluded-html...

See attached patch for current trunk.


although i really dislike the name --no-follow-excluded-html, i 
certainly agree on the necessity to introduce such a feature into wget.


can we come up with a better name (and reach consensus on that) before i 
include this feature in wget 1.11?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Download all the necessary files and linked images

2006-03-09 Thread Frank McCown

Jean-Marc MOLINA wrote:

Hello,

I want to archive an HTML page and « all the files that are necessary to
properly display » it (Wget manual), plus all the linked images
(<a href="linked_image_url"><img src="inlined_image_url"></a>). I tried most
options and features: recursive archiving, including and excluding
directories and file types... But I can't come up with the right options to
only archive the index.html page from the following hierarchy:

/pages/index.html  ; displays image_1.png, links to image_2.png
/pages/page_1.html ; linked from index.html
/pages/images/image_1.png
/images/image_2.png

Consider image_2.png as a thumbnail of image_1.png; that's why it's so
important to archive it.

The archive I want to get:

/pages/index.html
/pages/images/image_1.png
/images/image_2.png

If I set -r -l1 (recursion on, level 1) and -p (--page-requisites: necessary
files) I will get page_1.html, and I don't want it. And it seems that
excluding the /pages directory or only including png files doesn't affect the
-p option's behaviour.
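
For reference, the attempt described above would presumably have looked
something like this (the URL is only a placeholder):

  # one level of recursion plus the page requisites
  wget -r -l1 -p http://example.com/pages/index.html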

How can I force Wget not to archive the page_1.html file? At least I would
like it to clean up the archive at the end. Also note that the page I'm trying
to archive links to many pages that I want to exclude from the archive, so I
can't afford to clean it up manually.

JM.


I'm afraid wget won't do exactly what you want it to do.  Future 
versions of wget may enable you to specify a wildcard to select which 
files you'd like to download, but I don't know when you can expect that 
behavior.


In the meantime I'd recommend writing a script which performs a wget -p 
on the index.html file and then pulls out all links to images and feeds 
those to wget for retrieval.
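
A rough sketch of such a script, assuming a Unix shell, GNU grep, and that the
image links in the page are absolute URLs (the URL, file names and image
extensions below are placeholders; relative links would first need to be
resolved against the page URL):

  #!/bin/sh
  # Sketch only: fetch the page plus its requisites, then harvest the
  # image URLs it links to and hand them back to wget.
  URL=http://example.com/pages/index.html

  # 1. the page and everything needed to display it
  wget -p "$URL"

  # 2. pull href targets that look like images out of the saved copy
  #    (wget -p stores it under a host directory by default)
  grep -o 'href="[^"]*\.\(png\|jpg\|gif\)"' example.com/pages/index.html \
    | sed -e 's/^href="//' -e 's/"$//' > image-links.txt

  # 3. fetch the harvested links, recreating their directory structure
  wget -x -i image-links.txt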


Regards,
Frank


Re: Download all the necessary files and linked images

2006-03-09 Thread Jean-Marc MOLINA
Frank McCown wrote:
 I'm afraid wget won't do exactly what you want it to do.  Future
 versions of wget may enable you to specify a wildcard to select which
 files you'd like to download, but I don't know when you can expect
 that behavior.

The more I use wget, the more I like it, even if I use HTTrack more often.
So someday I hope I will find some time to contribute to the project. There
are a few features I would like it to support, like the wildcard feature
you mention. Actually it supports regular expressions but it seems the
--page-requisites option overrides all applied filters.

 In the meantime I'd recommend writing a script which performs a wget
 -P on the index.html file and then pulls out all links to images and
 feeds those to wget for retrieval.

Yes I also thought about post-processing the archive using a simple PHP
script. In fact... I think I will try that solution in the next few days.
Stay tuned :).





Re: Download all the necessary files and linked images

2006-03-09 Thread Jean-Marc MOLINA
Frank McCown wrote:
 I'm afraid wget won't do exactly what you want it to do.  Future
 versions of wget may enable you to specify a wildcard to select which
 files you'd like to download, but I don't know when you can expect
 that behavior.

I have another opinion about that limitation. Could it be considered a bug?
From the "Types of Files" section of the manual we can read: « Note that these
two options do not affect the downloading of html files; Wget must load all
the htmls to know where to go at all -- recursive retrieval would make no
sense otherwise. ». It means the accept and reject options don't work on HTML
files. But I think they should, especially in this case, because you
deliberately have to exclude them. Excluding them makes sense. So I don't
really know what to do... Consider the problem as a bug, as a new feature to
implement, or as an existing feature that should be redesigned. It's pretty
tricky.
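
To make the quoted passage concrete, a small illustration (the URL is only a
placeholder), consistent with what thomas reported earlier in the thread:

  # index.html is still fetched so its links can be harvested; since it does
  # not match the accept list, wget then removes the local copy at the end
  # ("Removing ... since it should be rejected")
  wget -r -l1 -A.png http://example.com/pages/index.html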

I guess post-processing the archive using a PHP script is okay for now.
After all it seems I was the only one to ever request such a feature.





Re: Download all the necessary files and linked images

2006-03-09 Thread Tobias Tiederle
Hi,

Jean-Marc MOLINA schrieb:
 I have another opinion about that limitation. Could it be considered a bug?
 From the "Types of Files" section of the manual we can read: « Note that
 these two options do not affect the downloading of html files; Wget must
 load all the htmls to know where to go at all -- recursive retrieval would
 make no sense otherwise. ». It means the accept and reject options don't
 work on HTML files. But I think they should, especially in this case,
 because you deliberately have to exclude them. Excluding them makes sense.
 So I don't really know what to do... Consider the problem as a bug, as a new
 feature to implement, or as an existing feature that should be redesigned.
 It's pretty tricky.

I just set up my compile environment for WGet again.
When I did regex support, I had the same problem with exclusion, so I
introduced a new parameter --follow-excluded-html.
(Which is of course the default) but you can turn it off with
--no-follow-excluded-html...

See attached patch for current trunk.

TT
Index: trunk/src/init.c
===================================================================
--- trunk/src/init.c	(revision 2133)
+++ trunk/src/init.c	(working copy)
@@ -146,6 +146,7 @@
 #endif
   { "excludedirectories", &opt.excludes,        cmd_directory_vector },
   { "excludedomains",   &opt.exclude_domains,   cmd_vector },
+  { "followexcluded",   &opt.followexcluded,    cmd_boolean },
   { "followftp",        &opt.follow_ftp,        cmd_boolean },
   { "followtags",       &opt.follow_tags,       cmd_vector },
   { "forcehtml",        &opt.force_html,        cmd_boolean },
@@ -277,6 +278,7 @@
 
   opt.cookies = true;
   opt.verbose = -1;
+  opt.followexcluded = 1;
   opt.ntry = 20;
   opt.reclevel = 5;
   opt.add_hostdir = true;
Index: trunk/src/main.c
===================================================================
--- trunk/src/main.c	(revision 2133)
+++ trunk/src/main.c	(working copy)
@@ -158,6 +158,7 @@
     { "exclude-directories", 'X', OPT_VALUE, "excludedirectories", -1 },
     { "exclude-domains", 0, OPT_VALUE, "excludedomains", -1 },
     { "execute", 'e', OPT__EXECUTE, NULL, required_argument },
+    { "follow-excluded-html", 0, OPT_BOOLEAN, "followexcluded", -1 },
     { "follow-ftp", 0, OPT_BOOLEAN, "followftp", -1 },
     { "follow-tags", 0, OPT_VALUE, "followtags", -1 },
     { "force-directories", 'x', OPT_BOOLEAN, "dirstruct", -1 },
@@ -611,6 +612,9 @@
   -X,  --exclude-directories=LIST  list of excluded directories.\n"),
     N_("\
   -np, --no-parent                 don't ascend to the parent directory.\n"),
+    N_("\
+       --follow-excluded-html      turns on downloading of excluded files for\n\
+                                   inspection (this is the default).\n"),
     "\n",
 
     N_("Mail bug reports and suggestions to <[EMAIL PROTECTED]>.\n")
Index: trunk/src/recur.c
===================================================================
--- trunk/src/recur.c	(revision 2133)
+++ trunk/src/recur.c	(working copy)
@@ -511,13 +511,14 @@
       && !(has_html_suffix_p (u->file)
 	   /* The exception only applies to non-leaf HTMLs (but -p
 	      always implies non-leaf because we can overstep the
-	      maximum depth to get the requisites): */
-	   && (/* non-leaf */
+	      maximum depth to get the requisites):
+	      No exception if the user specified no-follow-excluded */
+	   && (opt.followexcluded && (/* non-leaf */
 	       opt.reclevel == INFINITE_RECURSION
 	       /* also non-leaf */
 	       || depth < opt.reclevel - 1
 	       /* -p, which implies non-leaf (see above) */
-	       || opt.page_requisites)))
+	       || opt.page_requisites))))
     {
       if (!acceptable (u->file))
 	{
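
Assuming the patch behaves as its help text describes, usage would presumably
look something like this (the URL is only a placeholder): linked HTML pages
rejected by -R would then not be fetched at all, not even for link harvesting,
while the starting page is still retrieved. Whether the starting page itself
is kept afterwards is a separate question, as discussed earlier in the thread.

  # with the patch applied: apply -R to HTML links instead of always following them
  wget -r -l1 -R.html --no-follow-excluded-html http://example.com/pages/index.html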