Re: trailing '/' of include-directories removed bug
you're right, the include-directories option operates much the same way
(my guess: in the interest of speed) as the rest of the accept/reject
options, which (others have also noticed) are a little flakey.

/a

On Fri, 13 Jun 2003, wei ye wrote:

> Did you test your patch? I applied it to my source tree and it doesn't
> work. There are a lot of files under http://biz.yahoo.com/edu/, but the
> patched build only downloaded index.html.
>
> [EMAIL PROTECTED] src]$ ./wget -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
> [EMAIL PROTECTED] src]$ ls biz.yahoo.com/
> edu/
> [EMAIL PROTECTED] src]$ ls biz.yahoo.com/edu/
> index.html
> [EMAIL PROTECTED] src]$
>
> Note that in the proclist() function, frontcmp(p, s) is supposed to
> return 1, but it returns 0. `p' is 'edu/', which keeps the trailing '/'
> from the parameter, and `s' is 'edu', the directory of the crawled URL.
> Since `s' doesn't start with `p', the match fails.
>
> If we pass the URL's 'path' instead of its 'dir' to accdir(), it may
> work.
>
> Actually, I really recommend changing the '--include-directories'
> parameter to '--include-urls' (and likewise for
> '--exclude-directories'). Then keeping the '/' characters in the
> parameter makes more sense and is easier to use. I used htdig before,
> which also uses 'exclude_urls: /cgi-bin/' in its configuration.
>
> Thanks very much!!
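The failed comparison described in the quoted message can be reproduced in isolation. The sketch below is not wget's source: `is_prefix` is a hypothetical stand-in that assumes frontcmp() is a plain "does `s` start with `p`" test, which is how the quoted message describes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for wget's frontcmp(): returns 1 when `prefix`
   is a leading substring of `s`, and 0 otherwise. */
static int is_prefix(const char *prefix, const char *s)
{
    assert(prefix != NULL && s != NULL);
    /* Walk both strings while the prefix has characters left and they agree. */
    while (*prefix != '\0' && *prefix == *s) {
        ++prefix;
        ++s;
    }
    /* The whole prefix was consumed only if it matched in full. */
    return *prefix == '\0';
}
```

With the values reported in the message, `is_prefix("edu/", "edu")` is 0 because the pattern is one character longer than the directory, while `is_prefix("edu", "edu")` is 1. Under this assumption, a pattern that keeps its trailing '/' can never match a directory string that lacks one.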
Re: trailing '/' of include-directories removed bug
no, i think your original idea of getting rid of the code that removes
the trailing slash is a better idea. i think this would fix it but keep
the degenerate case of root directory (whatever that's about):

Index: src/init.c
===================================================================
RCS file: /pack/anoncvs/wget/src/init.c,v
retrieving revision 1.54
diff -u -u -r1.54 init.c
--- src/init.c	2002/08/03 20:34:57	1.54
+++ src/init.c	2003/06/13 20:24:16
@@ -753,7 +753,6 @@
 
   if (*val)
     {
-      /* Strip the trailing slashes from directories.  */
       char **t, **seps;
 
       seps = sepstring (val);
@@ -761,10 +760,10 @@
         {
           int len = strlen (*t);
           /* Skip degenerate case of root directory.  */
-          if (len > 1)
+          if (len == 1)
             {
-              if ((*t)[len - 1] == '/')
-                (*t)[len - 1] = '\0';
+              if ((*t)[0] == '/')
+                (*t)[0] = '\0';
             }
         }
       *pvec = merge_vecs (*pvec, seps);

On Thu, 12 Jun 2003, wei ye wrote:

> For the situation where I only need '/r/', there is no option that lets
> me do that. If users need '/r*/', they should specify -I '/r*/'
> instead.
>
> A simple patch is attached; please consider it. Thanks!!
>
> [EMAIL PROTECTED] src]$ diff -u utils.c.orig utils.c
> --- utils.c.orig	Fri May 17 20:05:22 2002
> +++ utils.c	Thu Jun 12 20:24:21 2003
> @@ -696,7 +696,9 @@
>      else
>        {
>          char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
> -        if (frontcmp (p, s))
> +        /* if *p=c, pass if s is c or c/... not ca */
> +        int plen = strlen(p);
> +        if ( (strncmp (p, s, plen) == 0) && (s[plen] == '/' || s[plen] == '\0') )
>            break;
>        }
>    return *x;
> [EMAIL PROTECTED] src]$
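To see what the init.c change above does to individual entries, the patched loop body can be pulled out into a helper. This is only a sketch: `normalize_patched` is a hypothetical name, and the function assumes each list entry is a writable, NUL-terminated string as in the diff.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper mirroring the loop body after the proposed patch:
   only the lone root entry "/" is altered (it becomes the empty string);
   every other entry keeps its trailing '/' untouched. */
static void normalize_patched(char *dir)
{
    assert(dir != NULL);
    if (strlen(dir) == 1 && dir[0] == '/')
        dir[0] = '\0';
}
```

Under these assumptions "/edu/" stays "/edu/" and "/" becomes "", so any later comparison sees the trailing slash exactly as the user typed it.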
Re: trailing '/' of include-directories removed bug
Did you test your patch? I applied it to my source tree and it doesn't
work. There are a lot of files under http://biz.yahoo.com/edu/, but the
patched build only downloaded index.html.

[EMAIL PROTECTED] src]$ ./wget -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
[EMAIL PROTECTED] src]$ ls biz.yahoo.com/
edu/
[EMAIL PROTECTED] src]$ ls biz.yahoo.com/edu/
index.html
[EMAIL PROTECTED] src]$

Here is the debug info. Note that in the proclist() function,
frontcmp(p, s) is supposed to return 1, but it returns 0. `p' is 'edu/',
which keeps the trailing '/' from the parameter, and `s' is 'edu', the
directory of the crawled URL. Since `s' doesn't start with `p', the
match fails.

If we pass the URL's 'path' instead of its 'dir' to accdir(), it may
work.

Actually, I really recommend changing the '--include-directories'
parameter to '--include-urls' (and likewise for
'--exclude-directories'). Then keeping the '/' characters in the
parameter makes more sense and is easier to use. I used htdig before,
which also uses 'exclude_urls: /cgi-bin/' in its configuration.

[EMAIL PROTECTED] src]$ gdb wget
(gdb) b accdir
Breakpoint 1 at 0x806cb42: file utils.c, line 714.
(gdb) run -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
Starting program: /home/weiye/downloads/wget-1.8.2/src/wget -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
--18:55:07--  http://biz.yahoo.com/edu/
           => `biz.yahoo.com/edu/index.html'
Resolving biz.yahoo.com... done.
Connecting to biz.yahoo.com[66.163.175.141]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=> ] 6,741          6.43M/s

18:55:07 (6.43 MB/s) - `biz.yahoo.com/edu/index.html' saved [6741]

Breakpoint 1, accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:714
714       if (flags & ALLABS && *directory == '/')
(gdb) n
716       if (opt.includes)
(gdb)
718           if (!proclist (opt.includes, directory, flags))
(gdb) s
proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:690
690       for (x = strlist; *x; x++)
(gdb) n
691         if (has_wildcards_p (*x))
(gdb) p *x
$1 = 0x807f0a0 "/edu/"
(gdb) n
698             char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
(gdb)
699             if (frontcmp (p, s))
(gdb) p p
$2 = 0x807f0a1 "edu/"
(gdb) p s
$3 = 0x8089df0 "edu"
(gdb) p p
$4 = 0x807f0a1 "edu/"
(gdb) n
701           }
(gdb) bt
#0  proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:701
#1  0x806cb76 in accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:718
#2  0x8064d8d in download_child_p (upos=0x807e7e0, parent=0x808c800, depth=0, start_url_parsed=0x808, blacklist=0x807e100) at recur.c:514
#3  0x80648b0 in retrieve_tree (start_url=0x807e080 "http://biz.yahoo.com/edu/") at recur.c:348
#4  0x8062179 in main (argc=6, argv=0x9fbff444) at main.c:822
#5  0x804a20d in _start ()
(gdb)

Thanks very much!!

--- Aaron S. Hawley [EMAIL PROTECTED] wrote:

> no, i think your original idea of getting rid of the code that removes
> the trailing slash is a better idea. i think this would fix it but keep
> the degenerate case of root directory (whatever that's about):
>
> Index: src/init.c
> ===================================================================
> RCS file: /pack/anoncvs/wget/src/init.c,v
> retrieving revision 1.54
> diff -u -u -r1.54 init.c
> --- src/init.c	2002/08/03 20:34:57	1.54
> +++ src/init.c	2003/06/13 20:24:16
> @@ -753,7 +753,6 @@
>
>    if (*val)
>      {
> -      /* Strip the trailing slashes from directories.  */
>        char **t, **seps;
>
>        seps = sepstring (val);
> @@ -761,10 +760,10 @@
>          {
>            int len = strlen (*t);
>            /* Skip degenerate case of root directory.  */
> -          if (len > 1)
> +          if (len == 1)
>              {
> -              if ((*t)[len - 1] == '/')
> -                (*t)[len - 1] = '\0';
> +              if ((*t)[0] == '/')
> +                (*t)[0] = '\0';
>              }
>          }
>        *pvec = merge_vecs (*pvec, seps);

Wei Ye
Re: trailing '/' of include-directories removed bug
above the code segment you submitted (line 765 of init.c) is the
comment:

  /* Strip the trailing slashes from directories.  */

here are the manual notes on this option:

(from Recursive Accept/Reject Options)

`-I list'
`--include-directories=list'
    Specify a comma-separated list of directories you wish to follow
    when downloading (See section Directory-Based Limits for more
    details.) Elements of list may contain wildcards.

--- and ---

(from Directory-Based Limits)

`-I list'
`--include list'
`include_directories = list'
    `-I' option accepts a comma-separated list of directories included
    in the retrieval. Any other directories will simply be ignored. The
    directories are absolute paths.

    So, if you wish to download from `http://host/people/bozo/'
    following only links to bozo's colleagues in the `/people' directory
    and the bogus scripts in `/cgi-bin', you can specify:

        wget -I /people,/cgi-bin http://host/people/bozo/

On Wed, 11 Jun 2003, wei ye wrote:

> I'm trying to crawl a URL with the --include-directories='/r/'
> parameter.
>
> I expect to crawl '/r/*', but wget gives me '/r*'.
>
> Reading the code, it turns out that cmd_directory_vector() removes the
> trailing '/' of the include-directories value '/r/'.
>
> It's a minor bug, but I hope it can be fixed in the next version.
>
> Thanks!
>
> static int cmd_directory_vector(...)
> {
>   ...
>   if (len > 1)
>     {
>       if ((*t)[len - 1] == '/')
>         (*t)[len - 1] = '\0';
>     }
>   ...
> }
>
> Wei Ye
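The stripping step quoted above can be exercised on its own. The following is a minimal sketch rather than the code from init.c: `strip_trailing_slash` is a hypothetical helper that applies the same `len > 1` rule to a single writable string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper applying the quoted rule to one entry: drop a single
   trailing '/' unless the entry is just "/" (the root directory). */
static void strip_trailing_slash(char *dir)
{
    size_t len;

    assert(dir != NULL);
    len = strlen(dir);
    if (len > 1 && dir[len - 1] == '/')
        dir[len - 1] = '\0';
}
```

Given "/r/" this leaves "/r", which is why the value stored for `-I /r/` no longer carries the slash that was typed on the command line; "/" and "/people" pass through unchanged.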
Re: trailing '/' of include-directories removed bug
oh, i understand your problem. your request seems reasonable. i was
trying to see if anyone had an idea why it seemed to be more of a
feature than a bug.

On Thu, 12 Jun 2003, wei ye wrote:

> Please take a look at this example:
>
> $ \rm -rf biz.yahoo.com
> $ ls biz.yahoo.com
> $ wget -r --domains=biz.yahoo.com -I /r/ 'http://biz.yahoo.com/r/'
> $ ls biz.yahoo.com/
> r/  reports/  research/
> $
>
> I want only '/r/', but it crawls /r*, which includes /reports/ and
> /research/.
>
> Is this the expected result or a bug?
>
> Thanks a lot!
Re: trailing '/' of include-directories removed bug
Please take a look at this example:

$ \rm -rf biz.yahoo.com
$ ls biz.yahoo.com
$ wget -r --domains=biz.yahoo.com -I /r/ 'http://biz.yahoo.com/r/'
$ ls biz.yahoo.com/
r/  reports/  research/
$

I want only '/r/', but it crawls /r*, which includes /reports/ and
/research/.

Is this the expected result or a bug?

Thanks a lot!

--- Aaron S. Hawley [EMAIL PROTECTED] wrote:

> above the code segment you submitted (line 765 of init.c) is the
> comment:
>
>   /* Strip the trailing slashes from directories.  */
>
> here are the manual notes on this option:
>
> (from Recursive Accept/Reject Options)
>
> `-I list'
> `--include-directories=list'
>     Specify a comma-separated list of directories you wish to follow
>     when downloading (See section Directory-Based Limits for more
>     details.) Elements of list may contain wildcards.
>
> --- and ---
>
> (from Directory-Based Limits)
>
> `-I list'
> `--include list'
> `include_directories = list'
>     `-I' option accepts a comma-separated list of directories included
>     in the retrieval. Any other directories will simply be ignored. The
>     directories are absolute paths.
>
>     So, if you wish to download from `http://host/people/bozo/'
>     following only links to bozo's colleagues in the `/people' directory
>     and the bogus scripts in `/cgi-bin', you can specify:
>
>         wget -I /people,/cgi-bin http://host/people/bozo/

Wei Ye
Re: trailing '/' of include-directories removed bug
For the situation where I only need '/r/', there is no option that lets
me do that. If users need '/r*/', they should specify -I '/r*/' instead.

A simple patch is attached; please consider it. Thanks!!

[EMAIL PROTECTED] src]$ diff -u utils.c.orig utils.c
--- utils.c.orig	Fri May 17 20:05:22 2002
+++ utils.c	Thu Jun 12 20:24:21 2003
@@ -696,7 +696,9 @@
     else
       {
         char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
-        if (frontcmp (p, s))
+        /* if *p=c, pass if s is c or c/... not ca */
+        int plen = strlen(p);
+        if ( (strncmp (p, s, plen) == 0) && (s[plen] == '/' || s[plen] == '\0') )
           break;
       }
   return *x;
[EMAIL PROTECTED] src]$

--- Aaron S. Hawley [EMAIL PROTECTED] wrote:

> oh, i understand your problem. your request seems reasonable. i was
> trying to see if anyone had an idea why it seemed to be more of a
> feature than a bug.

Wei Ye
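The condition added by the utils.c patch can be tried outside wget. This is a sketch under assumptions: `dir_matches` is a hypothetical wrapper around the patched expression, and both arguments are taken to have their leading '/' already removed, as `p` and `s` do in proclist().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper around the patched test: `pattern` must match the
   start of `dir`, and the match must end at a path-component boundary
   (either the end of `dir` or a '/'). */
static int dir_matches(const char *pattern, const char *dir)
{
    size_t plen;

    assert(pattern != NULL && dir != NULL);
    plen = strlen(pattern);
    /* dir[plen] is read only after strncmp confirms dir has plen characters. */
    return strncmp(pattern, dir, plen) == 0
           && (dir[plen] == '/' || dir[plen] == '\0');
}
```

With pattern "r", the directories "r" and "r/news" are accepted while "reports" and "research" are rejected. The check relies on the pattern having no trailing '/': with pattern "r/", the directory "r/news" is rejected because the character after the match is 'n' rather than '/' or the terminator.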
Re: trailing '/' of include-directories removed bug
BTW, my wget version is 1.8.2.

Thanks!

--- wei ye [EMAIL PROTECTED] wrote:

> I'm trying to crawl a URL with the --include-directories='/r/'
> parameter.
>
> I expect to crawl '/r/*', but wget gives me '/r*'.
>
> Reading the code, it turns out that cmd_directory_vector() removes the
> trailing '/' of the include-directories value '/r/'.
>
> It's a minor bug, but I hope it can be fixed in the next version.
>
> Thanks!
>
> static int cmd_directory_vector(...)
> {
>   ...
>   if (len > 1)
>     {
>       if ((*t)[len - 1] == '/')
>         (*t)[len - 1] = '\0';
>     }
>   ...
> }

Wei Ye