Re: trailing '/' of include-directories removed bug

Aaron S. Hawley Mon, 16 Jun 2003 15:06:55 -0700

You're right, the include-directories option operates much the same way
as the rest of the accept/reject options (my guess is that this is in the
interest of speed), which, as others have also noticed, is a little flaky.
/a

On Fri, 13 Jun 2003, wei ye wrote:

> Did you test your patch? I applied it to my source code and it doesn't work.
>
> There are a lot of files under http://biz.yahoo.com/edu/, but
> the patched code only downloaded index.html.
>
> [EMAIL PROTECTED] src]$ ./wget -r --domains=biz.yahoo.com -I /edu/
> http://biz.yahoo.com/edu/
> [EMAIL PROTECTED] src]$ ls biz.yahoo.com/
> edu/
> [EMAIL PROTECTED] src]$ ls biz.yahoo.com/edu/
> index.html
> [EMAIL PROTECTED] src]$
>
>
> Here is the debug info. Note that in the proclist() function, frontcmp(p, s)
> is supposed to return 1, but it returns 0.
> `p' is 'edu/', which keeps the trailing '/' from the parameter, and `s'
> is 'edu', the directory of the crawled URL. Since `s' doesn't start with `p',
> the match fails.
>
> If the URL's 'path' is passed to accdir() instead of its 'dir', it may work.
>
> Actually, I really recommend changing the '--include-directories' parameter to
> '--include-urls' (and likewise for '--exclude-directories'). Keeping the '/'
> characters in the parameter would then make more sense and be easier to use.
> I used htdig before, which likewise uses 'exclude_urls: /cgi-bin/' in its
> configuration.
>
>
> [EMAIL PROTECTED] src]$ gdb wget
> (gdb) b accdir
> Breakpoint 1 at 0x806cb42: file utils.c, line 714.
> (gdb) run -r  --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
> Starting program: /home/weiye/downloads/wget-1.8.2/src/wget -r
> --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
> --18:55:07--  http://biz.yahoo.com/edu/
>            => `biz.yahoo.com/edu/index.html'
> Resolving biz.yahoo.com... done.
> Connecting to biz.yahoo.com[66.163.175.141]:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
>
>     [ <=>                                          ] 6,741          6.43M/s
>
>
> 18:55:07 (6.43 MB/s) - `biz.yahoo.com/edu/index.html' saved [6741]
>
>
> Breakpoint 1, accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:714
> 714       if (flags & ALLABS && *directory == '/')
> (gdb) n
> 716       if (opt.includes)
> (gdb)
> 718           if (!proclist (opt.includes, directory, flags))
> (gdb) s
> proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:690
> 690       for (x = strlist; *x; x++)
> (gdb) n
> 691         if (has_wildcards_p (*x))
> (gdb) p *x
> $1 = 0x807f0a0 "/edu/"
> (gdb) n
> 698             char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
> (gdb)
> 699             if (frontcmp (p, s))
> (gdb) p p
> $2 = 0x807f0a1 "edu/"
> (gdb) p s
> $3 = 0x8089df0 "edu"
> (gdb) p p
> $4 = 0x807f0a1 "edu/"
> (gdb) n
> 701           }
> (gdb) bt
> #0  proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:701
> #1  0x806cb76 in accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:718
> #2  0x8064d8d in download_child_p (upos=0x807e7e0, parent=0x808c800, depth=0,
>     start_url_parsed=0x8080000, blacklist=0x807e100) at recur.c:514
> #3  0x80648b0 in retrieve_tree (start_url=0x807e080 "http://biz.yahoo.com/edu/")
>     at recur.c:348
> #4  0x8062179 in main (argc=6, argv=0x9fbff444) at main.c:822
> #5  0x804a20d in _start ()
> (gdb)
>
> Thanks very much!!
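
To make the failure in the gdb session above concrete, here is a minimal
sketch of the comparison. It assumes frontcmp(p, s) is a plain prefix test
that returns nonzero only when `p' is a leading part of `s', which is what
the quoted description implies. The names frontcmp_sketch and
dir_prefix_match are hypothetical and are not taken from the wget source;
the second function is only one possible way to tolerate a trailing '/',
not the patch under discussion.

```c
#include <stddef.h>
#include <string.h>

/* Assumed behaviour of the prefix test: nonzero when `prefix' is a
   leading part of `s'.  With prefix "edu/" and s "edu" the loop stops at
   the '/' in prefix, so the result is 0, as seen in the debugger.  */
int
frontcmp_sketch (const char *prefix, const char *s)
{
  for (; *prefix && *s && *prefix == *s; ++prefix, ++s)
    ;
  return *prefix == '\0';
}

/* Hypothetical alternative: ignore trailing '/' characters in the
   pattern, then require the match to end on a directory boundary in
   `dir' (end of string or a '/').  */
int
dir_prefix_match (const char *pattern, const char *dir)
{
  size_t n = strlen (pattern);

  /* Drop trailing slashes so "edu/" is compared as "edu".  */
  while (n > 0 && pattern[n - 1] == '/')
    --n;

  if (strncmp (pattern, dir, n) != 0)
    return 0;

  /* Accept "edu" and "edu/sub", but not "education".  */
  return dir[n] == '\0' || dir[n] == '/';
}
```

With these definitions, frontcmp_sketch ("edu/", "edu") is 0 while
dir_prefix_match ("edu/", "edu") is 1. The boundary check is a deliberate
difference: a plain prefix test with pattern "edu" would also accept a
directory named "education", which the sketch rejects.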

-- 
Consider supporting GNU Software and the Free Software Foundation
By Buying Stuff - http://www.gnu.org/gear/
                      (GNU and FSF are not responsible for this promotion
   nor do they necessarily agree with the views or opinions of the author)
