Re: trailing '/' of include-directories removed bug

wei ye Fri, 13 Jun 2003 19:18:31 -0700

Did you test your patch? I patched it on my source code and it doesn't work.


There are a lot of files under http://biz.yahoo.com/edu/, but
the patched code only downloaded index.html.

[EMAIL PROTECTED] src]$ ./wget -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
[EMAIL PROTECTED] src]$ ls biz.yahoo.com/
edu/
[EMAIL PROTECTED] src]$ ls biz.yahoo.com/edu/
index.html
[EMAIL PROTECTED] src]$ 


Here is the debug info. Note that in the proclist() function, frontcmp(p, s)
is supposed to return 1, but it returns 0.
`p' is 'edu/', which keeps the trailing '/' from the parameter, and `s'
is 'edu', the directory of the crawled URL. Since `s' doesn't start with `p',
the match fails.
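
To make the failure concrete, here is a minimal sketch of a prefix test
that behaves the way the debug session shows frontcmp() behaving. The name
is_prefix() is hypothetical and written only for illustration; it is not
copied from utils.c.

```c
/* Hypothetical helper for illustration (not wget's code): return 1 when
   `prefix` is a leading substring of `s`, 0 otherwise.  */
int
is_prefix (const char *prefix, const char *s)
{
  /* Walk both strings while they agree and `prefix` is not exhausted.  */
  while (*prefix && *prefix == *s)
    {
      ++prefix;
      ++s;
    }
  /* It is a prefix only if all of `prefix` was consumed.  */
  return *prefix == '\0';
}
```

With this helper, is_prefix ("edu/", "edu") is 0 because the '/' in the
pattern has nothing to match in the directory name, while
is_prefix ("edu", "edu") is 1.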

If the URL's 'path' is passed to accdir() instead of its 'dir', it may work.
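
A rough sketch of that idea, assuming the URL path is stored without a
leading slash (like the "edu" seen in the debug output). The function
path_included() is hypothetical and only illustrates the comparison; it is
not a proposed replacement for accdir().

```c
#include <string.h>

/* Hypothetical helper: test a URL path (not just its directory part)
   against one -I entry.  A leading '/' on the entry is skipped, as
   proclist() does when ALLABS is set; the trailing '/' is kept.  */
int
path_included (const char *include, const char *path)
{
  if (*include == '/')
    ++include;
  /* The entry must match the beginning of the path exactly.  */
  return strncmp (include, path, strlen (include)) == 0;
}
```

Under these assumptions, the entry "/edu/" accepts the paths "edu/" and
"edu/index.html", but rejects "education/a.html" and the bare directory
name "edu".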

Actually, I really recommend changing the '--include-directories' parameter to
'--include-urls' (and likewise for the --exclude counterpart). Then keeping the
'/' characters in the parameter makes more sense and is easier to use. I used
htdig before, which also uses 'exclude_urls: /cgi-bin/' in its configuration.


[EMAIL PROTECTED] src]$ gdb wget
(gdb) b accdir
Breakpoint 1 at 0x806cb42: file utils.c, line 714.
(gdb) run -r  --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
Starting program: /home/weiye/downloads/wget-1.8.2/src/wget -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
--18:55:07--  http://biz.yahoo.com/edu/
           => `biz.yahoo.com/edu/index.html'
Resolving biz.yahoo.com... done.
Connecting to biz.yahoo.com[66.163.175.141]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=>                                          ] 6,741          6.43M/s    
        

18:55:07 (6.43 MB/s) - `biz.yahoo.com/edu/index.html' saved [6741]


Breakpoint 1, accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:714
714       if (flags & ALLABS && *directory == '/')
(gdb) n
716       if (opt.includes)
(gdb) 
718           if (!proclist (opt.includes, directory, flags))
(gdb) s
proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:690
690       for (x = strlist; *x; x++)
(gdb) n
691         if (has_wildcards_p (*x))
(gdb) p *x
$1 = 0x807f0a0 "/edu/"
(gdb) n
698             char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
(gdb) 
699             if (frontcmp (p, s))
(gdb) p p
$2 = 0x807f0a1 "edu/"
(gdb) p s
$3 = 0x8089df0 "edu"
(gdb) p p
$4 = 0x807f0a1 "edu/"
(gdb) n
701           }
(gdb) bt
#0  proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:701
#1  0x806cb76 in accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:718
#2  0x8064d8d in download_child_p (upos=0x807e7e0, parent=0x808c800, depth=0, 
    start_url_parsed=0x8080000, blacklist=0x807e100) at recur.c:514
#3  0x80648b0 in retrieve_tree (start_url=0x807e080 "http://biz.yahoo.com/edu/")
    at recur.c:348
#4  0x8062179 in main (argc=6, argv=0x9fbff444) at main.c:822
#5  0x804a20d in _start ()
(gdb) 

Thanks very much!!

--- "Aaron S. Hawley" <[EMAIL PROTECTED]> wrote:
> no, i think your original idea of getting rid of the code that removes the
> trailing slash is a better idea.  i think this would fix it but keep the
> "degenerate case of root directory" (whatever that's about):
> 
> Index: src/init.c
> ===================================================================
> RCS file: /pack/anoncvs/wget/src/init.c,v
> retrieving revision 1.54
> diff -u -u -r1.54 init.c
> --- src/init.c        2002/08/03 20:34:57     1.54
> +++ src/init.c        2003/06/13 20:24:16
> @@ -753,7 +753,6 @@
> 
>    if (*val)
>      {
> -      /* Strip the trailing slashes from directories.  */
>        char **t, **seps;
> 
>        seps = sepstring (val);
> @@ -761,10 +760,10 @@
>       {
>         int len = strlen (*t);
>         /* Skip degenerate case of root directory.  */
> -       if (len > 1)
> +       if (len == 1)
>           {
> -           if ((*t)[len - 1] == '/')
> -             (*t)[len - 1] = '\0';
> +           if ((*t)[0] == '/')
> +             (*t)[0] = '\0';
>           }
>       }
>        *pvec = merge_vecs (*pvec, seps);
> 
> On Thu, 12 Jun 2003, wei ye wrote:
> 
> > For the situation I only need '/r/', there is no option for I to do that.
> >
> > If user need '/r*/', they should specify -I '/r*/' instead.
> >
> > Simple patch attached, please consider it. Thanks!!
> >
> > [EMAIL PROTECTED] src]$ diff  -u utils.c.orig utils.c
> > --- utils.c.orig        Fri May 17 20:05:22 2002
> > +++ utils.c     Thu Jun 12 20:24:21 2003
> > @@ -696,7 +696,9 @@
> >      else
> >        {
> >         char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
> > -       if (frontcmp (p, s))
> > +       /* if *p="c", pass if s is "c" or "c/..." not "ca...". */
> > +       int plen = strlen(p);
> > +       if ( (strncmp (p, s, plen) == 0) && (s[plen] == '/' || s[plen] == '\0') )
> >           break;
> >        }
> >    return *x;
> > [EMAIL PROTECTED] src]$
> 
> 
> -- 
> I get threatening vacation messages from "J K", too.


=====
Wei Ye
