Re: trailing '/' of include-directories removed bug

2003-06-16 Thread Aaron S. Hawley
you're right, the include-directories option operates much the same way
(my guess: in the interest of speed) as the rest of the accept/reject
options,

which (as others have also noticed) is a little flaky.
/a


-- 
Consider supporting GNU Software and the Free Software Foundation
By Buying Stuff - http://www.gnu.org/gear/
  (GNU and FSF are not responsible for this promotion
   nor do they necessarily agree with the views or opinions of the author)


Re: trailing '/' of include-directories removed bug

2003-06-13 Thread Aaron S. Hawley
no, i think your original idea of getting rid of the code that removes the
trailing slash is a better idea.  i think this would fix it but keep the
degenerate case of root directory (whatever that's about):

Index: src/init.c
===================================================================
RCS file: /pack/anoncvs/wget/src/init.c,v
retrieving revision 1.54
diff -u -u -r1.54 init.c
--- src/init.c  2002/08/03 20:34:57 1.54
+++ src/init.c  2003/06/13 20:24:16
@@ -753,7 +753,6 @@

   if (*val)
     {
-      /* Strip the trailing slashes from directories.  */
       char **t, **seps;

       seps = sepstring (val);
@@ -761,10 +760,10 @@
         {
           int len = strlen (*t);
           /* Skip degenerate case of root directory.  */
-          if (len > 1)
+          if (len == 1)
             {
-              if ((*t)[len - 1] == '/')
-                (*t)[len - 1] = '\0';
+              if ((*t)[0] == '/')
+                (*t)[0] = '\0';
             }
         }
       *pvec = merge_vecs (*pvec, seps);



-- 
I get threatening vacation messages from J K, too.


Re: trailing '/' of include-directories removed bug

2003-06-13 Thread wei ye

Did you test your patch? I applied it to my source tree and it doesn't work.

There are a lot of files under http://biz.yahoo.com/edu/, but
the patched code only downloaded index.html.

[EMAIL PROTECTED] src]$ ./wget -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
[EMAIL PROTECTED] src]$ ls biz.yahoo.com/
edu/
[EMAIL PROTECTED] src]$ ls biz.yahoo.com/edu/
index.html
[EMAIL PROTECTED] src]$ 


Here is the debug info. Note that in the proclist() function, frontcmp(p, s)
is supposed to return 1, but it returns 0.
`p' is 'edu/', which keeps the trailing '/' from the parameter, and `s'
is 'edu', the directory of the crawled URL. Since 's' doesn't start with 'p',
the match fails.

If the URL's 'path' were passed to accdir() instead of its 'dir', it might work.

Actually, I really recommend changing the '--include-directories' parameter to
'--include-urls' (and likewise for '--exclude-directories'). Then keeping the
'/' characters in the parameter makes more sense and is easier to use. I used
htdig before, which uses 'exclude_urls: /cgi-bin/' in its configuration as well.


[EMAIL PROTECTED] src]$ gdb wget
(gdb) b accdir
Breakpoint 1 at 0x806cb42: file utils.c, line 714.
(gdb) run -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
Starting program: /home/weiye/downloads/wget-1.8.2/src/wget -r --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
--18:55:07--  http://biz.yahoo.com/edu/
           => `biz.yahoo.com/edu/index.html'
Resolving biz.yahoo.com... done.
Connecting to biz.yahoo.com[66.163.175.141]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=> ] 6,741  6.43M/s

18:55:07 (6.43 MB/s) - `biz.yahoo.com/edu/index.html' saved [6741]

Breakpoint 1, accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:714
714       if (flags & ALLABS && *directory == '/')
(gdb) n
716       if (opt.includes)
(gdb)
718       if (!proclist (opt.includes, directory, flags))
(gdb) s
proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:690
690       for (x = strlist; *x; x++)
(gdb) n
691         if (has_wildcards_p (*x))
(gdb) p *x
$1 = 0x807f0a0 "/edu/"
(gdb) n
698             char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
(gdb)
699             if (frontcmp (p, s))
(gdb) p p
$2 = 0x807f0a1 "edu/"
(gdb) p s
$3 = 0x8089df0 "edu"
(gdb) p p
$4 = 0x807f0a1 "edu/"
(gdb) n
701           }
(gdb) bt
#0  proclist (strlist=0x807f090, s=0x8089df0 "edu", flags=ALLABS) at utils.c:701
#1  0x806cb76 in accdir (directory=0x8089df0 "edu", flags=ALLABS) at utils.c:718
#2  0x8064d8d in download_child_p (upos=0x807e7e0, parent=0x808c800, depth=0, start_url_parsed=0x808, blacklist=0x807e100) at recur.c:514
#3  0x80648b0 in retrieve_tree (start_url=0x807e080 "http://biz.yahoo.com/edu/") at recur.c:348
#4  0x8062179 in main (argc=6, argv=0x9fbff444) at main.c:822
#5  0x804a20d in _start ()
(gdb) 

Thanks very much!!



=
Wei Ye


Re: trailing '/' of include-directories removed bug

2003-06-12 Thread Aaron S. Hawley
above the code segment you submitted (line 765 of init.c) is the
comment:

/* Strip the trailing slashes from directories.  */

here are the manual notes on this option:

(from Recursive Accept/Reject Options)

`-I list'
`--include-directories=list'
Specify a comma-separated list of directories you wish to follow when
downloading (See section Directory-Based Limits for more details.)
Elements of list may contain wildcards.

 --- and ---

(from Directory-Based Limits)

`-I list'
`--include list'
`include_directories = list'
`-I' option accepts a comma-separated list of directories included in
the retrieval. Any other directories will simply be ignored. The
directories are absolute paths. So, if you wish to download from
`http://host/people/bozo/' following only links to bozo's colleagues in
the `/people' directory and the bogus scripts in `/cgi-bin', you can
specify:

wget -I /people,/cgi-bin http://host/people/bozo/

---


-- 
Yahweh commanded Abraham to sacrifice his only son Isaac on the top of a
mountain. When Abraham asked why, Yahweh replied because 'I am God.' When
I heard this story the first time, I promised myself to check out
atheism.  -- Louis Proyect www.marxmail.org/


Re: trailing '/' of include-directories removed bug

2003-06-12 Thread Aaron S. Hawley
oh, i understand your problem.  your request seems reasonable.  i was
trying to see if anyone had an idea why it seemed to be more of a
feature than a bug.


-- 
Fight for Free Digital Speech
www.digitalspeech.org


Re: trailing '/' of include-directories removed bug

2003-06-12 Thread wei ye

Please take a look at this example:
$ \rm -rf biz.yahoo.com
$ ls biz.yahoo.com
$ wget -r  --domains=biz.yahoo.com -I /r/ 'http://biz.yahoo.com/r/'
$ ls biz.yahoo.com/
r/  reports/  research/
$

I want only '/r/', but it crawls /r*, which includes /reports/ and /research/.

Is this the expected result, or a bug?

Thanks a lot!




=
Wei Ye



Re: trailing '/' of include-directories removed bug

2003-06-12 Thread wei ye

For the situation where I only need '/r/', there is no option that lets me do that.

If users need '/r*/', they should specify -I '/r*/' instead.

Simple patch attached, please consider it. Thanks!!

[EMAIL PROTECTED] src]$ diff -u utils.c.orig utils.c
--- utils.c.orig        Fri May 17 20:05:22 2002
+++ utils.c     Thu Jun 12 20:24:21 2003
@@ -696,7 +696,9 @@
       else
         {
           char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
-          if (frontcmp (p, s))
+          /* if *p=c, pass if s is c or c/... not ca */
+          int plen = strlen(p);
+          if ( (strncmp (p, s, plen) == 0) && (s[plen] == '/' || s[plen] == '\0') )
             break;
         }
   return *x;
[EMAIL PROTECTED] src]$




=
Wei Ye



Re: trailing '/' of include-directories removed bug

2003-06-11 Thread wei ye
BTW, my wget version is 1.8.2.

Thanks!



=
Wei Ye



trailing '/' of include-directories removed bug

2003-06-11 Thread wei ye

I'm trying to crawl a URL with the --include-directories='/r/'
parameter.

I expect to crawl '/r/*', but wget gives me '/r*'.

Reading the code, it turns out that cmd_directory_vector()
removes the trailing '/' from the include-directories value '/r/'.

It's a minor bug, but I hope it can be fixed in the next version.

Thanks!

static int cmd_directory_vector(...) {
  ...
  if (len > 1)
    {
      if ((*t)[len - 1] == '/')
        (*t)[len - 1] = '\0';
    }
  ...
}

=
Wei Ye
