Re: About Automated Unit Test for Wget

2008-04-06 Thread Yoshihiro Tanaka
2008/4/5, Micah Cowan [EMAIL PROTECTED]:

  Daniel Stenberg wrote:
   This system allows us to write unit-tests if we'd like to, but mostly so
   far we've focused to test it system-wide. It is hard enough for us!


 Yeah, I thought I'd seen something like that; I was thinking we might
  even be able to appropriate some of that, if that looked doable. Except
  that I preferred faking the server completely, so I could deal better
  with cross-site issues, which AFAICT are significantly more important to
  Wget than they are to Curl.


The abstraction of the network API seems to need more discussion,
so I would like to focus on server emulation first.

By the way, how about using LD_PRELOAD?
I tested it a little and it seems to work. With it, we can test by
overriding the socket interface without changing Wget's real source
code.

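-- main.c --
#include <stdio.h>

int main(void)
{
    puts("Helow Wgets\n");
    return 0;
}

-- testputs.c --
#include <stdio.h>

/* Replacement for puts(); LD_PRELOAD makes the program call this one
   instead of the libc version. */
int puts(const char *str)
{
    while (*str)
        putchar(*str++);
    printf("This is a test module");
    putchar('\n');
    return 0;   /* puts() reports success with a non-negative value */
}
--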


-- Compile as follows:

$ gcc main.c -o main
$ gcc -fPIC -shared -o testputs.so testputs.c



-- Run as follows:

$ ./main
Helow Wgets

$ LD_PRELOAD=./testputs.so ./main
Helow Wgets
This is a test module


--
I found this technique on the net, and the sample even uses Wget as its
target. It overrides socket, close, and connect:
http://www.t-dori.net/forensics/hook_tcp.cpp
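
A minimal sketch of the same trick applied to connect(), not taken from the sample above: the replacement records the IPv4 endpoint the program asked for and pretends the connection succeeded, so no packet leaves the machine. The names last_host and last_port are hypothetical bookkeeping for a test to inspect. The sketch assumes it is compiled without _GNU_SOURCE, because glibc declares connect() with a different argument type when that macro is defined.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical bookkeeping: the last IPv4 endpoint the program tried to reach. */
static char last_host[INET_ADDRSTRLEN];
static int last_port = -1;

/* Same prototype as the libc connect(). When this object is preloaded,
   the program's connect() calls resolve to this function. */
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen)
{
    (void)sockfd;
    (void)addrlen;
    if (addr != NULL && addr->sa_family == AF_INET) {
        const struct sockaddr_in *in = (const struct sockaddr_in *)addr;
        inet_ntop(AF_INET, &in->sin_addr, last_host, sizeof last_host);
        last_port = ntohs(in->sin_port);
        fprintf(stderr, "fake connect to %s:%d\n", last_host, last_port);
    }
    return 0;   /* pretend success; nothing is sent */
}
```

Built with the same gcc -fPIC -shared command as testputs.so, it would be loaded through LD_PRELOAD in the same way. A real test library would also need replacements for read, write, and close, since the socket is never actually connected.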

-- 
Yoshihiro TANAKA


Re: About Automated Unit Test for Wget

2008-04-05 Thread Yoshihiro Tanaka
2008/4/4, Micah Cowan [EMAIL PROTECTED]:

  IMO, if it's worth testing, it's probably better to have external
  linkage anyway.

I got it.


1) Select the functions that can be unit tested.
   But how to select them is difficult.
   Basically, the fewer dependencies a function has, the easier it is
   to include in a unit test, but I'm not sure where to draw the line.


 This is precisely the problem, and one reason I've been thinking that
  this might not make an ideal topic for a GSoC proposal, unless you want
  to include refactoring existing functions like gethttp and http_loop
  into more logically discrete sets of functions. Essentially, to get
  better coverage of the code that needs it the most, that code will need
  to be rewritten. I believe this can be an iterative process (find one
  function to factor out, write a unit test for it, make it work...).

Yes, since I want to write a proposal on unit testing, I can't skip this
problem. But considering that the GSoC program is only two months long,
I'd rather narrow the target down to the gethttp function.

Although I don't know the source code well yet,
I'm thinking along these lines:

The gethttp function contains roughly six chunks of functionality.

1. Preparation of the request

2. Building the HTTP header
   proxy_auth
   generate_hosthead
   and the other steps that build the header

3. Connection
   persistent_available_p
   establishing the connection to the host
   ssl_connection process

4. HTTP request processing
   sending the request
   reading the response
   checking status codes

5. Local-file-related processing (a large group of steps)
   determine filename
   file existence check
   noclobber, -O check
   timestamping check
   Content-Length check
   Keep-Alive response check
   authorization handling
   Set-Cookie header
   Content-Range check
   filename handling (HTML extension)
   416 status code handling
   open local file

6. Downloading the body and writing it to the local file



So basically I think it could be divided along these lines.
After that, each chunk would be divided into smaller pieces, to the
extent that they can be unit tested separately.
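
To illustrate what one such small piece might look like, here is a sketch of Host header generation pulled out as a pure function. The name make_host_header and its parameters are hypothetical; this is not Wget's actual code, only an example of a unit with no network or file dependency.

```c
#include <stdio.h>

/* Hypothetical helper: write "Host: name[:port]\r\n" into buf.
   The port is omitted when it equals the scheme's default port.
   Returns the number of characters written, or -1 if buf is too small. */
static int make_host_header(char *buf, size_t size,
                            const char *host, int port, int default_port)
{
    int n;

    if (port == default_port)
        n = snprintf(buf, size, "Host: %s\r\n", host);
    else
        n = snprintf(buf, size, "Host: %s:%d\r\n", host, port);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Because the function only touches its arguments, a unit test can call it directly with a few host/port combinations and compare the resulting strings.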

In addition to the above, we have to think about abstracting the
network API and the file I/O API.

But the network API (such as fd_read_body and fd_read_hunk) lives in
retr.c, and the socket is opened in connect.c, so it looks like
abstracting the network API would require major modification of the
interfaces.

I don't think designing this would be appropriate for me.
So what I want to suggest is that I ask you for the interface _design_.
What do you think? At least I want to narrow the scope down to what
I can take responsibility for.


  What I'm most keenly interested in, is the ability to verify the logic
  of how follow/don't-follow is decided (that actually may not be too hard
  to write tests against now), how Wget handles various protocol-level
  situations, how it chooses the filename and deals with the local
  filesystem, etc. I will be very, _very_ happy when everything that's in
  http_loop and gethttp is verified by unit tests.

  But a lot of getting to where we can test that may mean abstracting out
  things like the Berkeley Sockets API and filesystem interactions, so
  that we can drop in fake replacements for testing.
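
One possible shape for such a drop-in replacement, sketched under assumptions: the code under test talks to a small table of function pointers instead of calling the socket functions directly, and a test supplies a fake that serves a canned response. All names here (struct transport, struct fake_server, fetch_status) are hypothetical and are not part of Wget.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical transport interface used instead of direct socket calls. */
struct transport {
    long (*send)(void *ctx, const char *buf, size_t len);
    long (*recv)(void *ctx, char *buf, size_t len);
    void *ctx;
};

/* Fake server: hands back a canned response and records the request. */
struct fake_server {
    const char *response;
    size_t offset;          /* bytes of response already delivered */
    char request[256];
    size_t request_len;
};

static long fake_send(void *ctx, const char *buf, size_t len)
{
    struct fake_server *fs = ctx;
    size_t room = sizeof fs->request - 1 - fs->request_len;
    size_t n = len < room ? len : room;

    memcpy(fs->request + fs->request_len, buf, n);
    fs->request_len += n;
    fs->request[fs->request_len] = '\0';
    return (long)len;
}

static long fake_recv(void *ctx, char *buf, size_t len)
{
    struct fake_server *fs = ctx;
    size_t left = strlen(fs->response) - fs->offset;
    size_t n = len < left ? len : left;

    memcpy(buf, fs->response + fs->offset, n);
    fs->offset += n;
    return (long)n;         /* 0 means end of stream */
}

/* Example code under test: send a request, return the HTTP status code. */
static int fetch_status(struct transport *t, const char *path)
{
    static const char tail[] = " HTTP/1.0\r\n\r\n";
    char line[64];
    int status;
    long n;

    t->send(t->ctx, "GET ", 4);
    t->send(t->ctx, path, strlen(path));
    t->send(t->ctx, tail, strlen(tail));

    n = t->recv(t->ctx, line, sizeof line - 1);
    if (n <= 0)
        return -1;
    line[n] = '\0';
    if (sscanf(line, "HTTP/%*d.%*d %d", &status) != 1)
        return -1;
    return status;
}
```

A production build would fill the same table with functions that wrap the real socket calls, so the logic in fetch_status would run unchanged in both settings.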


I'd like to try, if we can settle the interface design problem...


  I'm familiar with a framework called (simply) Check, which might be
  worth considering. It forks a new process for each test, which isolates
  it from interfering with the other tests, and also provides a clean way
  to handle things like segmentation violations or aborts. However, it's
  intended for Unix, and probably doesn't compile on other systems.

  http://check.sourceforge.net/
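
The isolation idea described above can be sketched with plain POSIX calls. This is not Check's API, only an illustration of running each test in a forked child so that a crash is reported instead of taking down the whole test run. The names run_isolated, test_passes, and test_aborts are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*test_fn)(void);

/* Run fn in a child process.
   Returns 0 if the child exited with status 0, 1 if it exited with a
   failure status, 2 if it was killed by a signal (segfault, abort),
   and -1 if fork or waitpid failed. */
static int run_isolated(test_fn fn)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        fn();
        _exit(0);           /* test returned normally */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return 2;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}

static void test_passes(void) { }
static void test_aborts(void) { abort(); }
```

Here run_isolated(test_aborts) reports the abort as a result code while the parent keeps running, which is the property that makes the fork-per-test approach attractive.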

Thank you for the information :)


-- 
Yoshihiro TANAKA


Re: About Automated Unit Test for Wget

2008-04-05 Thread Yoshihiro Tanaka
2008/4/5, Yoshihiro Tanaka [EMAIL PROTECTED]:

  But the network API (such as fd_read_body and fd_read_hunk) lives in
  retr.c, and the socket is opened in connect.c, so it looks like
  abstracting the network API would require major modification of the
  interfaces.

Or did you mean writing a Wget version of the socket interface,
i.e. our own versions of socket, connect, write, read, close, bind,
listen, accept, and so on? Sorry, I'm confused.




-- 
Yoshihiro TANAKA


About Automated Unit Test for Wget

2008-04-04 Thread Yoshihiro Tanaka
Hello, I want to ask about the future of unit testing in Wget.

Currently, unit tests exist only for the following .c files:
 -- http.c init.c main.c res.c url.c utils.c (test.c)

So, as written in the Wiki, a new unit test suite is necessary.
   (ref. http://wget.addictivecode.org/FeatureSpecifications/Testing)

To build a new unit test suite, I think the following work is necessary.

 1) Select the functions that can be unit tested.
    But how to select them is difficult.
    Basically, the fewer dependencies a function has, the easier it is
    to include in a unit test, but I'm not sure where to draw the line.

 2) To do 1) above, how about making and maintaining a list of all
    functions in Wget?

The advantages of 2) are:
* it makes clear which functions are covered by unit tests
* it makes clear which functions are _not_ covered by unit tests
* it makes testing easier to manage
* it makes the testing work easier to divide

So once this list is done, it would become easier to keep track of the
testing schedule, progress, etc. And when the unit test suite is
done, this list should be able to be generated automatically...
and to do regression testing, all we need to do is run the unit tests again :)

 3) The list would contain the following:

 * Wget version number
 * File name
 * Function name
 * Whether it is included in the unit tests
 * Simple call graph of the function
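
As a rough sketch, one row of such a list could be held in a record like the one below, with a helper that counts the functions still lacking a test. The struct layout and the names func_entry and count_untested are hypothetical, chosen only to mirror the five items above.

```c
#include <stddef.h>

/* Hypothetical record for one row of the function list. */
struct func_entry {
    const char *wget_version;
    const char *filename;
    const char *function;
    int unit_tested;        /* nonzero if a unit test covers the function */
    const char *calls;      /* simple call graph: comma-separated callees */
};

/* Count the entries that no unit test covers yet. */
static size_t count_untested(const struct func_entry *list, size_t n)
{
    size_t i, missing = 0;

    for (i = 0; i < n; i++)
        if (!list[i].unit_tested)
            missing++;
    return missing;
}
```

Comparing this count between Wget versions would give a simple measure of testing progress.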


So let me ask for your opinions.
Do you have any suggestions about unit testing Wget?
(test tools, other preliminary work for unit tests, how to manage it, ...)

Thank you for your time.

-- 
Yoshihiro TANAKA


Re: About file format for MetaDataBase

2008-03-29 Thread Yoshihiro Tanaka
 it doesn't
understand _anyway_, and any other important changes will pretty much
require a major version dump, does it actually make sense to distinguish

 
  (I meant bump.)

   an SIDB 1.0 from an SIDB 1.1?
  
   At least a minor version would help when we check the contents of an SIDB file,
   in cases like: why is this item written (or not written) here?


 That's true; but actually, using the Wget version number instead could
  be more informative in that way. We could write that information as well
   (but give it no semantic meaning: just intended for human readers).
  That way, we wouldn't have to remember to be sure to bump the SIDB
  version number every time we add a new header type (I'm not as worried
  about the major version bumps: I think we'll remember to bump for truly
  incompatible changes).



Yes, if we could do without extra information, that would be better.
I was just wondering whether it might be useful. How about a case like this?

Wget 1.12  SIDB 1.0
Wget 1.13  SIDB 1.1
Wget 1.14  SIDB 1.1
Wget 1.15  SIDB 1.1
Wget 1.16  SIDB 1.2

To me, if SIDB has a version number, it is clear which version of
Wget uses which format of SIDB.
This is just my impression, so please tell me what you think.
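
A small sketch of how a reader might treat such a version, assuming a header line of the form "SIDB Version:MAJOR.MINOR" as in the original format proposal: the major number decides whether the file can be read at all, while the minor number is only reported. The function name sidb_version_ok is hypothetical.

```c
#include <stdio.h>

/* Parse "SIDB Version:MAJOR.MINOR".
   Returns 1 if the major version matches supported_major, 0 if it does
   not, and -1 if the line is malformed. On success the minor version is
   stored in *minor_out (informational only). */
static int sidb_version_ok(const char *line, int supported_major,
                           int *minor_out)
{
    int major, minor;

    if (sscanf(line, "SIDB Version:%d.%d", &major, &minor) != 2)
        return -1;
    if (minor_out != NULL)
        *minor_out = minor;
    return major == supported_major;
}
```

Under this scheme a file marked 1.0, 1.1, or 1.2 would be accepted by any reader that supports major version 1.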

Thank you for your time.

-- 
Yoshihiro TANAKA


About file format for MetaDataBase

2008-03-27 Thread Yoshihiro Tanaka
Hello, my name is Yoshihiro TANAKA.

I'm interested in GSoC and the MetaDataBase project.

So let me ask about the file format for the MetaDataBase (SIDB).
For forward compatibility, Wget should be able to ignore items
it does not recognize. For this, Wget has to know which data belongs to
which item.
So how about CSV with | as the delimiter?

It would look like this:

---
first  line:Wget Start at MMSSMMHH-DDMM
second line:SIDB Version:1.13
third  line:Wget invocation configuration
fourth line:titleline:URL|StatusCode|Filepath|MIME-Type|..
fifth  line, and below:data lines blah|blah|blah|blah|blah|blah|...
data lines blah|blah|blah|blah|blah|blah|...
data lines blah|blah|blah|blah|blah|blah|...
data lines blah|blah|blah|blah|blah|blah|...
data lines blah|blah|blah|blah|blah|blah|...
data lines blah|blah|blah|blah|blah|blah|...
last line:Wget End at MMSSMMHH-DDMM
---

The advantages of this format are:
1. Wget can recognize the start/end of a session
2. Wget can recognize which data belongs to which item
   (It includes configuration info in the title line)
3. Wget can recognize the version of this SIDB file
   (It does not have to be the same as Wget's version)

Case 1: When an older Wget reads a newer version of the SIDB file,
it reads only the items it recognizes.

Case 2: When a newer Wget wants to use an old-version SIDB file,
it can check the version of the file and cope with it.

Case 3: When a new Wget wants to use a new-version SIDB file as an
old-version SIDB file,
the version of the SIDB file can be specified like:
# Wget -VSIDB 1.12
which means that even if the SIDB file version is 1.13, Wget treats it
as a version 1.12 file.
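
Case 1 can be sketched as a lookup by column name: the reader finds the column it knows in the title line and takes the field at the same position in a data line, so extra columns written by a newer Wget are simply never asked for. The function names below are hypothetical, and the sketch assumes field values never contain the | character.

```c
#include <string.h>

/* Position of the column called name in a '|'-separated title line,
   or -1 if the title line has no such column. */
static int column_index(const char *title, const char *name)
{
    size_t nlen = strlen(name);
    const char *p = title;
    int idx = 0;

    for (;;) {
        const char *end = strchr(p, '|');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len == nlen && strncmp(p, name, nlen) == 0)
            return idx;
        if (end == NULL)
            return -1;
        p = end + 1;
        idx++;
    }
}

/* Copy the idx-th field of line into out; -1 if missing or too long. */
static int field_at(const char *line, int idx, char *out, size_t size)
{
    const char *start = line;
    const char *end;
    size_t len;

    while (idx-- > 0) {
        start = strchr(start, '|');
        if (start == NULL)
            return -1;
        start++;
    }
    end = strchr(start, '|');
    len = end ? (size_t)(end - start) : strlen(start);
    if (len >= size)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}

/* Look up one named item of a data line via the title line. */
static int sidb_get(const char *title, const char *data,
                    const char *name, char *out, size_t size)
{
    int idx = column_index(title, name);

    return idx < 0 ? -1 : field_at(data, idx, out, size);
}
```

Splitting by hand with strchr rather than strtok keeps empty fields in place, which matters when an item has no value for a given URL.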


So please comment on this file format.
Thank you for your time.
-- 
Yoshihiro TANAKA
SFSU CS Department