Re: About file format for MetaDataBase

2008-03-29 Thread Yoshihiro Tanaka
2008/3/28, Micah Cowan [EMAIL PROTECTED]:

 It's not a problem so long as the data is clearly associated with its file.

  The sample file I gave in the previous post has a demonstration of this;
  logo.png was being downloaded while index.html was still being fetched.
  If more information had been available on index.html, it could be
  written out with the appropriate CONTINUE directive preceding it.

  It's not clear to me that that's the best way to deal with it; it could
  be that associating an identifier with each URI, and then using that id
  with each line, could be a good alternative as well.

  Or, perhaps we should keep the block-oriented format (most information
  will be available at the start, in the headers and whatnot), and use ids
  for lines that indicate final status.

I prefer the block-oriented format. I want the information about one file to
come in a cluster, because that will be more readable and easier to process.



   Yes, and about this part I want to know how Wget should treat the SIDB file.
   For example, I want to define cases like the one below:
   - When there is already an SIDB file, is this file modified/appended/rewritten
     when Wget is invoked next time?


 By default, it should probably use a new, separate file. Exceptions
  would be when you specifically ask it to operate on an existing session
  db file. Continuing an aborted session, etc, should use the same session
  db it's continuing from.

I got it, thank you.


 Case 3: When a new Wget wants to use a new-version SIDB file as an old-
 version SIDB file,
 it can specify the version of the SIDB file like:
 # Wget -VSIDB 1.12
 which means that even if the SIDB file version is 1.13, Wget treats it as
 a version 1.12 file.
  
  
   This may be a good idea, but I'm not sure it will be necessary (of
course, it will be easy to add if it looks like it's useful).
  
   Yes, maybe no need.


 Well, when we get to new major numbers, at any rate, it'll almost
  certainly be useful; I should've been more specific that I wasn't sure
  about the minors.


It might be a good idea to include a mechanism for specifying that
certain headers must _not_ be ignored, and that if a particular version
of Wget does not understand them, it should fail out. I'm having some
trouble coming up with a case where we would actually need this, but it
really doesn't hurt to build it in just in case.
  
   Yes, but if Wget does not understand certain mandatory headers, it does not
   know that it should fail out. So Wget should fail out if it cannot find
   certain mandatory headers. Does that make sense?


 Right: that's why the mechanism needs to be in place from the beginning,
  so that even though they're new headers, Wget can understand that it
  should not attempt to use the file if it can't understand these.

  It could be something as easy as a naming convention, or header lines
  beginning with a !, etc.

  OTOH, maybe it doesn't really buy us anything over simply bumping the
  major number... it was just an idea.

Maybe a simple specification would be enough, like:
Wget checks the major number -- if it is within the acceptable range, keep
going, and just ignore what it does not understand.
Wget checks the major number -- if it is not within the acceptable range,
fail out.



 WGET SIDB 1.1   # different version of Wget, understands a little
 # more, might write new kinds of info.
 TIME 2008-03-28T00:53:07
 CONTINUE RESOURCE http://foo.com/main/
   X-Wget-Current-Length: 57256 # size of current file on disk
   X-Wget-Status: ENETUNREACH
 END RESOURCE http://foo.com/main/
 END SESSION # Indicates Wget at least terminated normally
  
 WGET SIDB 1.1
 TIME 2008-03-28T11:15:27
 CONTINUE RESOURCE http://foo.com/main/
   X-Wget-Current-Length: 57256
   X-Wget-HTTP-Status: 206 Partial Content
   Content-Length: 20 # Length of the response
   X-Wget-Resource-Size: 257256 # Length of the file
   X-Wget-Status: success
 END RESOURCE http://foo.com/main/
 END SESSION  # All is well.
  
   This is interim information which indicates Wget downloaded _part_ of the
   file. I'm not sure if this part is necessary, because I was thinking Wget
   writes into the SIDB only information about _downloaded_ files.


 No, not interim information; but you may be right that information about
  the partial content (namely, the Content-Length header) isn't really all
  that useful.

  The 206 Partial Content bit is actually meant to reflect that Wget,
  knowing that it had the first ~56k, asked the server for just the rest
  (partial content).
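
As a small illustration of the resume behaviour described above: asking the server for "just the rest" corresponds to an HTTP Range request that starts at the number of bytes already on disk. This is only a sketch; the function name is hypothetical and not part of Wget.

```python
def range_header_for_resume(current_length):
    """Build the value of an HTTP Range header that requests everything
    after the bytes already saved locally.

    current_length plays the role of X-Wget-Current-Length in the sample
    (hypothetical helper, for illustration only).
    """
    if current_length < 0:
        raise ValueError("current_length must be non-negative")
    # "bytes=N-" means: from byte offset N through the end of the resource.
    return "bytes=%d-" % current_length


if __name__ == "__main__":
    print("Range:", range_header_for_resume(57256))
```

A server that honours the range answers with 206 Partial Content, which is what the X-Wget-HTTP-Status line in the sample records.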

Oh, I got it.





It's not clear to me that we actually _need_ the minor number as part of
the SIDB format version. The minor number is useful in HTTP, mainly to
negotiate between two different programs which version will be used for
communication. But, since Wget will ignore the headers 

Re: About file format for MetaDataBase

2008-03-29 Thread Micah Cowan

Yoshihiro Tanaka wrote:
 Yes, if we could do without more information, it would be better.
 I was just wondering whether it might be useful. How about a case like this?:
 
 Wget 1.12  SIDB 1.0
 Wget 1.13  SIDB 1.1
 Wget 1.14  SIDB 1.1
 Wget 1.15  SIDB 1.1
 Wget 1.16  SIDB 1.2
 
 For me, if the SIDB has a version number, it is clear which version of
 Wget uses which format of SIDB.

Well, the Wget version should probably be included anyway, particularly
if some *ahem* unintended changes to the format were made in some version.

However, I think I've come up with some cases where the minor number for
the database could be useful. Instead of bumping it for new types of
information, we can bump it for actual structural changes, that are
designed so that older versions of Wget can still read the file,
ignoring the unknown structure.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: About file format for MetaDataBase

2008-03-28 Thread Micah Cowan

Yoshihiro Tanaka wrote:
  Also, this format essentially requires that all data about a particular
  entry be known before any of it may be written. I think it would be
  useful to write some information, e.g. Filepath and MIME-Type (and other
   HTTP headers), as soon as it's known. If Wget is killed in the middle
 
 If it is _as soon as_, I was just wondering about the case where Wget
 downloads multiple files in parallel: doesn't that mean the information
 from these files might get mixed together? Is this not a problem?

It's not a problem so long as the data is clearly associated with its file.

The sample file I gave in the previous post has a demonstration of this;
logo.png was being downloaded while index.html was still being fetched.
If more information had been available on index.html, it could be
written out with the appropriate CONTINUE directive preceding it.

It's not clear to me that that's the best way to deal with it; it could
be that associating an identifier with each URI, and then using that id
with each line, could be a good alternative as well.

Or, perhaps we should keep the block-oriented format (most information
will be available at the start, in the headers and whatnot), and use ids
for lines that indicate final status.
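
To make the identifier alternative concrete, here is a rough sketch of how interleaved lines could be regrouped per URI. The line syntax is invented purely for illustration ("ID <n> <uri>" declares an identifier, and "<n> Header: value" attaches a line to it); nothing in this thread fixes such a syntax.

```python
def group_by_id(lines):
    """Regroup interleaved, id-tagged lines into one dict per URI.

    Hypothetical syntax, assumed only for this sketch:
      ID <n> <uri>          declares identifier <n> for <uri>
      <n> <Header>: <value> records a header for that identifier
    """
    uris = {}      # identifier -> URI
    records = {}   # URI -> {header name: value}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("ID "):
            _, ident, uri = line.split(None, 2)
            uris[ident] = uri
            records[uri] = {}
        else:
            ident, rest = line.split(None, 1)
            name, _, value = rest.partition(":")
            records[uris[ident]][name.strip()] = value.strip()
    return records


if __name__ == "__main__":
    sample = [
        "ID 1 http://foo.com/main/",
        "ID 2 http://foo.com/images/logo.png",
        "1 Content-Type: text/html",
        "2 Content-Type: image/png",
        "2 X-Wget-Status: success",
        "1 X-Wget-Status: success",
    ]
    for uri, headers in group_by_id(sample).items():
        print(uri, headers)
```

The trade-off is visible here: every line is self-describing, so parallel downloads can write in any order, but a human reader no longer sees one resource's information in a single block.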

...

 Yes, and about this part I want to know how Wget should treat the SIDB file.
 For example, I want to define cases like the one below:
 - When there is already an SIDB file, is this file modified/appended/rewritten
   when Wget is invoked next time?

By default, it should probably use a new, separate file. Exceptions
would be when you specifically ask it to operate on an existing session
db file. Continuing an aborted session, etc, should use the same session
db it's continuing from.

   Case 3: When a new Wget wants to use a new-version SIDB file as an old-
   version SIDB file,
   it can specify the version of the SIDB file like:
   # Wget -VSIDB 1.12
   which means that even if the SIDB file version is 1.13, Wget treats it as
   a version 1.12 file.


 This may be a good idea, but I'm not sure it will be necessary (of
  course, it will be easy to add if it looks like it's useful).
 
 Yes, maybe no need.

Well, when we get to new major numbers, at any rate, it'll almost
certainly be useful; I should've been more specific that I wasn't sure
about the minors.

  It might be a good idea to include a mechanism for specifying that
  certain headers must _not_ be ignored, and that if a particular version
  of Wget does not understand them, it should fail out. I'm having some
  trouble coming up with a case where we would actually need this, but it
  really doesn't hurt to build it in just in case.
 
 Yes, but if Wget does not understand certain mandatory headers, it does not
 know that it should fail out. So Wget should fail out if it cannot find
 certain mandatory headers. Does that make sense?

Right: that's why the mechanism needs to be in place from the beginning,
so that even though they're new headers, Wget can understand that it
should not attempt to use the file if it can't understand these.

It could be something as easy as a naming convention, or header lines
beginning with a !, etc.

OTOH, maybe it doesn't really buy us anything over simply bumping the
major number... it was just an idea.
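
A minimal sketch of the "header lines beginning with a !" idea, assuming (hypothetically) that a leading "!" marks a header a reader must understand. The exception name, function name and header names are illustrative only.

```python
class UnsupportedHeader(Exception):
    """Raised when a must-understand header is not known to this reader."""


def read_headers(lines, known):
    """Collect known headers; ignore unknown ones unless marked with '!'.

    `known` is a set of lower-case header names this (hypothetical)
    reader understands.
    """
    result = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        mandatory = line.startswith("!")
        if mandatory:
            line = line[1:]
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line
        name = name.strip()
        if name.lower() in known:
            result[name] = value.strip()
        elif mandatory:
            # The writer said this header must not be ignored.
            raise UnsupportedHeader(name)
    return result


if __name__ == "__main__":
    known = {"content-type", "x-wget-status"}
    print(read_headers(["Content-Type: text/html", "X-Future: 1"], known))
    try:
        read_headers(["!X-Future: 1"], known)
    except UnsupportedHeader as exc:
        print("refusing to use file, unknown mandatory header:", exc)
```

Because the "!" convention itself is understood from the first release, an old reader can refuse a file on account of a header that did not exist when it was written.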

...
   # If the above didn't match Content-Length, that would indicate that
   # the connection had been prematurely closed (or that the server
   # lied).
     X-Wget-Status: success
   END RESOURCE http://foo.com/images/logo.png

   # !!! Wget was killed here, index.html still not done downloading.

   # New Wget invocation, continuing the session:
 
 Here, is Wget writing into another file?

Yes.

   WGET SIDB 1.1   # different version of Wget, understands a little
   # more, might write new kinds of info.
   TIME 2008-03-28T00:53:07
   CONTINUE RESOURCE http://foo.com/main/
     X-Wget-Current-Length: 57256 # size of current file on disk
     X-Wget-Status: ENETUNREACH
   END RESOURCE http://foo.com/main/
   END SESSION # Indicates Wget at least terminated normally

   WGET SIDB 1.1
   TIME 2008-03-28T11:15:27
   CONTINUE RESOURCE http://foo.com/main/
     X-Wget-Current-Length: 57256
     X-Wget-HTTP-Status: 206 Partial Content
     Content-Length: 20 # Length of the response
     X-Wget-Resource-Size: 257256 # Length of the file
     X-Wget-Status: success
   END RESOURCE http://foo.com/main/
   END SESSION  # All is well.
 
 This is interim information which indicates Wget downloaded _part_ of the
 file. I'm not sure if this part is necessary, because I was thinking Wget
 writes into the SIDB only information about _downloaded_ files.

No, not interim information; but you may be right that information about
the partial content (namely, the Content-Length header) isn't really all
that useful.

The 206 Partial Content bit is actually meant to reflect that Wget,
knowing that it had the first ~56k, asked the server for just the rest
(partial 

Re: About file format for MetaDataBase

2008-03-27 Thread Micah Cowan

Yoshihiro Tanaka wrote:
 Hello, my name is Yoshihiro TANAKA.
 
 I'm interested in GSOC, and the MetaDataBase project.
 
 So let me ask about the file format for the MetaDataBase (SIDB).
 Considering forwards-compatibility, Wget should be able to ignore items
 it does not recognize. For this, Wget has to know which data belongs to
 which item.
 So how about csv, with delimiter | ?
 
 It would look like below.
 
 -
 first  line:Wget Start at MMSSMMHH-DDMM
 second line:SIDB Version:1.13
 third  line:Wget invocation configuration
 fourth line:titleline:URL|StatusCode|Filepath|MIME-Type|..
 fifth  line, and below:data lines bra|bra|bra|bra|bra|bra|...
 data lines bra|bra|bra|bra|bra|bra|...
 data lines bra|bra|bra|bra|bra|bra|...
 data lines bra|bra|bra|bra|bra|bra|...
 data lines bra|bra|bra|bra|bra|bra|...
 data lines bra|bra|bra|bra|bra|bra|...
 last line:Wget End at MMSSMMHH-DDMM
 ---

I'm not crazy about it. Putting so many different values on one line
hampers readability/editability, in my opinion. Also, some of the values
may not be required for all resources (in particular, if StatusCode
indicates a 404 failure or somesuch, Filepath will probably be
irrelevant, etc).

Also, if possible, it'd be nice for a newer version of Wget to just come
along, and continue the session (or append to it), including the newer
data it knows about, and still have a readable file for the older Wget.

Also, this format essentially requires that all data about a particular
entry be known before any of it may be written. I think it would be
useful to write some information, e.g. Filepath and MIME-Type (and other
 HTTP headers), as soon as it's known. If Wget is killed in the middle
of a file transfer, it won't have had the StatusCode available yet, and
so wouldn't have written Filepath and MIME-Type (or even URL)
information to the file yet. This makes it harder to see what Wget was
doing when it was interrupted.

 
 The advantage of this format is:
 1. Wget can recognize start/end of session

This is useful. In particular, it's useful to see when a session did not
have an explicit end, suggesting that it had not finished.
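
For instance, spotting sessions that never reached an explicit end could look roughly like this. The sketch borrows the "WGET SIDB" and "END SESSION" directives from the sample session files elsewhere in this thread; the function name is hypothetical.

```python
def unfinished_sessions(lines):
    """Return the 0-based indexes of sessions that lack END SESSION.

    A session is assumed to start at a 'WGET SIDB ...' line and to be
    closed by an 'END SESSION' line, as in the sample files in this thread.
    """
    unfinished = []
    open_index = None   # index of the session currently being read
    index = -1
    for raw in lines:
        line = raw.strip()
        if line.startswith("WGET SIDB"):
            if open_index is not None:
                # A new session began before the previous one ended.
                unfinished.append(open_index)
            index += 1
            open_index = index
        elif line.startswith("END SESSION"):
            open_index = None
    if open_index is not None:
        unfinished.append(open_index)
    return unfinished


if __name__ == "__main__":
    sample = [
        "WGET SIDB 1.0",
        "BEGIN RESOURCE http://foo.com/main/",
        "WGET SIDB 1.1",
        "CONTINUE RESOURCE http://foo.com/main/",
        "END RESOURCE http://foo.com/main/",
        "END SESSION",
    ]
    print(unfinished_sessions(sample))
```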

 2. Wget can recognize which data belongs to which item
(It includes configuration info in the title line)
 3. Wget can recognize the version of this SIDB file
(It does not have to be the same as that of Wget)

Also agreed, here. I'd be in favor of adopting an HTTP-like convention
for the version name, where a higher minor number but same major number
indicates that the older Wget should be capable of reading it, but the
file may contain information that will not be understood. A higher
_major_ number means that versions of Wget that do not understand that
SIDB major number, should not attempt to use the file in any way.
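
A rough sketch of that version rule, assuming a first line of the form "WGET SIDB <major>.<minor>" as in the samples in this thread. The constants and the three result strings are made up for illustration, and rejecting every major number other than the reader's own (lower ones included) is my assumption, since only the higher case is specified above.

```python
import re

# Assumption: this hypothetical reader implements SIDB 1.0.
SUPPORTED_MAJOR = 1
SUPPORTED_MINOR = 0


def parse_version_line(line):
    """Parse 'WGET SIDB <major>.<minor>' (optional trailing # comment)."""
    text = line.split("#", 1)[0].strip()
    match = re.fullmatch(r"WGET SIDB (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("not a SIDB version line: %r" % line)
    return int(match.group(1)), int(match.group(2))


def check_version(line):
    """Decide how to treat a file, HTTP-style.

    Returns 'reject' for an unknown major number,
    'read-ignoring-unknown' for a higher minor number, else 'read'.
    """
    major, minor = parse_version_line(line)
    if major != SUPPORTED_MAJOR:
        return "reject"
    if minor > SUPPORTED_MINOR:
        return "read-ignoring-unknown"
    return "read"


if __name__ == "__main__":
    for first_line in ("WGET SIDB 1.0", "WGET SIDB 1.1", "WGET SIDB 2.0"):
        print(first_line, "->", check_version(first_line))
```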

 Case 1: When an older Wget reads a newer version of the SIDB file,
 it can only read items which it recognizes.
 
 Case 2: When a newer Wget wants to use an old-version SIDB file,
 it can check the version of the file, and cope with it.

Yes; however, with the CSV format, it would be difficult for a newer
wget to take advantage of the newer data it knows how to write, as this
would require modification of the title (and all data) lines.

 Case 3: When a new Wget wants to use a new-version SIDB file as an old-
 version SIDB file,
 it can specify the version of the SIDB file like:
 # Wget -VSIDB 1.12
 which means that even if the SIDB file version is 1.13, Wget treats it as
 a version 1.12 file.

This may be a good idea, but I'm not sure it will be necessary (of
course, it will be easy to add if it looks like it's useful).

I think HTTP's header mechanism actually makes a pretty good model: we
can place data one-per-line, and versions of Wget that don't understand
specific headers can simply ignore them.
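
As a sketch of what "simply ignore them" could mean for a reader: split each line at the first colon and keep only the names it knows. The set of known names below is illustrative (taken from the header names used in this thread's samples), not a definitive list.

```python
# Assumption: header names this hypothetical reader understands.
KNOWN_HEADERS = {"x-wget-http-status", "x-wget-file-path", "content-type"}


def parse_headers(lines):
    """Return the known 'Name: value' headers, skipping everything else."""
    known = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        name = name.strip()
        # Header names are compared case-insensitively, as in HTTP.
        if name.lower() in KNOWN_HEADERS:
            known[name] = value.strip()
    return known


if __name__ == "__main__":
    block = [
        "X-Wget-HTTP-Status: 200 OK",
        "X-Wget-Some-Future-Header: whatever",
        "Content-Type: text/html; charset=UTF-8",
    ]
    print(parse_headers(block))
```

An older reader built this way keeps working when a newer writer adds headers, which is the forward-compatibility property being argued for.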

It might be a good idea to include a mechanism for specifying that
certain headers must _not_ be ignored, and that if a particular version
of Wget does not understand them, it should fail out. I'm having some
trouble coming up with a case where we would actually need this, but it
really doesn't hurt to build it in just in case.

One idea might be to actually _use_ HTTP headers for our entries. Then
we don't even have to write it specially: we can just copy the headers
out verbatim (possibly translating CRLF to LF). Example:

  WGET SIDB 1.0
  TIME 2008-03-27T12:06:50
  BEGIN CONFIG
  # Information about Wget invocation settings goes here.
  END CONFIG
  REDIRECT http://foo.com/ - http://foo.com/main/

  BEGIN RESOURCE http://foo.com/main/ # Wget got a non-redirect response
    X-Wget-HTTP-Status: 200 OK
    X-Wget-File-Path: foo.com/main/index.html
    Content-Type: text/html; charset=UTF-8
  TIME 2008-03-27T12:07:20
  BEGIN RESOURCE http://foo.com/images/logo.png
  # A multi-connection Wget begins downloading an