Re: About file format for MetaDataBase

Micah Cowan Thu, 27 Mar 2008 12:47:43 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Yoshihiro Tanaka wrote:
> Hello, My name is Yoshihiro TANAKA.
> 
> I'm interested in GSOC, and MetaDataBase project.
> 
> So let me ask about file format for MetaDataBase(SIDB).
> Considering forwards-compatibility, Wget should be able to ignore items
> it does not recognize. For this, Wget has to know which data belongs to
> which item.
> So how about csv, with delimiter "|" ?
> 
> It would look like below.
> 
> <-----------------------------------------------------
> first  line:Wget Start at MMSSMMHH-DDMMYYYY
> second line:SIDB Version:1.13
> third  line:Wget invocation configuration
> fourth line:titleline:URL|StatusCode|Filepath|MIME-Type|......
> fifth  line, and below:data lines bra|bra|bra|bra|bra|bra|...
>         data lines bra|bra|bra|bra|bra|bra|...
>         data lines bra|bra|bra|bra|bra|bra|...
>         data lines bra|bra|bra|bra|bra|bra|...
>         data lines bra|bra|bra|bra|bra|bra|...
>         data lines bra|bra|bra|bra|bra|bra|...
> last line:Wget End at MMSSMMHH-DDMMYYYY
> ------------------------------------------------------->


I'm not crazy about it. Putting so many different values on one line
hampers readability/editability, in my opinion. Also, some of the values
may not be required for all resources (in particular, if StatusCode
indicates a 404 failure or somesuch, Filepath will probably be
irrelevant, etc).

Also, if possible, it'd be nice for a newer version of Wget to just come
along, and continue the session (or append to it), including the newer
data it knows about, and still have a readable file for the older Wget.

Also, this format essentially requires that all data about a particular
entry be known before any of it may be written. I think it would be
useful to write some information, e.g. Filepath and MIME-Type (and other
HTTP headers), as soon as it's known. If Wget is killed in the middle
of a file transfer, it won't have had the StatusCode available yet, and
so wouldn't have written Filepath and MIME-Type (or even URL)
information to the file yet. This makes it harder to see what Wget was
doing when it was interrupted.

> 
> The advantage of this format is:
> 1. Wget can recognize start/end of session

This is useful. In particular, it's useful to see when a session did not
have an explicit end, suggesting that it had not finished.

> 2. Wget can recognize which data belongs to which item
>    (It includes configuration info in title line)
> 3. Wget can recognize the version of this SIDB file
>    (It does not have to be same to that of Wget)

Also agreed, here. I'd be in favor of adopting an HTTP-like convention
for the version name, where a higher minor number but same major number
indicates that the older Wget should be capable of reading it, but the
file may contain information that will not be understood. A higher
_major_ number means that versions of Wget that do not understand that
SIDB major number, should not attempt to use the file in any way.
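A minimal sketch of that rule, in Python purely for illustration. The
function names and the three result strings are made up for this example,
and treating a _lower_ major number as unusable too is an assumption; the
rule above only speaks about higher ones.

```python
def parse_version(text):
    """Split a version string such as "1.13" into (major, minor) integers."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def check_version(file_version, reader_version):
    """Decide how a reader should treat an SIDB file of the given version.

    Returns one of three illustrative strings:
      "refuse"                - different major number, do not use the file
      "read-ignoring-unknown" - same major, newer minor: readable, but the
                                file may hold items the reader does not know
      "read"                  - same major, same or older minor
    """
    file_major, file_minor = parse_version(file_version)
    reader_major, reader_minor = parse_version(reader_version)
    if file_major != reader_major:
        # Assumption: any major mismatch is treated as unusable.
        return "refuse"
    if file_minor > reader_minor:
        return "read-ignoring-unknown"
    return "read"
```

For example, a Wget that writes SIDB 1.12 would get
"read-ignoring-unknown" for a 1.13 file and "refuse" for a 2.0 file.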

> Case 1: When Older Wget reads newer version of SIDB file,
>         it can only read items which it recognizes.
> 
> Case 2: When Newer Wget wants to use old version SIDB file,
>         it can check Version of file, and cope with it.

Yes; however, with the CSV format, it would be difficult for a newer
wget to take advantage of the newer data it knows how to write, as this
would require modification of the title (and all data) lines.

> Case 3: When New Wget wants to use new version SIDB file as Old
> version SIDB file,
>         it can specify version of SIDB file like:
>         # Wget -VSIDB 1.12
>         which means even SIDB file version is 1.13, Wget treat it as
> version 1.12 file.

This may be a good idea, but I'm not sure it will be necessary (of
course, it will be easy to add if it looks like it's useful).

I think HTTP's header mechanism actually makes a pretty good model: we
can place data one-per-line, and versions of Wget that don't understand
specific "headers" can simply ignore them.

It might be a good idea to include a mechanism for specifying that
certain headers must _not_ be ignored, and that if a particular version
of Wget does not understand them, it should fail out. I'm having some
trouble coming up with a case where we would actually need this, but it
really doesn't hurt to build it in just in case.
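As a rough sketch of how a reader might treat such lines (Python, for
illustration only): unknown headers are skipped, and a header flagged as
mandatory makes the reader give up. The "!" prefix used as the
must-not-ignore flag, the header names in KNOWN_HEADERS, and the exception
name are all invented for this example; nothing here is a settled syntax.

```python
# Hypothetical set of header names this reader understands.
KNOWN_HEADERS = {"File-Path", "Content-Type"}


class UnsupportedHeaderError(Exception):
    """Raised when a must-not-ignore header is not understood."""


def read_headers(lines, known=KNOWN_HEADERS):
    """Return a dict of the understood "Name: value" lines.

    Unknown headers are ignored, unless their name starts with "!"
    (an invented marker meaning "must not be ignored").
    """
    understood = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line at all
        name = name.strip()
        mandatory = name.startswith("!")
        if mandatory:
            name = name[1:]
        if name in known:
            understood[name] = value.strip()
        elif mandatory:
            raise UnsupportedHeaderError(name)
        # Otherwise: an optional header we don't know; skip it.
    return understood
```

An older reader given "File-Path: a/b" plus an unknown "Future-Thing: x"
would keep the first and drop the second; given "!Future-Thing: x" it
would fail out instead.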

One idea might be to actually _use_ HTTP headers for our entries. Then
we don't even have to write it specially: we can just copy the headers
out verbatim (possibly translating CRLF to LF). Example:

  WGET SIDB 1.0
  TIME 2008-03-27T12:06:50
  BEGIN CONFIG
  # Information about Wget invocation settings goes here.
  END CONFIG
  REDIRECT http://foo.com/ -> http://foo.com/main/

  BEGIN RESOURCE http://foo.com/main/ # Wget got a non-redirect response
    X-Wget-HTTP-Status: 200 OK
    X-Wget-File-Path: foo.com/main/index.html
    Content-Type: text/html; charset=UTF-8
  TIME 2008-03-27T12:07:20
  BEGIN RESOURCE http://foo.com/images/logo.png
  # A multi-connection Wget begins downloading an image
  # while the first page is still being downloaded
    X-Wget-HTTP-Status: 200 OK
    X-Wget-File-Path: foo.com/images/logo.png
    Content-Type: image/png
    Content-Length: 202024
  TIME 2008-03-27T12:09:50 # slow connection? :)
    X-Wget-Resource-Size: 202024
  # If the above didn't match Content-Length, that would indicate that
  # the connection had been prematurely closed (or that the server
  # lied).
    X-Wget-Status: success
  END RESOURCE http://foo.com/images/logo.png

  # !!! Wget was killed here, index.html still not done downloading.

  # New Wget invocation, continuing the session:
  WGET SIDB 1.1   # different version of Wget, understands a little
                  # more, might write new kinds of info.
  TIME 2008-03-28T00:53:07
  CONTINUE RESOURCE http://foo.com/main/
    X-Wget-Current-Length: 57256 # size of current file on disk
    X-Wget-Status: ENETUNREACH
  END RESOURCE http://foo.com/main/
  END SESSION # Indicates Wget at least terminated normally

  WGET SIDB 1.1
  TIME 2008-03-28T11:15:27
  CONTINUE RESOURCE http://foo.com/main/
    X-Wget-Current-Length: 57256
    X-Wget-HTTP-Status: 206 Partial Content
    Content-Length: 200000 # Length of the response
    X-Wget-Resource-Size: 257256 # Length of the file
    X-Wget-Status: success
  END RESOURCE http://foo.com/main/
  END SESSION  # All is well.

The major inconvenience here is that there will be an _awful_ lot of
headers starting with "X-Wget-", which is a little jarring to read.
Also, I'm not sure I'd actually want all-caps for the various directives
(it's just an example).

It might actually make more sense to use a prefix to indicate which ones
_are_ real HTTP headers, rather than which ones aren't. That is:

    X-Wget-HTTP-Status: 200 OK
    X-Wget-File-Path: foo.com/images/logo.png
    Content-Type: image/png

would become:

    Response-Status: 200 OK
    File-Path: foo.com/images/logo.png
    HTTP-Content-Type: image/png

that looks much more pleasant, to me.
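The renaming can be sketched mechanically (Python, illustration only).
The general rule assumed here is: drop "X-Wget-" from Wget's own entries
and add "HTTP-" to everything else. "X-Wget-HTTP-Status" is handled as an
explicit special case, since simply dropping its prefix would leave a name
that itself starts with "HTTP-". The function and table names are
hypothetical.

```python
# Explicit renames that do not follow the general prefix rule.
RENAMES = {"X-Wget-HTTP-Status": "Response-Status"}

WGET_PREFIX = "X-Wget-"
HTTP_PREFIX = "HTTP-"


def convert_name(name):
    """Map a name from the "X-Wget-" scheme to the "HTTP-" scheme."""
    if name in RENAMES:
        return RENAMES[name]
    if name.startswith(WGET_PREFIX):
        # Wget's own entry: the prefix is no longer needed.
        return name[len(WGET_PREFIX):]
    # Anything else is taken to be a real HTTP header.
    return HTTP_PREFIX + name
```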

Note that, since the session info db is transmitted in a linear fashion,
printing information as it's available, it's essentially a
program-parse-able logfile. If we augment it with periodic updates as to
how much we've downloaded of each file, it will be quite suitable for
use by GUI wrapper programs and the like.
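To illustrate the "program-parse-able" point, here is a small sketch
(Python, illustration only) that scans a log written in the all-caps
example syntax above and reports, per Wget invocation, which resources
were left unfinished and whether an END SESSION was seen. It assumes
comments start with "#" at the beginning of a line or after whitespace,
and it ignores every line it does not recognize; the function name and
the dictionary keys are made up for the example.

```python
import re


def scan_sessions(text):
    """Return one dict per "WGET SIDB <version>" line found in the log.

    Each dict has: "version" (str), "open" (list of resource URLs begun or
    continued but never ended), and "ended" (True if END SESSION was seen).
    """
    sessions = []
    current = None
    for raw in text.splitlines():
        # Strip comments: "#" at line start or preceded by whitespace.
        line = re.split(r"(?:^|\s)#", raw, maxsplit=1)[0].strip()
        if not line:
            continue
        words = line.split()
        if words[:2] == ["WGET", "SIDB"] and len(words) >= 3:
            current = {"version": words[2], "open": [], "ended": False}
            sessions.append(current)
        elif current is None:
            continue  # data before any session line; skip it
        elif words[:2] == ["END", "SESSION"]:
            current["ended"] = True
        elif len(words) >= 3 and words[1] == "RESOURCE":
            url = words[2]
            if words[0] in ("BEGIN", "CONTINUE"):
                if url not in current["open"]:
                    current["open"].append(url)
            elif words[0] == "END" and url in current["open"]:
                current["open"].remove(url)
        # Headers, TIME, REDIRECT, CONFIG etc. are ignored here.
    return sessions
```

Run over the example log, the first invocation would come back with
http://foo.com/main/ still open and no END SESSION, which is exactly the
"Wget was killed here" case a wrapper program would want to notice.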

It's not clear to me that we actually _need_ the minor number as part of
the SIDB format version. The minor number is useful in HTTP, mainly to
negotiate between two different programs which version will be used for
communication. But, since Wget will ignore the headers it doesn't
understand _anyway_, and any other important changes will pretty much
require a major version bump, does it actually make sense to distinguish
an SIDB 1.0 from an SIDB 1.1?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6/m97M8hyUobTrERAsoKAJsFtWXYcAHj4LacWU0KiMs0bJoAzQCbBFWX
WWWASTV+Lp4L4xR+ON+P+FQ=
=TY+C
-----END PGP SIGNATURE-----
