Re: About file format for MetaDataBase

Micah Cowan Fri, 28 Mar 2008 02:12:29 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Yoshihiro Tanaka wrote:
>>  Also, this format essentially requires that all data about a particular
>>  entry be known before any of it may be written. I think it would be
>>  useful to write some information, e.g. Filepath and MIME-Type (and other
>>   HTTP headers), as soon as it's known. If Wget is killed in the middle
> 
> If it is _as soon as_, I'm just wondering about the case where Wget
> downloads multiple files in parallel: doesn't that mean the information
> might get mixed up between these files? Is this not a problem?


It's not a problem so long as the data is clearly associated with its file.

The sample file I gave in the previous post has a demonstration of this;
logo.png was being downloaded while index.html was still being fetched.
If more information had been available on index.html, it could be
written out with the appropriate "CONTINUE" directive preceding it.

It's not clear to me that that's the best way to deal with it; it could
be that associating an identifier with each URI, and then using that id
with each line, could be a good alternative as well.

Or, perhaps we should keep the block-oriented format (most information
will be available at the start, in the headers and whatnot), and use ids
for lines that indicate final status.
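To make the id alternative concrete, here is a minimal sketch in Python. Everything about the syntax is an assumption for illustration only: the "ID <n> <uri>" introduction line, the numeric prefix on each header line, and the helper names write_tagged and read_tagged are hypothetical and not part of any agreed format.

```python
import io

def write_tagged(out, ids, uri, header, value):
    """Write one header line tagged with a short id for its URI.

    The first time a URI is seen, a hypothetical "ID <n> <uri>" line
    introduces the id; later lines carry only the id.
    """
    if uri not in ids:
        ids[uri] = len(ids) + 1
        out.write(f"ID {ids[uri]} {uri}\n")
    out.write(f"{ids[uri]} {header}: {value}\n")

def read_tagged(lines):
    """Group tagged header lines back into a dict keyed by URI."""
    uris, records = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("ID "):
            _, num, uri = line.split(" ", 2)
            uris[num] = uri
            records.setdefault(uri, [])
        else:
            num, rest = line.split(" ", 1)
            header, value = rest.split(": ", 1)
            records[uris[num]].append((header, value))
    return records

def demo():
    """Interleave lines for two resources, then read them back."""
    out, ids = io.StringIO(), {}
    main, logo = "http://foo.com/main/", "http://foo.com/images/logo.png"
    write_tagged(out, ids, main, "Content-Type", "text/html")
    write_tagged(out, ids, logo, "Content-Type", "image/png")
    write_tagged(out, ids, logo, "X-Wget-Status", "success")
    write_tagged(out, ids, main, "X-Wget-Status", "ENETUNREACH")
    return read_tagged(out.getvalue().splitlines())
```

Because every line names its resource, lines from parallel downloads can interleave freely and still be regrouped on reading, at the cost of a slightly noisier file than the block-oriented layout.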

...

> Yes, and about this part I want to know how Wget should treat the SIDB file.
> For example, I want to define a case like the one below:
> - When there is already an SIDB file, is this file modified/appended/rewritten
>   the next time Wget is invoked?

By default, it should probably use a new, separate file. Exceptions
would be when you specifically ask it to operate on an existing session
db file. Continuing an aborted session, etc, should use the same session
db it's continuing from.

>>  > Case 3: When New Wget wants to use new version SIDB file as Old
>>  >         version SIDB file, it can specify version of SIDB file like:
>>  >         # Wget -VSIDB 1.12
>>  >         which means even SIDB file version is 1.13, Wget treat it as
>>  >         version 1.12 file.
>>
>>
>> This may be a good idea, but I'm not sure it will be necessary (of
>>  course, it will be easy to add if it looks like it's useful).
> 
> Yes, maybe no need.

Well, when we get to new major numbers, at any rate, it'll almost
certainly be useful; I should've been more specific that I wasn't sure
about the minors.

>>  It might be a good idea to include a mechanism for specifying that
>>  certain headers must _not_ be ignored, and that if a particular version
>>  of Wget does not understand them, it should fail out. I'm having some
>>  trouble coming up with a case where we would actually need this, but it
>>  really doesn't hurt to build it in just in case.
> 
> Yes, but if Wget does not understand certain mandatory headers, it does not
> know how it can fail out. So Wget should fail out if it cannot find certain
> mandatory headers. Do I make sense?

Right: that's why the mechanism needs to be in place from the beginning,
so that even though they're new headers, Wget can understand that it
should not attempt to use the file if it can't understand these.

It could be something as easy as a naming convention, or header lines
beginning with a !, etc.

OTOH, maybe it doesn't really buy us anything over simply bumping the
major number... it was just an idea.
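As a rough illustration of the "!" convention, the following Python sketch shows how a reader might treat unknown headers. The set of known headers, the exception name UnsupportedHeader, and the function parse_header_line are all hypothetical; the "!" prefix itself is only one of the possibilities mentioned above, and comment handling is left out.

```python
# Headers this hypothetical reader understands (names taken from the sample).
KNOWN_HEADERS = {"Content-Length", "X-Wget-Status", "X-Wget-Current-Length"}

class UnsupportedHeader(Exception):
    """Raised when a must-understand header is not recognized."""

def parse_header_line(line, known=KNOWN_HEADERS):
    """Return (name, value), or None for an unknown header that may be ignored.

    A leading "!" on the header name marks it as must-understand: if the
    reader does not know it, the whole file has to be rejected.
    """
    name, value = line.split(":", 1)
    name, value = name.strip(), value.strip()
    critical = name.startswith("!")
    name = name.lstrip("!")
    if name in known:
        return name, value
    if critical:
        raise UnsupportedHeader(name)
    return None  # unknown but optional: silently skipped
```

The point of the sketch is that the rule lives in the reader from the first release: an old reader needs no knowledge of the new header, only of the marker.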

...
>>   # If the above didn't match Content-Length, that would indicate that
>>   # the connection had been prematurely closed (or that the server
>>   # lied).
>>     X-Wget-Status: success
>>   END RESOURCE http://foo.com/images/logo.png
>>
>>   # !!! Wget was killed here, index.html still not done downloading.
>>
>>   # New Wget invocation, continuing the session:
> 
> Here, is Wget writing into another file?

Yes.

>>   WGET SIDB 1.1   # different version of Wget, understands a little
>>                   # more, might write new kinds of info.
>>   TIME 2008-03-28T00:53:07
>>   CONTINUE RESOURCE http://foo.com/main/
>>     X-Wget-Current-Length: 57256 # size of current file on disk
>>     X-Wget-Status: ENETUNREACH
>>   END RESOURCE http://foo.com/main/
>>   END SESSION # Indicates Wget at least terminated normally
>>
>>   WGET SIDB 1.1
>>   TIME 2008-03-28T11:15:27
>>   CONTINUE RESOURCE http://foo.com/main/
>>     X-Wget-Current-Length: 57256
>>     X-Wget-HTTP-Status: 206 Partial Content
>>     Content-Length: 200000 # Length of the response
>>     X-Wget-Resource-Size: 257256 # Length of the file
>>     X-Wget-Status: success
>>   END RESOURCE http://foo.com/main/
>>   END SESSION  # All is well.
> 
> This is interim information which indicates Wget downloaded _part_ of a file.
> I'm not sure if this part is necessary, because I was thinking Wget writes
> into the SIDB only information about _downloaded_ files.

No, not interim information; but you may be right that information about
the partial content (namely, the Content-Length header) isn't really all
that useful.

The "206 Partial Content" bit is actually meant to reflect that Wget,
knowing that it had the first ~56k, asked the server for just the rest
(partial content).
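The arithmetic behind that last block can be sketched in a few lines of Python. The function names are hypothetical; the "bytes=N-" form is the standard HTTP Range syntax for "everything from offset N onward", and the numbers are the ones from the sample.

```python
def range_header(current_length):
    """Value of the Range request header asking for the rest of the file."""
    return f"bytes={current_length}-"

def resource_size(current_length, content_length):
    """Full size of the resource after a 206 response.

    current_length: bytes already on disk (X-Wget-Current-Length)
    content_length: length of the partial response body (Content-Length)
    """
    return current_length + content_length

# Values from the sample: 57256 bytes on disk, 200000 more in the response.
assert range_header(57256) == "bytes=57256-"
assert resource_size(57256, 200000) == 257256  # X-Wget-Resource-Size
```

Seen this way, Content-Length in the 206 block is derivable from the other two values, which is why recording it may be of limited use.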

>>  It's not clear to me that we actually _need_ the minor number as part of
>>  the SIDB format version. The minor number is useful in HTTP, mainly to
>>  negotiate between two different programs which version will be used for
>>  communication. But, since Wget will ignore the headers it doesn't
>>  understand _anyway_, and any other important changes will pretty much
>>  require a major version dump, does it actually make sense to distinguish
                            ^^^^
(I meant "bump".)
>>  an SIDB 1.0 from an SIDB 1.1?
> 
> At least the minor version would help when we check the contents of an SIDB
> file, in cases like, "why is this item written/not written here?"

That's true; but actually, using the Wget version number instead could
be more informative in that way. We could write that information as well
 (but give it no semantic meaning: just intended for human readers).
That way, we wouldn't have to remember to be sure to bump the SIDB
version number every time we add a new header type (I'm not as worried
about the major version bumps: I think we'll remember to bump for truly
incompatible changes).
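A minimal Python sketch of a reader that follows this idea: only the major number decides whether the file is usable, and the minor number is parsed but carries no meaning. The "WGET SIDB <major>.<minor>" line is from the sample; the function names and the SUPPORTED_MAJOR constant are hypothetical.

```python
SUPPORTED_MAJOR = 1  # assumption: the only major version this reader knows

def parse_sidb_version(line):
    """Parse a 'WGET SIDB <major>.<minor>' line into (major, minor).

    Anything after '#' is treated as a comment, as in the sample file.
    """
    fields = line.split("#", 1)[0].split()
    if len(fields) != 3 or fields[:2] != ["WGET", "SIDB"]:
        raise ValueError(f"not an SIDB version line: {line!r}")
    major, _, minor = fields[2].partition(".")
    return int(major), int(minor or 0)

def can_read(line, supported_major=SUPPORTED_MAJOR):
    """True if the file's major version matches; the minor is ignored."""
    major, _minor = parse_sidb_version(line)
    return major == supported_major
```

With this rule, can_read accepts both "WGET SIDB 1.1" and "WGET SIDB 1.13" and rejects "WGET SIDB 2.0"; a Wget version string written alongside would be skipped by the reader and kept purely for people inspecting the file.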

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH7LZj7M8hyUobTrERAvWBAJ9gceAsyZFByVRQOo0M2HuEuyIFMwCdGWmH
TDj13cIJMdRi4rDR4IU4QKY=
=/IX4
-----END PGP SIGNATURE-----
