[freenet-tech] DistribNet Update

Kevin Atkinson Mon, 15 Apr 2002 00:20:03 -0700


I now have some actual code for DistribNet, which I will post later this
week as soon as the DistribNet SourceForge project becomes active.


The only thing I have implemented so far is the storage and retrieval of
data keys.

Here is the latest overview of DistribNet.  I describe the details of how 
data keys are managed at the end.  Please let me know what you think of
them.  I am especially interested in what you think of my choice of block 
sizes.

DistribNet

A global peer-to-peer Internet file system that anyone can tap into
or add content to.

Meta Goals:

*) To allow anyone, possibly anonymously, to publish web sites without
    having to pay a commercial provider for the bandwidth or having to
    put up with the increasingly ad-ridden free web sites.  One should
    not have to worry about bandwidth considerations at all.

*) Bring back the sense of community that was present on the Internet
    before it became so commercialized.

*) Serve as an efficient replacement for current file sharing networks
    such as Morpheus and Gnutella.

*) To have the network stable and working before some commercial
    company designs a proprietary network, similar to what I envision,
    that can only be accessed via software that is freely available
    but not under an FSF-approved free license.

(Possibly Impossible) Goals:

*) *Really* fast lookup to find data.  The worst case should be O(log(n))
    and the average case should be O(1) or very close to it.

*) Actually retrieving the data should also be really fast.  Popular
    data should be sitting on the same subnet.  On average it should
    be as fast as or faster than a typical web site (such as Slashdot,
    Google, etc.).  It should make effective use of the topology of
    the Internet to minimize network load and maximize performance.

*) General searching based on keywords will be built into the protocol
    from the beginning.  The searching facility will be designed in
    such a way as to make message boards trivial to implement.

*) Ability to update data while keeping old revisions around so data never
    disappears until it is truly unwanted.  No one person will have
    the power to delete data once it spreads throughout the network.

*) Will try very hard to keep all but the most unpopular content from
    falling off the network.  Basically, before deleting a locally
    unpopular key, a node will first check whether other nodes are
    storing the key and how popular they find it.  If not enough nodes
    are storing the key and there is any indication that the data may
    be useful at a later date, the node will not delete it unless it
    absolutely has to.  And if it does delete it, it will first try
    uploading it to other nodes with more disk space available.

*) Ability to store data indefinitely if someone is willing to provide
    the space for it (and being able to find that data in O(log(n))
    time).

*) Extremely robust so that the only way to kill the network is to
    disable almost all of the nodes.  The network should still
    function even if say 90% of it goes down.

*) Extremely efficient CPU-wise so that a fully functional node can run
    in the background and only take 1-2% of the CPU.

Applications:

I would like the protocol to be able to effectively support (i.e. without
the ugly hacks that many of the applications for Freenet use):

1) Efficient Web-like sites (with HTTP gateway to make browsing easy)
2) Efficient sharing of files large and small.
3) Public message forums (with IMAP gateway to make reading easy)
4) Private Email (with the message encrypted so only the intended
    recipient can read it, again with IMAP gateway)
5) Streaming Media
6) Online Chat (with possible IRC or similar gateway)

Anti-Goals:

(Also see the Philosophy section below for why I don't find these issues
that important.)

*) Complete anonymity for the browser.  I want to focus on performance
    first and anonymity second.  In fact I plan to use extensive
    logging in the development versions so that I can track network
    performance and quickly catch performance bugs.  As DistribNet
    stabilizes, anonymity will be improved at the expense of logging.

    The initial version will only use cryptography when absolutely
    necessary (for example key signing).  Most communications will be
    done in the clear.  After DistribNet stabilizes, encryption will
    slowly be added.  When I add encryption I will carefully monitor
    the effect it has on CPU load, and if it proves to be expensive I
    will make it optional.

    Please note that I still wish to allow for anonymous posting of
    content.  However, without encryption, it probably won't be as
    anonymous as Freenet or GNet.

*) Data in the cache will be stored in a straightforward manner.  No
    attempt will be made to prevent the node operator from knowing
    what is in his own cache.  Also, very little attempt will be made
    to prevent others from knowing what is in a particular node's
    cache.

Philosophy:

*) I have nothing against complete anonymity; it is just that I am
    afraid that both Freenet and GNet are designed more around the
    anonymity and privacy issues than they are around the performance
    and scalability issues.

*) For most types of content the level of anonymity that Freenet and
    GNet offer is simply not needed.  Even for copyrighted and
    censored material there is, in general, little risk in actually
    viewing the information because it is simply impractical to go
    after every single person who accesses forbidden information.
    Almost all of the time the lawsuits and such go after the original
    distributors of the information and not the viewers.  Therefore
    DistribNet will aim to provide anonymity for distributing
    information, but not for actually viewing it.  However, since
    there *is* some information where even viewing it is extremely
    risky, DistribNet will eventually be able to provide the same
    level of anonymity that Freenet or GNet offers, but it will be
    completely optional.

*) I also believe that knowing what is in one's own datastore and
    being able to block certain types of material from one's own node
    is not that big of a deal.  Unless almost everyone blocks a
    certain type of information, the availability of blocked
    information will not be harmed.  This is because even if 90% of
    the nodes block, say, kiddie porn, the information will still be
    available on the other 10% of the nodes which, if the network is
    designed correctly, should be more than enough for anyone to get
    at blocked information.  Furthermore, since the source code for
    DistribNet will be protected under the GPL or a similar license,
    it will be completely impractical for others to force a
    significant number of nodes to block information.  Due to the
    dynamic nature of the cache I think it would be legally difficult
    to hold anyone responsible for the contents of their cache, as it
    is constantly changing.

DistribNet Key Types:

There will essentially be two types of keys: map keys and data keys.
Map keys will be uniquely identified in a similar manner to Freenet's
SSK keys.  Data keys will be identified in a similar manner to Freenet's
CHK keys.

Map keys will contain the following information:

  * Short Description
  * Public Namespace Key
  * Timestamped Index pointers
  * Timestamped Data pointers

_At any given point in time_ each map key will only be associated with
one index pointer and one data pointer.  Map keys can be updated by
appending a new index or data pointer to the existing list.  By
default, when a map key is queried only the most recent pointer will
be returned.  However, older pointers are still there and may be
retrieved by specifying a specific date.  Thus, map keys may be
updated, but information is never lost or overwritten.
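
As a rough illustration of the append-only pointer list described above,
here is a minimal sketch in C.  It is not the actual DistribNet code: the
names TimedPtr, PtrHistory, history_append, history_lookup and
history_latest, the fixed capacity MAX_PTRS, and the use of seconds since
the epoch for timestamps are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_SIZE 20   /* size of a data key, as given later in this post */
#define MAX_PTRS 16   /* hypothetical fixed capacity for this sketch */

/* One timestamped pointer: "from this time on, the map key points here". */
struct TimedPtr {
    int64_t timestamp;          /* assumed: seconds since the epoch */
    uint8_t target[KEY_SIZE];   /* key of the data or index pointed to */
};

/* Append-only history, kept in strictly increasing timestamp order. */
struct PtrHistory {
    size_t count;
    struct TimedPtr ptrs[MAX_PTRS];
};

/* Append a new pointer.  Existing entries are never modified.
 * Returns 0 on success, -1 if the list is full or ts is not newer
 * than the last entry. */
int history_append(struct PtrHistory *h, int64_t ts,
                   const uint8_t target[KEY_SIZE])
{
    assert(h != NULL && target != NULL);
    if (h->count == MAX_PTRS)
        return -1;
    if (h->count > 0 && ts <= h->ptrs[h->count - 1].timestamp)
        return -1;
    h->ptrs[h->count].timestamp = ts;
    memcpy(h->ptrs[h->count].target, target, KEY_SIZE);
    h->count++;
    return 0;
}

/* Pointer that was in effect at time `when`: the latest entry whose
 * timestamp is <= when.  Returns NULL if the key did not exist yet. */
const struct TimedPtr *history_lookup(const struct PtrHistory *h, int64_t when)
{
    const struct TimedPtr *best = NULL;
    for (size_t i = 0; i < h->count; i++) {
        if (h->ptrs[i].timestamp > when)
            break;
        best = &h->ptrs[i];
    }
    return best;
}

/* Default query: the most recent pointer, or NULL if there is none. */
const struct TimedPtr *history_latest(const struct PtrHistory *h)
{
    return h->count ? &h->ptrs[h->count - 1] : NULL;
}
```

A map key would hold one such history for index pointers and one for data
pointers.  Updating appends; querying with a date uses history_lookup, so
older revisions stay reachable.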

Data keys will be very much like Freenet's CHK keys except that they will
not be encrypted.  Since they are not encrypted, delta compression may
be used to save space.

There will not be anything like Freenet's KSK keys, as those proved to
be completely insecure.  Instead, map keys may be requested without a
signature.  If there is more than one map key by that name, then a list
of keys is presented, sorted by popularity.  To make such a list
meaningful, every public key in DistribNet will have a descriptive
string associated with it.

Data Key Details:

Data keys will be stored in blocks with a maximum size of just under
32K.  If an object is larger than one block it will be broken down into
block-sized chunks and an index block, also with a maximum size of just
under 32K, will be created so that the final object can be
reassembled.  If an object is too big to be indexed by one index block,
the index blocks themselves will be split up and indexed in turn.  This
can be done as many times as necessary, therefore providing the ability
to store files of arbitrary size.  DistribNet will use 64-bit integers
to store the file size, therefore supporting file sizes up to 2^64-1
bytes.
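
To make the splitting rule concrete, the following C sketch counts how
many data blocks, index blocks and index levels an object of a given size
would need.  It uses the block size (32640) and keys-per-index-block
(1630) figures given further down in this post.  The names SplitPlan and
plan_split are made up for the example, and treating an empty object as
one (empty) data block is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE     32640u  /* 2^15 - 128 bytes per block */
#define KEYS_PER_INDEX 1630u   /* keys that fit in one index block */

struct SplitPlan {
    unsigned levels;        /* 0 means the object fits in one data block */
    uint64_t data_blocks;   /* number of data blocks */
    uint64_t index_blocks;  /* total index blocks over all levels */
};

/* Integer division rounding up, written to avoid overflow. */
static uint64_t ceil_div(uint64_t a, uint64_t b)
{
    assert(b != 0);
    return a / b + (a % b != 0);
}

/* Work out how an object of `size` bytes is split.  Each pass groups
 * the blocks of the previous level under index blocks until a single
 * root block remains. */
struct SplitPlan plan_split(uint64_t size)
{
    struct SplitPlan p = {0, 0, 0};
    uint64_t n;

    p.data_blocks = (size == 0) ? 1 : ceil_div(size, BLOCK_SIZE);
    n = p.data_blocks;
    while (n > 1) {
        n = ceil_div(n, KEYS_PER_INDEX);  /* index blocks at this level */
        p.index_blocks += n;
        p.levels++;
    }
    return p;
}
```

For example, an object of 32641 bytes needs two data blocks and one index
block (one level), while an object one byte larger than 1630 * 32640 bytes
needs a second index level.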

Data keys will be retrieved block by block rather than all at once.  When
a client first requests a data key that is too large to fit in a block,
an index block will be returned.  It is then up to the client to figure
out how to retrieve the individual blocks.  For efficiency reasons a node
can be asked which blocks it has based on a given index block, rather
than having to be asked about each and every data block.
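
The post does not say what the answer to a "which blocks do you have"
query looks like.  One possible shape, shown here purely as an assumption,
is a bitmap with one bit per key in the index block.  LocalStore,
store_has and blocks_held are hypothetical names, and the linear search
stands in for whatever lookup structure a real node would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_SIZE     20
#define MAX_KEYS     1630
#define BITMAP_BYTES ((MAX_KEYS + 7) / 8)

/* Hypothetical local store: a flat array of the keys this node holds. */
struct LocalStore {
    const uint8_t (*keys)[KEY_SIZE];
    size_t count;
};

/* Returns 1 if the store holds `key`, 0 otherwise (linear search). */
static int store_has(const struct LocalStore *s, const uint8_t key[KEY_SIZE])
{
    for (size_t i = 0; i < s->count; i++)
        if (memcmp(s->keys[i], key, KEY_SIZE) == 0)
            return 1;
    return 0;
}

/* Answer for one index block: bit i of `bitmap` is set when the node
 * holds the block named by idx_keys[i].  Returns how many are held. */
size_t blocks_held(const struct LocalStore *s,
                   const uint8_t idx_keys[][KEY_SIZE], size_t key_count,
                   uint8_t bitmap[BITMAP_BYTES])
{
    size_t held = 0;

    assert(key_count <= MAX_KEYS);
    memset(bitmap, 0, BITMAP_BYTES);
    for (size_t i = 0; i < key_count; i++) {
        if (store_has(s, idx_keys[i])) {
            bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
            held++;
        }
    }
    return held;
}
```

With this shape a full answer for one index block is at most 204 bytes,
compared with up to 1630 separate queries.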

Data and index blocks will be indexed based on the SHA-1 hash of their
contents.  The content of the index block does not include the index
header, therefore allowing the client to verify that a block really is
an index block.

The exact numbers are as follows:

Data block size:                         2^15 - 128 = 32640 bytes
Index block header size:                 40 bytes
Key size:                                20 bytes
Maximum number of keys per index block:  (32640 - 40) / 20 = 1630

Maximum object sizes:

direct   => 2^14.99 bytes, about 31.9 KB
1 level  => 2^25.66 bytes, about 50.7 MB
2 levels => 2^36.34 bytes, about 80.8 GB
3 levels => 2^47.01 bytes, about 129 TB
4 levels => 2^57.68 bytes
5 levels => 2^68.35 bytes (but limited to 2^64 - 1)
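
The table follows from multiplying the block size by 1630 once per index
level.  A small C sketch of that arithmetic, with the result capped at
2^64 - 1 as the 64-bit size field requires (max_object_size is a name
chosen for this example):

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE     UINT64_C(32640)  /* bytes per block */
#define KEYS_PER_INDEX UINT64_C(1630)   /* keys per index block */

/* Largest object, in bytes, that can be stored with `levels` levels of
 * index blocks.  Saturates at UINT64_MAX instead of overflowing. */
uint64_t max_object_size(unsigned levels)
{
    uint64_t size = BLOCK_SIZE;

    while (levels-- > 0) {
        if (size > UINT64_MAX / KEYS_PER_INDEX)
            return UINT64_MAX;  /* would exceed the 64-bit size field */
        size *= KEYS_PER_INDEX;
    }
    return size;
}
```

max_object_size(1) is 53203200 bytes (about 50.7 megs) and
max_object_size(2) is 86721216000 bytes (about 80.8 gigs), matching the
table; at five levels the 64-bit limit is what applies.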

Index layout:

struct IdxBlock {
  char   id[6];
  uint16 key_count;
  uint64 real_size;
  byte   pad[24];
  byte   keys[1630][20];
};

where id is "IDX?" and ? is the level.

Data blocks do not contain a header; however, the client is told ahead of
time what type of block it is receiving.
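
For anyone who wants to check the numbers, here is the same index layout
written with standard C fixed-width types and compile-time size checks.
The helper functions idx_init and idx_add_key are hypothetical, encoding
the level as an ASCII digit is an assumption (the post only says the id is
"IDX?"), and the byte order of the integer fields on the wire is not
specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_SIZE 20
#define MAX_KEYS 1630

struct IdxBlock {
    char     id[6];                    /* "IDX?" where ? is the level */
    uint16_t key_count;                /* number of keys in use */
    uint64_t real_size;                /* size of the whole object */
    uint8_t  pad[24];                  /* pads the header to 40 bytes */
    uint8_t  keys[MAX_KEYS][KEY_SIZE]; /* hashes of the child blocks */
};

/* The header is 6 + 2 + 8 + 24 = 40 bytes, and 40 + 1630 * 20 = 32640. */
_Static_assert(offsetof(struct IdxBlock, keys) == 40, "header must be 40 bytes");
_Static_assert(sizeof(struct IdxBlock) == 32640, "block must be 32640 bytes");

/* Zero a block, then set its id and real size.
 * Assumption: the level is stored as a single ASCII digit, 1 to 9. */
void idx_init(struct IdxBlock *b, unsigned level, uint64_t real_size)
{
    assert(b != NULL && level >= 1 && level <= 9);
    memset(b, 0, sizeof *b);
    memcpy(b->id, "IDX", 3);
    b->id[3] = (char)('0' + level);
    b->real_size = real_size;
}

/* Append one key.  Returns 0 on success, -1 if the block is full. */
int idx_add_key(struct IdxBlock *b, const uint8_t key[KEY_SIZE])
{
    assert(b != NULL && key != NULL);
    if (b->key_count == MAX_KEYS)
        return -1;
    memcpy(b->keys[b->key_count], key, KEY_SIZE);
    b->key_count++;
    return 0;
}
```

With natural alignment the fields fall at offsets 0, 6, 8, 16 and 40, so
no compiler padding is needed; the static assertions fail the build on any
platform where that does not hold.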

Why 32640?

A block size of just under 32K was chosen because I wanted a size
which will allow most text files to fit in one block, most other files
to fit with one level of indexing, and just about anything anybody
would think of transferring on a public network to fit in two levels,
and 32K worked out perfectly.  Also, files around 32K are rather rare,
therefore preventing a lot of unnecessary splitting of files that
don't quite make it.  32640 rather than exactly 32K was chosen to allow
some additional information to be transferred with the block without
pushing the total size over 32K.  32640 can also be stored nicely in a
16-bit integer without having to worry whether it's signed or unsigned.

Lookup Details:

Lookup will probably be done using the Chord protocol.  See
http://www.pdos.lcs.mit.edu/chord/

-- 
http://kevin.atkinson.dhs.org

