[ Postmaster: Please forward this message to developers of your
distributed programs, such as Freenet and Mojo Nation. -- Brad Allen
<[EMAIL PROTECTED]> ]
Subject: Distributed ideas missing -- pls fwd to developers:
appropriate redundancy, sizing flexibility, appropriate
authentication and security, compression, gateways and caching.
Back in 1984 (A.D., Gregorian Calendar, Earth, Sol, Milky Way) I
pretty much thought of all of the concepts needed for reliable
distributed information (and, for that matter, thought, but I'm not
covering that here). Judging by one review -- "MojoNation V0.956.1
Review Posted by erik on Sun Mar 4th, 2001 02:58:26 PM, Reviews" on
InfoAnarchy.Org -- your system is missing some of the things I
thought of:
* Redundancy. According to that review, your system does not
  analyze the probability of pieces of a file being available or
  the probability of servers being available. Each server needs a
  probability score, plus a duality of both a minimum probability
  redundancy and a minimum absolute redundancy. For instance, a
  transparent gateway onto an invisible virtual redundant network,
  with many well-designed parallel/redundant systems and effective
  RAID drives behind it, may earn a rather high probability score
  and contribute heavily toward some probability threshold, say
  900% (borrowing the figure DNS used for the TLDs, just as an
  example); yet a set of servers summing to more than 900%
  probability can still be too small combinatorially, because the
  disconnection of one very good server would leave so big a gap
  that the possibility is itself a big problem. Therefore a minimum
  of, say, 15 servers per file would also be enforced. Perhaps a
  downward-sloping (diminishing-returns) function f could be
  applied to the actual probabilities before they are summed, so
  that a single metric suffices and 2*f(100%) is less than
  4*f(40%); the fact that there are four servers rather than two
  then improves the score, as it should. A sketch of this check
  appears after this list. Choosing good functions is part of the
  mathematics of the program and requires good long-term analysis
  and quality control.
* Automatic flexibility in sizing. Some bandwidth is really slow
  (like a low-hertz connection to Pluto -- which raises another
  issue, high latency: it may make sense to start programming
  networks today to handle high-latency situations just as
  reliably and efficiently as low-latency ones, so that space
  travel will not require modifications to the code); for
  instance, a 56Kbps modem is quite slow compared to a 10 megabit
  per second cable modem. It is quite OK to place a one-megabyte
  "chunk" on a 10 megabit per second cable modem with excellent
  connections (say, a server running in reverse on a typical Time
  Warner of Manhattan Road Runner connection -- typical download
  speeds on that system are quite impressive, even across
  administrative barriers to other networks, since they have great
  connectivity), whereas even a 20KB chunk can be equivalent or
  harder on some legacy phone-modem networks. This flexibility in
  sizing would use the recorded probability and speed experience
  for the chunks as well; a sizing sketch also appears after this
  list. In this way, a user could allow files on their system to
  become part of the overall network, and those files, rather than
  having to be explicitly uploaded, would instead have the
  redundancy, probability, sizing, and network-closeness metrics
  automatically register as wrong, and the (distributed) network
  would automatically be jarred into migrating those files away
  from the server. The tedium of explicitly waiting for uploads
  would be mitigated, and mojo would be earned a bit less
  explicitly but more easily and efficiently to start with. The
  difference may be subtle -- simply "backgrounding" the initial
  uploads -- but since the files are indexed sooner, availability
  is genuinely better. A larger set of files, with clients wanting
  certain files before others, will help.
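Here is a minimal Python sketch of the dual-threshold redundancy
check from the first item above. The constants and the concave
weight function are illustrative placeholders only; sqrt is chosen
merely because it satisfies the stated inequality
2*f(100%) < 4*f(40%):

    import math

    MIN_SERVERS = 15       # absolute redundancy floor from the text
    SCORE_THRESHOLD = 9.0  # the "900%" threshold, as a weighted sum

    def weight(p):
        # Downward-sloping credit for reliability: four 40% servers
        # outscore two 100% servers, since
        # 2*sqrt(1.0) = 2.0 < 4*sqrt(0.4) ~= 2.53.
        return math.sqrt(p)

    def placement_ok(availabilities):
        # availabilities: estimated up-probability of each server
        # currently holding a piece of the file.
        score = sum(weight(p) for p in availabilities)
        return (score >= SCORE_THRESHOLD
                and len(availabilities) >= MIN_SERVERS)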
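And a minimal sketch of bandwidth-adaptive chunk sizing from the
second item; the target time, floor, and ceiling are made-up
numbers, and a real implementation would also weigh latency and the
per-chunk probability experience:

    TARGET_SECONDS = 8.0         # rough time budget per chunk
    MIN_CHUNK = 16 * 1024        # floor for very slow links
    MAX_CHUNK = 4 * 1024 * 1024  # ceiling for very fast links

    def chunk_size(bytes_per_second, success_rate):
        # Scale the chunk so one transfer finishes in roughly
        # constant time, discounted by how often this peer's
        # transfers have succeeded.
        size = int(bytes_per_second * TARGET_SECONDS * success_rate)
        return max(MIN_CHUNK, min(MAX_CHUNK, size))

    # A 56Kbps modem (~7 KB/s at 90% success) gets ~50KB chunks; a
    # 10Mbps cable modem (~1.25 MB/s) is clipped to the 4MB ceiling.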
I also did not see mention of these other ideas:
* Bitwise comparison of file subchunks, with compression as a goal.
  Unique IDs would be given to subchunks, so that files with
  equivalent subchunks have those subchunks stored much more
  efficiently, with a resulting better distribution; a sketch of
  such a subchunk store appears after this list. This way, for
  example, two substantially similar versions of a text document
  will take much less space to store exactly on the network. In
  more complex cases, compression algorithms could use existing
  files as a compression similarity cache for initialized
  compression sets, extracting more compression, and acceptability
  indexes could be used (for instance, where a certain amount of
  loss is acceptable, the downloading site can use a less perfect
  copy yet still gain speed if closer or more attainable previous
  or similar versions correspond to the initialized cache; this
  would have to be calculated with respect to the resources used
  all around, including CPU). This is actually useful in
  low-bandwidth systems. Take a supercomputer in space and a
  supercomputer on Earth as a model, then interpolate down to
  minicomputers and finally to lower-grade computers: the
  supercomputer in space may in certain situations have bad
  communications with Earth, yet gain wonderful speed increases
  from this type of compression. Also, in military or political
  situations, encryption may be restricted and so itself limit
  bandwidth; and bandwidth may be limited by other factors --
  telecommunications sabotage by corporate customer gouging, the
  affordable amount of cell-phone digital signal bandwidth, etc.
* Caching of various content types. HTTP and FTP content are the
  obvious cases; these could be cached. In fact, comparison in this
  realm is wonderful: it could reduce the redundancy that mirrors
  create in the HTTP and FTP realms to nothing more than the usual
  redundancy kept for one copy of the original, plus an index of
  all the various ways to refer to the original mirrors. A file
  that has not yet been compared can be fully downloaded bitwise by
  a set of computers close to the HTTP or FTP site, then compared
  against stored chunks of files that are close to the requesting
  client; instead of sending the entire file across, simply
  acknowledging the perfect copies through the compression system
  above makes a transparent download possible (see the
  gateway-transfer sketch after this list). The additional fact
  that the original sources are equal or similar (and compressible
  in that way) would also be recorded. This has obvious web speedup
  possibilities. This type of functioning could be programmed into
  a general gateway that could front any service (Gopher as well,
  and any other file distribution system, such as any or all of
  those mentioned on infoanarchy.org).
* Indexing should just be another object type; I am surprised to
  hear that an index search takes a long time. Indexing data should
  be distributed dynamically so that it can be found fast. For
  instance, if you order entries by strict byte order (string
  order, or lookup order), you can assign subranges according to
  redundancy and probability needs, such as:
    + server 2 contains index entries A-F and W
    + server 6 contains index entries A-Z
    + server 12 contains index entries G-V and X-Z
    + server 15 contains index entries A-G and Z
    + server 20 contains index entries H-Y
  The actual assignments would again be driven by probability,
  redundancy, and distribution metrics (in the case of Mojo Nation,
  including the Mojo metric), etc.; a routing sketch appears after
  this list. This way, searches would be quite fast.
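A minimal sketch of the subchunk store from the compression item
above, in Python. SHA-1 as the subchunk ID and the fixed-size split
are assumptions made for brevity; a content-defined split would
also catch insertions, not just aligned changes:

    import hashlib

    class ChunkStore:
        # Content-addressed store: a subchunk's unique ID is the
        # hash of its bytes, so equivalent subchunks appearing in
        # different files are stored only once.
        def __init__(self):
            self.chunks = {}

        def put(self, data):
            cid = hashlib.sha1(data).hexdigest()
            self.chunks.setdefault(cid, data)
            return cid

        def store_file(self, data, size=64 * 1024):
            # A file becomes a list of subchunk IDs; two similar
            # versions of a document share every subchunk that
            # compares bitwise equal.
            return [self.put(data[i:i + size])
                    for i in range(0, len(data), size)]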
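Building on that hypothetical ChunkStore, a sketch of the gateway
transfer from the caching item: the gateway near the HTTP or FTP
origin sends only the subchunk IDs, and the requesting client pulls
just the subchunks it cannot already acknowledge as perfect copies:

    def gateway_transfer(gateway, client, chunk_ids):
        # gateway, client: ChunkStore instances on either side of
        # the slow link; chunk_ids: the file's ordered subchunk IDs.
        missing = [cid for cid in chunk_ids
                   if cid not in client.chunks]
        for cid in missing:
            client.chunks[cid] = gateway.chunks[cid]
        return len(missing)  # subchunks actually sent across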
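And a sketch of routing a search to the index shards in the
indexing item, using the hypothetical range assignments shown
there:

    # Each server advertises the first-letter ranges it indexes.
    SHARDS = {
        2:  [('A', 'F'), ('W', 'W')],
        6:  [('A', 'Z')],
        12: [('G', 'V'), ('X', 'Z')],
        15: [('A', 'G'), ('Z', 'Z')],
        20: [('H', 'Y')],
    }

    def servers_for(term):
        # Route a lookup to every server whose advertised range
        # covers the term's leading byte; the overlap between
        # shards is what provides the redundancy.
        c = term[0].upper()
        return sorted(s for s, ranges in SHARDS.items()
                      if any(lo <= c <= hi for lo, hi in ranges))

    # servers_for("gnutella") -> [6, 12, 15]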
Of course, the entire protocol needs appropriate security, including
authentication and other cryptographic means. The major problem is
someone asserting that such-and-such a file is really the real thing
when it isn't; imagine a chunk of a file that has a worm in it.
Multiple hashes (MD5, RMD160, SHA1, CRCs, etc.) used to crosscheck
entire files against a larger set of confirmation sources would be
entirely pertinent; a sketch follows. Files should habitually carry
many signatures from many sources, and there should pretty much be a
full set of authenticity data for every object, whatever its
purpose. Also, encryption should be usable where appropriate.
Groups, chunks of encrypted files, etc. should be quite usable via
public-key cryptography even in such a distributed system.
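A minimal sketch of that multi-hash crosscheck, assuming Python's
hashlib; the CRCs are omitted, and ripemd160 availability depends
on the local OpenSSL build:

    import hashlib

    ALGOS = ('md5', 'sha1', 'ripemd160')

    def digests(data):
        # Several independent digests: passing off a wormy chunk as
        # genuine then requires colliding all of them at once.
        return {name: hashlib.new(name, data).hexdigest()
                for name in ALGOS
                if name in hashlib.algorithms_available}

    def verify(data, published):
        # published: digests gathered from a larger set of
        # confirmation sources; any mismatch rejects the chunk.
        actual = digests(data)
        return all(actual.get(name) == h
                   for name, h in published.items())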
I copy this to a small set of distributed object transmission (and
caching/storage) developers so that the ideas will not be lost or
slowed down by my not being rich enough to implement them on my own.
(My only lack is money to keep me alive and give me solace to
program; where I live, dog barks interrupt my every thought so as to
make it useless. I can only give this information to you out of both
desperation and the fact that I already ingrained it in my brain in
1984. This is not a total dump of my brain. Omission does not imply
forfeiture of ownership. Permission to use these obvious ideas, so
long as you do not lie about their origin, is granted.)
[This message was a quick hack, and is not intended to be the
perfect result of a quality meditation. Please do your own
brainstorming using this as seed and start your own meditation; or,
if you already have a performing implemented application, just
upgrade it to include these ideas so that other meditators can have
more seed sourced from implemented application usage.]
Sincerely yours,
Brad Allen <[EMAIL PROTECTED]>