On Fri, Jan 24, 2003 at 12:01:20AM +0000, Matthew Toseland wrote:
> On Thu, Jan 23, 2003 at 11:50:57PM +0000, Gordan Bobic wrote:
> > On Thursday 23 Jan 2003 5:18 pm, Matthew Toseland wrote:
> > 
> > > > Are files never separated into segments, unless FEC is used? What are the
> > > > minimum and maximum sizes for FEC segments?
> > >
> > > It is possible to insert non-redundant splitfiles. They are unreliable
> > > and slow. FEC splitfiles use "chunks" (segments are something else :)).
> > 
> > OK, terminology noted. :-)
> > 
> > > Fproxy uses 256kB to 1MB chunks, but other clients could use other
> > > sizes. That is however the recommended range for most uses.
> > 
> > ...
> > 
> > > Splitfiles therefore can fail if too many of the chunks are no longer
> > > fetchable.
> > 
> > Is it possible to use smaller chunks? Can you give me a link to a document 
> > that explains how to control the use of FEC via fproxy? For example, can I 
> > force the use of FEC for files smaller than 1 MB?
> No. Your application would not use Fproxy anyway; it would probably use
> a library to talk directly to the node using FCP. What language are you
> considering?
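(For what it's worth, talking FCP from Java is only a few lines. The sketch
below is purely illustrative: it assumes the node's FCP interface is on
localhost:8481, that messages are newline-delimited fields ending in
EndMessage, and that a fresh connection starts with the four session bytes
0,0,0,2 -- check the FCP spec rather than trusting my memory for the exact
framing.)

    import java.io.*;
    import java.net.Socket;

    public class FcpHello {
        public static void main(String[] args) throws IOException {
            // Port and framing are assumptions -- see the FCP spec.
            Socket s = new Socket("localhost", 8481);
            OutputStream out = s.getOutputStream();
            out.write(new byte[] { 0, 0, 0, 2 });  // FCP session identifier (assumed)
            out.write("ClientHello\nEndMessage\n".getBytes("UTF-8"));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {  // expect a NodeHello back
                System.out.println(line);
                if (line.equals("EndMessage")) break;
            }
            s.close();
        }
    }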
> > 
> > > > > However if you need to store
> > > > > chunks of more than a meg, you need to use redundant (FEC) splitfiles.
> > > >
> > > > As I said, I was looking for the limits on the small size, rather than
> > > > the large size. Now I know not to go much below 1 KB because it is
> > > > pointless. I doubt I'd ever need to use anything even remotely
> > > > approaching 1 MB for my application. I was thinking about using a size
> > > > between 1 KB and 4 KB, but wasn't sure if the minimum block size might
> > > > have been something quite a bit larger, like 64 KB.
> > >
> > > Well... I don't know. You gain performance from downloading many files
> > > at once (don't go completely over the top though... the splitfile
> > > downloader uses around 10)
> > 
> > Doesn't this depend entirely on the limits on the number of threads and 
> > concurrent connections, as set in the configuration file? And the hardware 
> > and network resources of course.
> It depends on lots of things: the operating system, the memory limit set
> for the process by the command-line arguments to the Java VM, and of
> course the hardware. The current Freenet node has fairly limited
> performance due to not using nonblocking I/O, for example. Also, some
> operating systems limit the number of file descriptors that can be open
> at once...
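(For example, a node started with something like

    java -Xmx96m freenet.node.Main

-- the class name here is from memory, treat it as illustrative -- can never
grow its heap past 96MB regardless of how much RAM the box has, and every
splitfile thread has to fit inside that budget.)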
> > 
> > > - but it means you insert more data, and you
> > > have to decode it; it's not designed for such small chunks, but we know
> > > it can work with sizes close to that from work on streaming... The
> > > overheads on a 1kB CHK are significant (something like 200 bytes?), I'd
> > > use 4kB chunks, at least...
> > 
> > Is this overhead included in the amount of space consumed? i.e. does this mean 
> > that 1 KB file + 200 bytes of overhead => 2KB of storage? Or is the overhead 
> > completely separate?
> Overhead is separate. The actual data content of the file is rounded up
> to the next power of 2.

Oh, one thing. Data content includes metadata.
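So, a rough worked example (my numbers, not measured): a 1kB record whose
metadata brings it to, say, 1.2kB gets padded up to the next power of two,
i.e. 2kB, and the ~200 bytes of per-key overhead sit on top of that. If you
want to estimate storage in your client, the padding is just:

    // Round data-plus-metadata length up to the next power of two (sketch).
    static long paddedLength(long dataPlusMetadata) {
        long p = 1;
        while (p < dataPlusMetadata) p <<= 1;
        return p;  // e.g. 1229 bytes -> 2048 bytes stored
    }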

> > 
> > Is the overhead of 200 bytes fixed for all file sizes, or does it vary with 
> > the file size?
> No, it varies depending on (mostly) the file type.
> > 
> > > > > > The reason for this is that I am trying to design a database
> > > > > > application that uses Freenet as the storage medium (yes, I know about
> > > > > > FreeSQL, and it doesn't do what I want in the way I want it done).
> > > > > > Files going missing are an obvious problem that needs to be tackled.
> > > > > > I'd like to know what the block size is in order to implement
> > > > > > redundancy padding in the data by exploiting the overheads produced
> > > > > > by the block size, when a single item of data is smaller than the
> > > > > > block that contains it.
> > >
> > > You do know that Freenet is lossy, right? Content which is not accessed
> > > very much will eventually expire.
> > 
> > Yes, this is why I am thinking about using DBR. There would potentially be a 
> > number of nodes that would once per day retrieve the data, compact it into 
> > bigger files, and re-insert it for the next day. This would be equivalent to 
> > vacuum (PostgreSQL) and optimize (MySQL) commands.
> > 
> > The daily operation data would involve inserting many small files (one file 
> > per record in a table, one file per delete flag, etc.)
> > 
> > This would all be gathered, compacted, and re-inserted. Any indices would also 
> > get re-generated in the same way.
> Hmm. Interesting.
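Thinking out loud about the compaction step: it is essentially "concatenate
yesterday's small files and remember the offsets". A toy version -- nothing
here is Freenet-specific, and the record fetching and reinserting is entirely
up to your FCP client code:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;

    // Pack many small records into one compacted blob plus an offset index.
    // 'records' maps record key -> byte[]; 'indexOut' receives key -> {offset, length}.
    static byte[] compact(Map records, Map indexOut) throws IOException {
        ByteArrayOutputStream blob = new ByteArrayOutputStream();
        for (Iterator it = records.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            byte[] record = (byte[]) e.getValue();
            indexOut.put(e.getKey(), new int[] { blob.size(), record.length });
            blob.write(record);
        }
        return blob.toByteArray();  // reinsert this (and the index) as one larger file
    }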
> > 
> > > > > > This could be optimized out in run-time to make no impact on
> > > > > > execution speed (e.g. skip downloads of blocks that we can
> > > > > > reconstruct from already downloaded segments).
> > > > >
> > > > > Hmm. Not sure I follow.
> > > >
> > > > A bit like a Hamming code, but allowing random access. Because it is
> > > > latency plus download time that is slow, fetching fewer files is good
> > > > for performance, so I can reconstruct some of the pending segments
> > > > rather than downloading them. Very much like FEC, in fact. :-)
> > >
> > > Latency is slow. Downloading many files in series is slow. Downloading
> > > many files in parallel, as long as you don't get bogged down waiting for
> > > the last retry on the last failing block in a non-redundant splitfile,
> > > is relatively fast. By all means use your own codes!
> > 
> > I haven't decided what to use for redundancy yet. My biggest reason for using 
> > my own method is that it would allow me to pad files to a minimal sensible 
> > size (I was thinking about 4 KB), and enable me to skip chunks that are not 
> > needed, or back-track to reconstruct a "hole" in the data from the files that 
> > are already there.
> > 
> > But FEC is very appealing because it already does most of that, so there would 
> > be less work involved in the implementation of my application.
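To make the "reconstruct a hole instead of fetching it" idea concrete, here
is a toy single-parity scheme -- my illustration, not FEC and not anything
Freenet does for you. With equal-length (padded) blocks a and b and a stored
parity p = a xor b, any one missing block can be rebuilt from the other two,
so a skipped or failed fetch of b costs nothing if a and p are already in hand:

    // XOR two equal-length blocks; parity = xorBlocks(a, b) is inserted
    // alongside a and b, and later b = xorBlocks(a, parity) if b is missing.
    static byte[] xorBlocks(byte[] x, byte[] y) {
        byte[] out = new byte[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (byte) (x[i] ^ y[i]);
        return out;
    }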
> > 
> > > > Of course, I might not bother if I can use FEC for it instead, provided
> > > > it will work with very small file sizes (question I asked above).
> > >
> > > Well...
> > >
> > > FEC divides the file into segments of up to 128 chunks (I think).
> > > It then creates 64 check blocks for the 128 chunks (obviously fewer if
> > > fewer original chunks), and inserts the lot, along with a file
> > > specifying the CHKs of all the different chunks inserted for each
> > > segment.
> > 
> > Doesn't that mean that with maximum chunk size of 1 MB, this limits the file 
> > size to 128 MB? Or did I misunderstand the maximum chunk size, and it is 
> > purely a matter of caching as a factor of the store size?
> No. After 128MB, we use more than one segment. Within each segment, we
> need any 128 of the 192 chunks to reconstruct the file.
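To put numbers on that (using the 128-data/64-check figures above; the exact
constants are whatever the codec is compiled with, and I am assuming the
check count stays proportional in a short final segment): a 300MB file at 1MB
chunks is 300 data chunks, i.e. ceil(300/128) = 3 segments -- two of
128 data + 64 check = 192 blocks each, and a last one of 44 data + 22 check,
needing any 44 of its 66 to decode. As a sketch:

    // Rough segment/check-block arithmetic for a FEC splitfile (constants as above).
    static void printFecLayout(long fileSize, int chunkSize) {
        int dataChunks = (int) ((fileSize + chunkSize - 1) / chunkSize);
        for (int done = 0; done < dataChunks; done += 128) {
            int n = Math.min(128, dataChunks - done);  // data chunks in this segment
            int k = n / 2;  // 50% check blocks (assumed proportional for short segments)
            System.out.println(n + " data + " + k + " check; any " + n
                    + " of " + (n + k) + " decode this segment");
        }
    }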
> > 
> > What is the smallest file size with which FEC can be used sensibly? Would I be 
> > correct in guessing this at 2 KB, i.e. two 1 KB chunks with one 1 KB check 
> > block?
> I wouldn't recommend it.
> > 
> > Is FEC fixed at 50% redundancy, or can the amount of redundancy be controlled 
> > (e.g. reduced to 25%, if requested)? Or has it been tried and tested that 
> > around 50% gives best results?
> Hmm. At the moment it is hard-coded. At some point we may change this.
> The original libraries support other amounts.
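Concretely: at the current hard-coded 50%, a full segment is 128 data + 64
check = 192 inserted blocks, so you upload half as much again as the raw
data; a 25% setting would mean only 32 check blocks per segment, at the cost
of tolerating far fewer lost chunks.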
> > 
> > Thanks.
> > 
> > Gordan
> > 
> 
> -- 
> Matthew Toseland
> [EMAIL PROTECTED]
> Full time freenet hacker.
> http://freenetproject.org/
> Freenet Distribution Node (temporary) at 
> http://amphibian.dyndns.org:8889/x-aYyDpMj2E/
> ICTHUS.



-- 
Matthew Toseland
[EMAIL PROTECTED]
Full time freenet hacker.
http://freenetproject.org/
Freenet Distribution Node (temporary) at
http://amphibian.dyndns.org:8889/x-aYyDpMj2E/
ICTHUS.
