1. Because if it were predictable, it would inevitably appear in the actual data sometimes (e.g. imagine that the Avro documentation, which states what the sync marker is, is downloaded by a web crawler and stored in an Avro data file; then the sync marker will appear in the actual data). Data may also come from malicious sources; making the marker random makes this infeasible to exploit.
2. Possibly, but extremely unlikely. The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data is on the order of 10^-24 (about 10^15 starting positions, each matching with probability 2^-128). It's more likely that your data center is wiped out by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).
3. If the sync marker does appear in your data, it only breaks reading the file if you also happen to seek to that place in the file. If you just read over it sequentially, nothing happens.

Martin

On 23 January 2013 21:09, Josh Spiegel wrote:
> As I understand it, Avro container files contain synchronization markers
> every so often to support splitting the file. See:
> https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
>
> (1) Why isn't the synchronization marker the same for every container file?
> (i.e. what is the point of generating it randomly every time)
>
> (2) Is it possible, at least in theory, for naturally occurring data to
> contain bytes that match the sync marker? If so, would this break
> synchronization?
>
> Thanks,
> Josh
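
P.S. For anyone who wants to check the estimate in point 2, here is a rough back-of-the-envelope sketch in Python. It assumes that "a petabyte" means 10^15 bytes and that the data is uniformly random, and it uses a simple union bound over all starting offsets, so the result is an upper bound rather than an exact probability. The variable names are only illustrative.

```python
from fractions import Fraction

MARKER_BYTES = 16        # an Avro sync marker is 16 bytes long
DATA_BYTES = 10**15      # assumption: "a petabyte" taken as 10^15 bytes

# Chance that one fixed 16-byte marker matches uniformly random data
# at a single given offset: 1 / 256^16 = 2^-128.
p_per_offset = Fraction(1, 256**MARKER_BYTES)

# Union bound: sum that chance over every offset where a full marker could start.
offsets = DATA_BYTES - MARKER_BYTES + 1
p_upper_bound = offsets * p_per_offset

print(f"upper bound on a false match: {float(p_upper_bound):.2e}")
```

Under these assumptions the bound comes out at a few times 10^-24. Real data is not uniformly distributed, so treat this as an order-of-magnitude sanity check only.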
