Ok, makes sense. Thanks for the answer.
On Thu, Jan 24, 2013 at 4:47 AM, Martin Kleppmann <[email protected]> wrote:

> 1. Because if it was predictable, it would inevitably appear in the
> actual data sometimes (e.g. imagine the Avro documentation, stating
> what the sync marker is, is downloaded by a web crawler and stored in
> an Avro data file; then the sync marker will appear in the actual
> data). Data may come from malicious sources; making the marker random
> makes it unfeasible to exploit.
>
> 2. Possibly, but extremely unlikely. The probability of a given random
> 16-byte string appearing in a petabyte of (uniformly distributed) data
> is about 10^-23. It's more likely that your data center is wiped out
> by a meteorite
> (http://preshing.com/20110504/hash-collision-probabilities).
>
> 3. If the sync marker appears in your data, it only breaks reading the
> file if you happen to also seek to that place in the file. If you just
> read over it sequentially, nothing happens.
>
> Martin
>
> On 23 January 2013 21:09, Josh Spiegel <[email protected]> wrote:
> > As I understand it, Avro container files contain synchronization markers
> > every so often to support splitting the file. See:
> > https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
> >
> > (1) Why isn't the synchronization marker the same for every container
> > file? (i.e. what is the point of generating it randomly every time)
> >
> > (2) Is it possible, at least in theory, for naturally occurring data to
> > contain bytes that match the sync marker? If so, would this break
> > synchronization?
> >
> > Thanks,
> > Josh
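
For anyone who wants to check the order of magnitude in point 2, here is a rough back-of-the-envelope sketch in Python. It is not taken from the thread or from Avro itself: the function name `marker_collision_bound` is made up for illustration, "a petabyte" is assumed to mean 10^15 bytes, and the data is assumed to be uniformly random, as in the quoted estimate. It uses a simple union bound (number of possible start positions divided by the number of possible 16-byte strings), so it gives an upper bound rather than an exact probability.

```python
MARKER_BYTES = 16      # length of the sync marker in bytes
DATA_BYTES = 10**15    # assumption: "a petabyte" taken as 10^15 bytes


def marker_collision_bound(data_bytes: int, marker_bytes: int = MARKER_BYTES) -> float:
    """Union bound on the chance that one fixed random marker shows up
    somewhere in `data_bytes` of uniformly random data."""
    # Number of byte offsets at which a full marker could start.
    positions = max(data_bytes - marker_bytes + 1, 0)
    # Each offset matches with probability 1 / 256**marker_bytes.
    return positions / 2 ** (8 * marker_bytes)


p = marker_collision_bound(DATA_BYTES)
print(f"upper bound for one petabyte: {p:.1e}")
```

Under these assumptions the bound lands within an order of magnitude of the figure quoted above, which is small enough that the exact constant does not change the conclusion. Real data is not uniformly random, so this should be read as a sanity check on the scale, not as a guarantee.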
