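Avro data files start with a 4-byte "magic" prefix: the ASCII characters "Obj" followed by the byte 0x01. Checking for it might help with detection. In such a file the schema is always embedded in the header's "meta" field, under the "avro.schema" key.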
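
In case it is useful, here is a rough Python sketch of that check. It uses no Avro library and only parses the header of a binary Avro data file: the magic bytes, then the metadata map. The function names (read_long, read_bytes, read_avro_schema) are my own, not part of any Avro API. It does not cover JSON-encoded data, which has no such header.

```python
import json

AVRO_MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the byte value 1


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro header")
        value = byte[0]
        result |= (value & 0x7F) << shift
        if not (value & 0x80):
            break
        shift += 7
    # Undo the zigzag encoding to recover the signed value.
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream):
    """Read a length-prefixed byte sequence (Avro 'bytes' or 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_avro_schema(stream):
    """Return the embedded schema (parsed JSON), or None if the stream
    does not start with the Avro magic or carries no 'avro.schema' key."""
    if stream.read(4) != AVRO_MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows; skip it
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    schema = meta.get("avro.schema")
    return json.loads(schema) if schema is not None else None
```

Usage would be something like `read_avro_schema(open(path, "rb"))`, where `path` is whatever file you are classifying. A result of None means the file is not a binary Avro data file, or at least not one this sketch recognises.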

On Thu, Jan 17, 2013 at 2:11 PM, Public Network Services <[email protected]> wrote:
> Folks,
>
> I am involved in a project to extract data from a large number of files
> (to be provided at some point), in numerous formats, among which is some
> Avro files (both binary and JSON-encoded), and thus I am looking for the
> best way to tackle this.
>
> One of the things we would (ideally) like to do is auto-classify the data
> generically, i.e. read a few lines or bytes off a file and be able to tell
> what kind of format it is.
>
> This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
> sure how this would be done for Avro.
>
> For one thing, there is the necessity of a Schema, about which the
> documentation says that
>
> - "Avro data is always serialized with its schema. Files that store
> Avro data should always also include the schema for that data in the same
> file."
>
> However, the Java code examples posted on the project website imply that
> the Schema is supplied as a separate file and I am not sure whether this is
> only the case with RPC.
>
> Are there any code examples for detecting the encoding format
> (binary/json) of the data file, assessing whether there is a schema
> embedded in it and extracting that schema?
>
> Thanks!
>
