You mean "Avro binary files", yes? What about Avro JSON files? Would there be a trick to assess whether such a file is Avro and not generic JSON?
On Fri, Jan 18, 2013 at 10:49 AM, Miki Tebeka <[email protected]> wrote:

> Avro files have a "magic" prefix of "Obj\x01"; this might help.
> The schema is always embedded in the Avro file in the "meta" field.
>
>
> On Thu, Jan 17, 2013 at 2:11 PM, Public Network Services <[email protected]> wrote:
>
>> Folks,
>>
>> I am involved in a project to extract data from a large number of files
>> (to be provided at some point), in numerous formats, among which are some
>> Avro files (both binary and JSON-encoded), and thus I am looking for the
>> best way to tackle this.
>>
>> One of the things we would (ideally) like to do is auto-classify the data
>> generically, i.e. read a few lines or bytes off a file and be able to tell
>> what kind of format it is.
>>
>> This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
>> sure how this would be done for Avro.
>>
>> For one thing, there is the necessity of a Schema, about which the
>> documentation says that
>>
>> - "Avro data is always serialized with its schema. Files that store
>> Avro data should always also include the schema for that data in the same
>> file."
>>
>> However, the Java code examples posted on the project website imply that
>> the Schema is supplied as a separate file and I am not sure whether this is
>> only the case with RPC.
>>
>> Are there any code examples for detecting the encoding format
>> (binary/json) of the data file, assessing whether there is a schema
>> embedded in it and extracting that schema?
>>
>> Thanks!
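
For the binary case, the magic-prefix check and the schema lookup mentioned above could be sketched roughly as follows. This is only an illustration in plain Python with no Avro library, assuming the object container file layout as I understand it from the spec: the four magic bytes, then a metadata map of string keys to byte values in which the schema sits under the key "avro.schema". The function names (read_long, read_container_metadata, embedded_schema) are made up for this example. It does not address JSON-encoded data, which is the open question here.

```python
import io
import json

# Assumption: container files start with the bytes 'O', 'b', 'j', 0x01.
AVRO_MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded integer)."""
    shift = 0
    result = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("truncated Avro header")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0, 1, 2, 3 -> 0, -1, 1, -2
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' and 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_container_metadata(stream):
    """Return the header metadata as a dict, or None if the magic is absent."""
    if stream.read(4) != AVRO_MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def embedded_schema(stream):
    """Return the embedded schema as parsed JSON, or None if not found."""
    meta = read_container_metadata(stream)
    if meta is None or "avro.schema" not in meta:
        return None
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

Usage would be something like `with open(path, "rb") as f: schema = embedded_schema(f)`, where a return value of None means the file does not look like a binary Avro container. Only the header is read, so this should be cheap enough to run across a large batch of files before deciding how to parse each one.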
