Thanks for the help! I am trying to find sample Avro files and it turns out to be surprisingly difficult (at least via the Google searches I tried).
Would you know of any such files (preferably large-ish) available as open source?

On Fri, Jan 18, 2013 at 6:53 AM, Terry Healy <[email protected]> wrote:

> Check out avro-tools. With this you can dump the schema for a file,
> extract the metadata, or export it in several formats:
>
> ----------------
> Available tools:
>       compile  Generates Java code for the given schema.
>    fragtojson  Renders a binary-encoded Avro datum as JSON.
>      fromjson  Reads JSON records and writes an Avro data file.
>      fromtext  Imports a text file into an avro data file.
>       getmeta  Prints out the metadata of an Avro data file.
>     getschema  Prints out schema of an Avro data file.
>           idl  Generates a JSON schema from an Avro IDL file
>        induce  Induce schema/protocol from Java class/interface via
>                reflection.
>    jsontofrag  Renders a JSON-encoded Avro datum as binary.
>       recodec  Alters the codec of a data file.
>    rpcreceive  Opens an RPC Server and listens for one message.
>       rpcsend  Sends a single RPC message.
>        tether  Run a tethered mapreduce job.
>        tojson  Dumps an Avro data file as JSON, one record per line.
>        totext  Converts an Avro data file to a text file.
>   trevni_meta  Dumps a Trevni file's metadata as JSON.
> trevni_random  Create a Trevni file filled with random instances of a
>                schema.
> trevni_tojson  Dumps a Trevni file as JSON.
>
> -Terry
>
> On 01/17/2013 05:11 PM, Public Network Services wrote:
> > Folks,
> >
> > I am involved in a project to extract data from a large number of files
> > (to be provided at some point), in numerous formats, among which is some
> > Avro files (both binary and JSON-encoded), and thus I am looking for the
> > best way to tackle this.
> >
> > One of the things we would (ideally) like to do is auto-classify the
> > data generically, i.e. read a few lines or bytes off a file and be able
> > to tell what kind of format it is.
> >
> > This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
> > sure how this would be done for Avro.
> >
> > For one thing, there is the necessity of a Schema, about which the
> > documentation says that
> >
> > * "Avro data is always serialized with its schema. Files that store
> > Avro data should always also include the schema for that data in the
> > same file."
> >
> > However, the Java code examples posted on the project website imply that
> > the Schema is supplied as a separate file and I am not sure whether this
> > is only the case with RPC.
> >
> > Are there any code examples for detecting the encoding format
> > (binary/json) of the data file, assessing whether there is a schema
> > embedded in it and extracting that schema?
> >
> > Thanks!
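
For the detection question in the original post, here is a rough sketch of how the check could look without any Avro library. It relies on the layout the Avro specification gives for object container files: the 4 bytes "Obj" plus 0x01, then a metadata map (string keys, bytes values) that normally carries "avro.schema" and "avro.codec", then a 16-byte sync marker. The function names are my own, not part of any Avro API, and the code is written in Python only to keep it short. A bare binary datum or JSON-encoded datum written outside a container file has no magic bytes and no embedded schema, so this check cannot recognise those.

```python
import io
import json

# Magic bytes that open every Avro object container file (per the Avro spec).
AVRO_MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long: a little-endian base-128 varint, zig-zag encoded."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data while reading a long")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo the zig-zag mapping


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' and 'string' share this form)."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length in header")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data while reading bytes")
    return data


def read_container_metadata(stream):
    """Return the header metadata as a dict, or None if the magic bytes are absent."""
    if stream.read(4) != AVRO_MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def embedded_schema(path):
    """Return the parsed embedded schema, or None if there is none (hypothetical helper)."""
    with open(path, "rb") as f:
        meta = read_container_metadata(f)
    if not meta or "avro.schema" not in meta:
        return None
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

Calling embedded_schema("some_file.avro") (the file name is just a placeholder) would return the schema as a Python dict or list for a container file and None for anything else, so only the first few hundred bytes of each file need to be read for classification.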
