Yes, we have a Kafka event consumer that creates the files in HDFS. There are other, non-Hadoop consumers as well.
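To make that concrete, here is a rough sketch (simplified, not our production code) of what such a consumer does, using Avro's generic Java API. It assumes the message layout I describe below: a fixed 16-byte MD5 of the writer's schema followed by the binary Avro-encoded record, with no container-file framing. The SchemaRegistry interface is a stand-in for whatever lookup service you run; the container file written to HDFS is what supplies the header, embedded schema, and sync markers.

import java.io.OutputStream;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class HdfsSinkSketch {
    private static final int MD5_LEN = 16;

    // Hypothetical registry client: maps a schema's MD5 digest to the full schema.
    interface SchemaRegistry {
        Schema lookup(byte[] md5);
    }

    // Decode one raw Kafka message (md5 || binary Avro) and append the
    // record to an open Avro container file.
    static void handleMessage(byte[] msg, SchemaRegistry registry,
                              DataFileWriter<GenericRecord> out) throws Exception {
        byte[] md5 = Arrays.copyOfRange(msg, 0, MD5_LEN);
        Schema writerSchema = registry.lookup(md5);
        BinaryDecoder in = DecoderFactory.get()
                .binaryDecoder(msg, MD5_LEN, msg.length - MD5_LEN, null);
        GenericRecord record =
                new GenericDatumReader<GenericRecord>(writerSchema).read(null, in);
        out.append(record);  // the container file adds header, schema, sync markers
    }

    // hdfsStream would be an FSDataOutputStream obtained from Hadoop's
    // FileSystem API; DataFileWriter.create() writes the container header.
    static DataFileWriter<GenericRecord> openContainerFile(
            Schema schema, OutputStream hdfsStream) throws Exception {
        return new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(schema))
                .create(schema, hdfsStream);
    }
}

We keep one open container file per (topic, writer schema), so every record in a given file really does adhere to the schema embedded in its header.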
On Aug 21, 2013, at 2:23 PM, "Mark" <[email protected]> wrote:

> Some final questions.
>
> Since there is no need for the schema in each Kafka event, do you just output
> the message without the container file (file header, metadata, sync markers)?
> If so, how do you get this working with the Kafka Hadoop consumers? Doing it
> this way, does it require you to write your own consumer to write to Hadoop?
>
> Thanks
>
> On Aug 20, 2013, at 11:01 AM, Eric Wasserman <[email protected]> wrote:
>
>> You may want to check out this Avro feature request:
>> https://issues.apache.org/jira/browse/AVRO-1124
>> which has a lot of nice motivation and usage patterns. Unfortunately, it's
>> not yet resolved.
>>
>> There are really two broad use cases.
>>
>> 1) The data are "small" compared to the schema (perhaps because it's a
>> collection or stream of records encoded by different schemas).
>> 2) The data are "big" compared to the schema (very big records, or lots of
>> records that share a schema).
>>
>> Case (1) is often a candidate for a schema registry. Case (2), not as much.
>>
>> Examples from my own usage:
>>
>> For Kafka we include an MD5 digest of the writer's schema with each Message.
>> It is serialized as a concatenation of the fixed-length MD5 and the binary
>> Avro-encoded data. To decode, we read off the MD5, look up the schema, and
>> use it to decode the remainder of the Message.
>> [You could also segregate data written with different schemas into different
>> Kafka topics. By knowing which topic a message is under, you can then arrange
>> a way to look up the writer's schema. That lets you avoid even the cost of
>> including the MD5 in the Messages.]
>>
>> In either case, consumer code needs to look up the full schema from a
>> "registry" in order to actually decode the Avro-encoded data. The registry
>> serves the full schema that corresponds to the specified MD5 digest.
>>
>> We use a similar technique for storing MD5-tagged Avro data in "columns" of
>> Cassandra, and so on.
>>
>> Case (2) is pretty well handled by just embedding the full schema itself.
>>
>> For example, for Hadoop you can just use Avro data files, which include the
>> actual schema in a header. All the records in the file then adhere to that
>> same schema. In this case, using a registry to get the writer's schema is
>> not necessary.
>>
>> Note: As described in the feature request linked above, some people use a
>> schema registry as a way of coordinating schema evolution rather than just
>> as a way of making schema access "economical".
>>
>> On Aug 20, 2013, at 9:19 AM, Mark wrote:
>>
>>> Can someone break down how message serialization would work with Avro and a
>>> schema registry? We are planning to use Avro with Kafka, and I've read that
>>> instead of adding a schema to every single event it would be wise to add
>>> some sort of fingerprint to each message to identify which schema it should
>>> use. What I'm having trouble understanding is: how do we read the
>>> fingerprint without a schema? Don't we need the schema to deserialize?
>>> The same question goes for working with Hadoop: how does the input format
>>> know which schema to use?
>>>
>>> Thanks
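P.S. For completeness, the encode side of the MD5-prefix scheme is just a concatenation. A minimal sketch: note it hashes the schema's JSON text, so the producer and the registry must agree on exactly what gets hashed (a canonical form of the schema is safer in practice).

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class Md5PrefixEncoder {
    // Returns md5(writerSchema) || binaryAvro(record): a fixed 16-byte
    // digest followed by the record with no container-file framing.
    static byte[] encode(Schema writerSchema, GenericRecord record) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MessageDigest.getInstance("MD5")
                .digest(writerSchema.toString().getBytes("UTF-8")));
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(record, enc);
        enc.flush();
        return out.toByteArray();
    }
}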
