Yes, we have a Kafka event consumer that creates the files in HDFS. There are 
other non-Hadoop consumers as well. 

On Aug 21, 2013, at 2:23 PM, "Mark" <[email protected]> wrote:

> Some final questions.
> 
> Since there is no need for the schema in each Kafka event, do you just output 
> the message without the container file (file header, metadata, sync_markers)? 
> If so, how do you get this working with the Kafka Hadoop consumers? Doing it 
> this way, do you have to write your own consumer to write to Hadoop?
> 
> Thanks
> 
> On Aug 20, 2013, at 11:01 AM, Eric Wasserman <[email protected]> wrote:
> 
>> You may want to check out this Avro feature request: 
>> https://issues.apache.org/jira/browse/AVRO-1124
>> which has a lot of nice motivation and usage patterns. Unfortunately, it's 
>> not yet resolved.
>> 
>> There are really two broad use cases. 
>> 
>> 1) The data are "small" compared to the schema (perhaps because it's a 
>> collection or stream of records encoded by different schemas).
>> 2) The data are "big" compared to the schema (very big records, or lots of 
>> records that share a schema).
>> 
>> Case (1) is often a candidate for a schema registry; case (2), not as much.
>> 
>> Examples from my own usage:
>> 
>> For Kafka, we include an MD5 digest of the writer's schema with each Message. 
>> It is serialized as a concatenation of the fixed-length MD5 and the binary 
>> Avro-encoded data. To decode, we read off the MD5, look up the schema, and 
>> use it to decode the remainder of the Message.
>> [You could also segregate data written with different schemas into different 
>> Kafka topics. By knowing which topic a message is under, you can then arrange 
>> a way to look up the writer's schema. That lets you avoid even the cost of 
>> including the MD5 in the Messages.]
>> 
>> In either case, consumer code needs to look up the full schema from a 
>> "registry" in order to actually decode the Avro-encoded data. The registry 
>> serves the full schema that corresponds to the specified MD5 digest.
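>> 
>> A minimal sketch of that framing in Java, assuming Avro 1.7's 
>> SchemaNormalization to compute the digest and a hypothetical SchemaRegistry 
>> interface for the lookup (the names here are illustrative, not our actual 
>> code):
>> 
>> import java.io.ByteArrayOutputStream;
>> import java.io.IOException;
>> import java.security.NoSuchAlgorithmException;
>> import java.util.Arrays;
>> import org.apache.avro.Schema;
>> import org.apache.avro.SchemaNormalization;
>> import org.apache.avro.generic.GenericDatumReader;
>> import org.apache.avro.generic.GenericDatumWriter;
>> import org.apache.avro.generic.GenericRecord;
>> import org.apache.avro.io.BinaryDecoder;
>> import org.apache.avro.io.BinaryEncoder;
>> import org.apache.avro.io.DecoderFactory;
>> import org.apache.avro.io.EncoderFactory;
>> 
>> public class Md5FramedCodec {
>>     // Hypothetical registry client: maps a 16-byte MD5 digest to a full schema.
>>     interface SchemaRegistry { Schema lookup(byte[] md5); }
>> 
>>     // Encode: the fixed-length (16-byte) MD5 of the writer's schema,
>>     // followed by the binary Avro-encoded record.
>>     static byte[] encode(GenericRecord record, Schema writerSchema)
>>             throws IOException, NoSuchAlgorithmException {
>>         ByteArrayOutputStream out = new ByteArrayOutputStream();
>>         out.write(SchemaNormalization.parsingFingerprint("MD5", writerSchema));
>>         BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
>>         new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
>>         encoder.flush();
>>         return out.toByteArray();
>>     }
>> 
>>     // Decode: read off the MD5, look up the writer's schema in the
>>     // registry, and decode the remainder of the message with it.
>>     static GenericRecord decode(byte[] message, SchemaRegistry registry)
>>             throws IOException {
>>         Schema writerSchema = registry.lookup(Arrays.copyOfRange(message, 0, 16));
>>         BinaryDecoder decoder = DecoderFactory.get()
>>                 .binaryDecoder(message, 16, message.length - 16, null);
>>         return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
>>     }
>> }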
>> 
>> We use a similar technique for storing MD5-tagged Avro data in "columns" of 
>> Cassandra and so on.
>> 
>> Case (2) is pretty well handled by just embedding the full schema itself.
>> 
>> For example, for Hadoop you can just use Avro data files, which include the 
>> actual schema in a header. All the records in the file then adhere to that 
>> same schema. In this case, using a registry to get the writer's schema is not 
>> necessary.
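>> 
>> For instance, a bare-bones sketch using Avro's DataFileWriter and 
>> DataFileReader (the schema and file name here are made up for illustration):
>> 
>> import java.io.File;
>> import org.apache.avro.Schema;
>> import org.apache.avro.file.DataFileReader;
>> import org.apache.avro.file.DataFileWriter;
>> import org.apache.avro.generic.GenericData;
>> import org.apache.avro.generic.GenericDatumReader;
>> import org.apache.avro.generic.GenericDatumWriter;
>> import org.apache.avro.generic.GenericRecord;
>> 
>> public class DataFileExample {
>>     public static void main(String[] args) throws Exception {
>>         Schema schema = new Schema.Parser().parse(
>>             "{\"type\":\"record\",\"name\":\"Event\"," +
>>             "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
>>         File file = new File("events.avro");
>> 
>>         // Write: the full schema is stored once, in the file header.
>>         DataFileWriter<GenericRecord> writer =
>>             new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
>>         writer.create(schema, file);
>>         GenericRecord rec = new GenericData.Record(schema);
>>         rec.put("id", 1L);
>>         writer.append(rec);
>>         writer.close();
>> 
>>         // Read: no schema supplied up front; it is recovered from the header.
>>         DataFileReader<GenericRecord> reader =
>>             new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
>>         for (GenericRecord r : reader) System.out.println(r);
>>         reader.close();
>>     }
>> }
>> 
>> The same thing happens behind Avro's Hadoop input format, which is how the 
>> input format "knows" the schema: it reads it from the file header.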
>> 
>> Note: As described in the feature request linked above, some people use a 
>> schema registry as a way of coordinating schema evolution rather than just 
>> as a way of making schema access "economical".
>> 
>> 
>> 
>> On Aug 20, 2013, at 9:19 AM, Mark wrote:
>> 
>>> Can someone break down how message serialization would work with Avro and a 
>>> schema registry? We are planning to use Avro with Kafka, and I've read that 
>>> instead of adding a schema to every single event it would be wise to add 
>>> some sort of fingerprint to each message to identify which schema should be 
>>> used. What I'm having trouble understanding is: how do we read the 
>>> fingerprint without a schema? Don't we need the schema to deserialize? The 
>>> same question goes for working with Hadoop: how does the input format know 
>>> which schema to use?
>>> 
>>> Thanks
>> 
>> 
> 
> 
